
Selected Exercises from Chapter 7 of the Watermelon Book (*Machine Learning*)


Contents

- Exercise 7.1
- Exercise 7.3 (rough version, improved version)
- Exercise 7.4
- Exercise 7.7
- Acknowledgements

Exercise 7.1

Use maximum likelihood estimation to estimate the class-conditional probabilities of the first three attributes in watermelon dataset 3.0.

For the first attribute, 色泽 (color), counting in Table 4.3 (watermelon dataset 3.0) gives the following; the other two attributes, 根蒂 and 敲声, are handled the same way:

- Good melons (好瓜): 3 青绿 (green), 4 乌黑 (dark), 1 浅白 (light)
- Bad melons (坏瓜): 3 青绿, 2 乌黑, 4 浅白

Let $P(\text{青绿} \mid \text{好瓜}) = \zeta_1$, $P(\text{乌黑} \mid \text{好瓜}) = \zeta_2$, and $P(\text{浅白} \mid \text{好瓜}) = 1 - \zeta_1 - \zeta_2$. By Eq. (7.9),

$$L(D_{\text{好瓜}} \mid \theta_{\text{好瓜}}) = \prod_{\boldsymbol{x} \in D_{\text{好瓜}}} P(\boldsymbol{x} \mid \theta_{\text{好瓜}}) = \zeta_1^{3} \cdot \zeta_2^{4} \cdot (1 - \zeta_1 - \zeta_2)^{1},$$

so the log-likelihood is

$$LL(\theta_{\text{好瓜}}) = \log P(D_{\text{好瓜}} \mid \theta_{\text{好瓜}}) = 3\log \zeta_1 + 4\log \zeta_2 + \log(1 - \zeta_1 - \zeta_2).$$

Setting the partial derivatives with respect to $\zeta_1$ and $\zeta_2$ to zero yields $\zeta_1 = 3/8$ and $\zeta_2 = 4/8$, and hence $P(\text{浅白} \mid \text{好瓜}) = 1 - \zeta_1 - \zeta_2 = 1/8$.
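Spelling out the omitted step, the two stationarity conditions are

$$\frac{\partial LL}{\partial \zeta_1} = \frac{3}{\zeta_1} - \frac{1}{1 - \zeta_1 - \zeta_2} = 0, \qquad \frac{\partial LL}{\partial \zeta_2} = \frac{4}{\zeta_2} - \frac{1}{1 - \zeta_1 - \zeta_2} = 0.$$

Writing $t = 1 - \zeta_1 - \zeta_2$, the first equation gives $\zeta_1 = 3t$ and the second $\zeta_2 = 4t$; substituting into $\zeta_1 + \zeta_2 + t = 1$ yields $8t = 1$, hence $t = 1/8$, $\zeta_1 = 3/8$, $\zeta_2 = 4/8$.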

Similarly, let $P(\text{青绿} \mid \text{坏瓜}) = \eta_1$, $P(\text{乌黑} \mid \text{坏瓜}) = \eta_2$, and $P(\text{浅白} \mid \text{坏瓜}) = 1 - \eta_1 - \eta_2$. By Eq. (7.9),

$$L(D_{\text{坏瓜}} \mid \theta_{\text{坏瓜}}) = \prod_{\boldsymbol{x} \in D_{\text{坏瓜}}} P(\boldsymbol{x} \mid \theta_{\text{坏瓜}}) = \eta_1^{3} \cdot \eta_2^{2} \cdot (1 - \eta_1 - \eta_2)^{4},$$

so

$$LL(\theta_{\text{坏瓜}}) = \log P(D_{\text{坏瓜}} \mid \theta_{\text{坏瓜}}) = 3\log \eta_1 + 2\log \eta_2 + 4\log(1 - \eta_1 - \eta_2).$$

Solving the stationarity conditions as above gives $\eta_1 = 3/9$, $\eta_2 = 2/9$, and $P(\text{浅白} \mid \text{坏瓜}) = 4/9$.

These results agree with direct frequency counting, i.e. Eq. (7.17): the class-conditional probability can be estimated as

$$P(x_i \mid c) = \frac{|D_{c, x_i}|}{|D_c|}.$$
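As a sanity check, here is a minimal sketch (the variable names are mine) that recovers the same class-conditional frequencies from the 色泽 column of the dataset:

```python
from collections import Counter

# 色泽 (color) values from watermelon dataset 3.0 (Table 4.3), split by label.
color_good = ['青绿', '乌黑', '乌黑', '青绿', '浅白', '青绿', '乌黑', '乌黑']
color_bad = ['乌黑', '青绿', '浅白', '浅白', '青绿', '浅白', '乌黑', '浅白', '青绿']

for name, values in [('好瓜', color_good), ('坏瓜', color_bad)]:
    freqs = {v: c / len(values) for v, c in Counter(values).items()}
    print(name, freqs)
# 好瓜: 青绿 3/8, 乌黑 4/8, 浅白 1/8;  坏瓜: 青绿 3/9, 乌黑 2/9, 浅白 4/9
```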

Exercise 7.3

Write a program implementing a naive Bayes classifier with Laplacian correction, train it on watermelon dataset 3.0, and use it to classify the sample 测1 on p.151.
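For reference, the Laplacian-corrected estimates implemented below are Eqs. (7.19) and (7.20) of the book:

$$\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}, \qquad \hat{P}(x_i \mid c) = \frac{|D_{c, x_i}| + 1}{|D_c| + N_i},$$

where $N$ is the number of classes and $N_i$ is the number of possible values of the $i$-th attribute; continuous attributes are modeled with a per-class Gaussian density as in Eq. (7.18).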

Rough version

```python
import math

import numpy as np

# Watermelon dataset 3.0 (Table 4.3): six discrete attributes, two continuous
# attributes (密度, 含糖率), and the label 好瓜 ('是'/'否') in the last column.
data_ = [
    ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.697, 0.460, '是'],
    ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.774, 0.376, '是'],
    ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.634, 0.264, '是'],
    ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.608, 0.318, '是'],
    ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.556, 0.215, '是'],
    ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.403, 0.237, '是'],
    ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', 0.481, 0.149, '是'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', 0.437, 0.211, '是'],
    ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', 0.666, 0.091, '否'],
    ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', 0.243, 0.267, '否'],
    ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', 0.245, 0.057, '否'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', 0.343, 0.099, '否'],
    ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', 0.639, 0.161, '否'],
    ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', 0.657, 0.198, '否'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.360, 0.370, '否'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', 0.593, 0.042, '否'],
    ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', 0.719, 0.103, '否'],
]
is_discrete = [True] * 6 + [False] * 2

# Possible values of each attribute column.
set_list = [set() for _ in range(8)]
for d in data_:
    for i in range(8):
        set_list[i].add(d[i])
features_list = [list(s) for s in set_list]

data = np.mat(data_)  # note: np.mat casts every entry to a string
labels = np.unique(data[:, -1].A)  # sorted, so labels[0] == '否', labels[1] == '是'
cnt_labels = [0] * len(labels)
for i in range(data.shape[0]):
    if data[i, -1] == labels[0]:
        cnt_labels[0] += 1
    elif data[i, -1] == labels[1]:
        cnt_labels[1] += 1


def train_discrete(data, labels, cnt_labels, features_list, xi):
    """Laplacian-corrected P(x_i | c): counts start at 1, denominator adds N_i."""
    prob = np.ones([len(labels), np.unique(data[:, xi].A).shape[0]])
    for i in range(data.shape[0]):
        tmp = features_list[xi].index(data[i, xi])
        if data[i, -1] == labels[0]:
            prob[0, tmp] += 1
        elif data[i, -1] == labels[1]:
            prob[1, tmp] += 1
    for i in range(len(labels)):
        prob[i] = prob[i] / (cnt_labels[i] + len(features_list[xi]))
    return prob


def train_continuous(data, labels, xi):
    """Per-class mean and variance (np.var, i.e. ddof=0) of a continuous attribute."""
    vec0, vec1 = [], []
    for i in range(data.shape[0]):
        if data[i, -1] == labels[0]:
            vec0.append(data[i, xi])
        elif data[i, -1] == labels[1]:
            vec1.append(data[i, xi])
    vec0 = np.array(vec0).astype(float)
    vec1 = np.array(vec1).astype(float)
    u0, u1 = np.mean(vec0), np.mean(vec1)
    s0, s1 = np.var(vec0), np.var(vec1)
    return np.mat([[u0, s0], [u1, s1]])


param = []
for i in range(8):
    if is_discrete[i]:
        param.append(train_discrete(data, labels, cnt_labels, features_list, i))
    else:
        param.append(train_continuous(data, labels, i))


def predict(d):
    """Class scores (p_否, p_是): Laplacian-corrected prior times likelihoods."""
    p0 = (cnt_labels[0] + 1) / (len(data_) + 2)
    p1 = (cnt_labels[1] + 1) / (len(data_) + 2)
    for i in range(len(d) - 1):
        if is_discrete[i]:
            ind = features_list[i].index(d[i])
            p0 *= param[i][0, ind]
            p1 *= param[i][1, ind]
        else:  # Gaussian density, Eq. (7.18)
            p0 *= 1 / math.sqrt(2 * math.pi * param[i][0, 1]) * math.exp(
                -(d[i] - param[i][0, 0]) ** 2 / (2 * param[i][0, 1]))
            p1 *= 1 / math.sqrt(2 * math.pi * param[i][1, 1]) * math.exp(
                -(d[i] - param[i][1, 0]) ** 2 / (2 * param[i][1, 1]))
    return p0, p1


# '测1' has the same attribute values as the first training sample.
p0, p1 = predict(data_[0])
print(p0, p1)
print(labels[0] if p0 > p1 else labels[1])
print()

# Training accuracy (1 minus the training error).
err = 0
for d in data_:
    p0, p1 = predict(d)
    plabel = labels[0] if p0 > p1 else labels[1]
    if plabel != d[-1]:
        err += 1
print(1 - err / len(data_))
```

The training error is 0.1765 (the final print statement reports the training accuracy, $1 - 3/17 \approx 0.8235$). The result computed here for 测1 agrees with the one given in the linked reference. One caveat: numpy's and pandas' var functions differ. pandas divides by $N - 1$ by default (its ddof parameter defaults to 1), whereas numpy divides by $N$ (ddof=0); with var(ddof=0), the pandas-based code in the reference yields the same log-likelihoods as the code above.
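A minimal illustration of the difference (the three sample values are arbitrary):

```python
import numpy as np
import pandas as pd

vals = [0.697, 0.774, 0.634]        # any small sample works for the demo
print(np.var(vals))                 # NumPy default ddof=0: divides by N
print(pd.Series(vals).var())        # pandas default ddof=1: divides by N - 1
print(pd.Series(vals).var(ddof=0))  # matches np.var(vals)
```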

Improved version

```python
import json
from collections import namedtuple

import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

columns_ = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '密度', '含糖率', '好瓜']
data_ = [
    ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.697, 0.460, '是'],
    ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.774, 0.376, '是'],
    ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.634, 0.264, '是'],
    ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.608, 0.318, '是'],
    ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.556, 0.215, '是'],
    ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.403, 0.237, '是'],
    ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', 0.481, 0.149, '是'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', 0.437, 0.211, '是'],
    ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', 0.666, 0.091, '否'],
    ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', 0.243, 0.267, '否'],
    ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', 0.245, 0.057, '否'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', 0.343, 0.099, '否'],
    ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', 0.639, 0.161, '否'],
    ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', 0.657, 0.198, '否'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.360, 0.370, '否'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', 0.593, 0.042, '否'],
    ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', 0.719, 0.103, '否'],
]
labels = ['是', '否']


def Train_nb(x, y, labels, columns):
    p_size = len(x)
    p_labels = []
    x_c = []
    for label in labels:
        tx_c = x[y == label]  # D_c: the samples whose class is label
        x_c.append(tx_c)
        p_labels.append(len(tx_c))
    p_xi_cs = []
    PItem = namedtuple("PItem", ['is_continuous', 'data', 'n_i'])
    for i in range(len(x_c)):
        d_c = x_c[i]  # d_c is D_c in the book's notation
        p_xi_c = []
        for column in columns[:-1]:  # all attributes except the label column 好瓜
            d_c_col = d_c.loc[:, column]  # values of this attribute within D_c
            if type_of_target(d_c_col) == 'continuous':  # continuous attribute
                imean = np.mean(d_c_col)
                ivar = np.var(d_c_col)  # ddof=0, consistent with the rough version
                p_xi_c.append(PItem(True, [imean, ivar], None))
            else:
                n_i = len(pd.unique(x.loc[:, column]))  # number of possible values N_i
                # per-value counts of this attribute within D_c, serialized as JSON
                p_xi_c.append(PItem(False, d_c_col.value_counts().to_json(), n_i))
        p_xi_cs.append(p_xi_c)
    return p_size, p_labels, p_xi_cs


def Predict_nb(sample, labels, p_size, p_labels, p_xi_cs):
    res = None
    p_best = 0
    for i in range(len(labels)):
        # log prior with Laplacian correction, Eq. (7.19)
        p_tmp = np.log((p_labels[i] + 1) / (p_size + len(labels)))
        p_xi_c = p_xi_cs[i]
        for j in range(len(sample)):
            pitem = p_xi_c[j]
            if not pitem.is_continuous:
                jdata = json.loads(pitem.data)
                if sample[j] in jdata:
                    # Laplacian-corrected P(x_i | c), Eq. (7.20)
                    p_tmp += np.log((jdata[sample[j]] + 1) / (p_labels[i] + pitem.n_i))
                else:  # value unseen in D_c: zero count, correction keeps it nonzero
                    p_tmp += np.log(1 / (p_labels[i] + pitem.n_i))
            else:
                # Gaussian log-density, Eq. (7.18)
                [imean, ivar] = pitem.data
                p_tmp += np.log(1 / np.sqrt(2 * np.pi * ivar)
                                * np.exp(-(sample[j] - imean) ** 2 / (2 * ivar)))
        if i == 0 or p_tmp > p_best:
            res = labels[i]
            p_best = p_tmp
    return res
```
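A minimal usage sketch (the names `df` and `sample` are mine, not part of the original code): build a DataFrame from the dataset, train, then classify 测1, which shares its attribute values with the first row:

```python
df = pd.DataFrame(data_, columns=columns_)
p_size, p_labels, p_xi_cs = Train_nb(df, df['好瓜'], labels, columns_)

sample = data_[0][:-1]  # '测1' from p.151: same attribute values as row 1
print(Predict_nb(sample, labels, p_size, p_labels, p_xi_cs))  # expected: '是'
```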

