A Detailed Walkthrough of the Gini Index Calculation for Decision Trees
I recently came across an article on computing the Gini index for decision trees that is very detailed and complete. Source: https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/ — saving a record of it here.

An algorithm can be transparent only if its decisions can be read and understood by people clearly. Even though deep learning is the superstar of machine learning nowadays, it is an opaque algorithm and we do not know the reasons behind its decisions. Decision tree algorithms, by contrast, remain popular because they produce transparent decisions. ID3 uses information gain for splitting, whereas C4.5 uses gain ratio. CART is an alternative decision tree building algorithm that can handle both classification and regression tasks. For classification tasks, it uses a metric called the Gini index to create decision points. We will walk through a step-by-step CART decision tree example by hand, from scratch.

We will work on the same dataset as in the ID3 example: 14 instances of golf-playing decisions based on the outlook, temperature, humidity and wind factors.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|----------|-------|----------|--------|-----|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

Gini index

The Gini index is the impurity metric CART uses for classification tasks: one minus the sum of the squared probabilities of each class.

Gini = 1 − Σ (Pi)², for i = 1 to the number of classes

Outlook

Outlook is a nominal feature. It can be Sunny, Overcast or Rain. Let's summarize the final decisions for the outlook feature.
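The formula above is easy to sketch in code. Below is a minimal illustration in Python; the `gini` helper and the per-value decision lists are my own names for this demo, not part of the article:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

# Decisions within each outlook value, taken from the dataset above.
sunny    = ["No", "No", "No", "Yes", "Yes"]   # 2 yes / 3 no
overcast = ["Yes", "Yes", "Yes", "Yes"]       # 4 yes / 0 no
rain     = ["Yes", "Yes", "No", "Yes", "No"]  # 3 yes / 2 no

print(round(gini(sunny), 2))     # 0.48
print(gini(overcast))            # 0.0
print(round(gini(rain), 2))      # 0.48
```

A pure node (overcast) scores 0, the minimum possible impurity; a 50/50 split would score 0.5 for two classes.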
| Outlook | Yes | No | Number of instances |
|----------|-----|----|----|
| Sunny | 2 | 3 | 5 |
| Overcast | 4 | 0 | 4 |
| Rain | 3 | 2 | 5 |

Gini(Outlook=Sunny) = 1 − (2/5)² − (3/5)² = 1 − 0.16 − 0.36 = 0.48
Gini(Outlook=Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Outlook=Rain) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48

Then we calculate the weighted sum of the Gini indexes for the outlook feature:

Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342

Temperature

Similarly, temperature is a nominal feature; it can take 3 different values: Hot, Cool and Mild. Let's summarize the decisions for the temperature feature.

| Temperature | Yes | No | Number of instances |
|------|-----|----|----|
| Hot | 2 | 2 | 4 |
| Cool | 3 | 1 | 4 |
| Mild | 4 | 2 | 6 |

Gini(Temp=Hot) = 1 − (2/4)² − (2/4)² = 0.5
Gini(Temp=Cool) = 1 − (3/4)² − (1/4)² = 1 − 0.5625 − 0.0625 = 0.375
Gini(Temp=Mild) = 1 − (4/6)² − (2/6)² = 1 − 0.444 − 0.111 = 0.445

The weighted sum of the Gini indexes for the temperature feature:

Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439

Humidity

Humidity is a binary feature. It can be High or Normal.

| Humidity | Yes | No | Number of instances |
|------|-----|----|----|
| High | 3 | 4 | 7 |
| Normal | 6 | 1 | 7 |

Gini(Humidity=High) = 1 − (3/7)² − (4/7)² = 1 − 0.183 − 0.326 = 0.489
Gini(Humidity=Normal) = 1 − (6/7)² − (1/7)² = 1 − 0.734 − 0.020 = 0.244

The weighted sum for the humidity feature:

Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367

Wind

Wind is a binary feature like humidity. It can be Weak or Strong.

| Wind | Yes | No | Number of instances |
|------|-----|----|----|
| Weak | 6 | 2 | 8 |
| Strong | 3 | 3 | 6 |

Gini(Wind=Weak) = 1 − (6/8)² − (2/8)² = 1 − 0.5625 − 0.0625 = 0.375
Gini(Wind=Strong) = 1 − (3/6)² − (3/6)² = 1 − 0.25 − 0.25 = 0.5

Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428

Time to decide

We have calculated the Gini index for every feature. The winner is the outlook feature because its cost is the lowest.

| Feature | Gini index |
|---------|-----|
| Outlook | 0.342 |
| Temperature | 0.439 |
| Humidity | 0.367 |
| Wind | 0.428 |

We'll put the outlook decision at the top of the tree.
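The four weighted sums can be checked programmatically. This is a minimal sketch under my own naming (`weighted_gini`, the tuple encoding of the dataset); note the results differ from the hand calculation in the third decimal place only because the article rounds intermediate values:

```python
from collections import Counter, defaultdict

# The 14-instance golf dataset: (Outlook, Temp, Humidity, Wind, Decision)
data = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
features = ["Outlook", "Temp", "Humidity", "Wind"]

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, i):
    """Group the decisions by the i-th feature's value, then weight each group's Gini."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[i]].append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

scores = {f: round(weighted_gini(data, i), 3) for i, f in enumerate(features)}
print(scores)                       # {'Outlook': 0.343, 'Temp': 0.44, 'Humidity': 0.367, 'Wind': 0.429}
print(min(scores, key=scores.get))  # Outlook — the lowest weighted Gini wins
```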
First decision: the outlook feature

You might notice that the sub-dataset in the overcast leaf contains only "yes" decisions. This means the overcast branch is finished.

The tree is complete for the overcast outlook leaf.

We will apply the same principles to the remaining sub-datasets in the following steps. Focus on the sub-dataset for sunny outlook first. We need to find the Gini index scores for the temperature, humidity and wind features, respectively.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|-------|------|--------|--------|-----|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |

Gini of temperature for sunny outlook

| Temperature | Yes | No | Number of instances |
|------|-----|----|----|
| Hot | 0 | 2 | 2 |
| Cool | 1 | 0 | 1 |
| Mild | 1 | 1 | 2 |

Gini(Outlook=Sunny and Temp.=Hot) = 1 − (0/2)² − (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 − (1/1)² − (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 − (1/2)² − (1/2)² = 1 − 0.25 − 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5) × 0 + (1/5) × 0 + (2/5) × 0.5 = 0.2

Gini of humidity for sunny outlook

| Humidity | Yes | No | Number of instances |
|------|-----|----|----|
| High | 0 | 3 | 3 |
| Normal | 2 | 0 | 2 |

Gini(Outlook=Sunny and Humidity=High) = 1 − (0/3)² − (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 − (2/2)² − (0/2)² = 0

Gini(Outlook=Sunny and Humidity) = (3/5) × 0 + (2/5) × 0 = 0

Gini of wind for sunny outlook

| Wind | Yes | No | Number of instances |
|------|-----|----|----|
| Weak | 1 | 2 | 3 |
| Strong | 1 | 1 | 2 |

Gini(Outlook=Sunny and Wind=Weak) = 1 − (1/3)² − (2/3)² = 1 − 0.111 − 0.444 = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 − (1/2)² − (1/2)² = 0.5

Gini(Outlook=Sunny and Wind) = (3/5) × 0.444 + (2/5) × 0.5 = 0.266 + 0.2 = 0.466

Decision for sunny outlook

We have calculated the Gini index scores for each feature when outlook is sunny. The winner is humidity because it has the lowest value.

| Feature | Gini index |
|---------|-----|
| Temperature | 0.2 |
| Humidity | 0 |
| Wind | 0.466 |

We'll put a humidity check under the sunny outlook branch.

Sub-datasets for high and normal humidity

As seen, the decision is always "no" for high humidity with sunny outlook, and always "yes" for normal humidity with sunny outlook. This branch is finished.
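The sunny-branch scores can be verified the same way. A small sketch under my own naming (the article's 0.466 versus 0.467 here is again just intermediate rounding):

```python
from collections import Counter, defaultdict

# The five sunny-outlook rows: (Temp, Humidity, Wind, Decision)
sunny = [
    ("Hot", "High", "Weak", "No"),    ("Hot", "High", "Strong", "No"),
    ("Mild", "High", "Weak", "No"),   ("Cool", "Normal", "Weak", "Yes"),
    ("Mild", "Normal", "Strong", "Yes"),
]

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, i):
    groups = defaultdict(list)
    for row in rows:
        groups[row[i]].append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

for i, name in enumerate(["Temp", "Humidity", "Wind"]):
    print(name, round(weighted_gini(sunny, i), 3))
# Temp 0.2 / Humidity 0.0 / Wind 0.467 — humidity wins with a perfect split
```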
Decisions for high and normal humidity

Now we need to focus on the rain outlook.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|------|------|--------|--------|-----|
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

We'll calculate the Gini index scores for the temperature, humidity and wind features when outlook is rain.

Gini of temperature for rain outlook

| Temperature | Yes | No | Number of instances |
|------|-----|----|----|
| Cool | 1 | 1 | 2 |
| Mild | 2 | 1 | 3 |

Gini(Outlook=Rain and Temp.=Cool) = 1 − (1/2)² − (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 − (2/3)² − (1/3)² = 0.444

Gini(Outlook=Rain and Temp.) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466

Gini of humidity for rain outlook

| Humidity | Yes | No | Number of instances |
|------|-----|----|----|
| High | 1 | 1 | 2 |
| Normal | 2 | 1 | 3 |

Gini(Outlook=Rain and Humidity=High) = 1 − (1/2)² − (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 − (2/3)² − (1/3)² = 0.444

Gini(Outlook=Rain and Humidity) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466

Gini of wind for rain outlook

| Wind | Yes | No | Number of instances |
|------|-----|----|----|
| Weak | 3 | 0 | 3 |
| Strong | 0 | 2 | 2 |

Gini(Outlook=Rain and Wind=Weak) = 1 − (3/3)² − (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 − (0/2)² − (2/2)² = 0

Gini(Outlook=Rain and Wind) = (3/5) × 0 + (2/5) × 0 = 0

Decision for rain outlook

The winner is the wind feature for the rain outlook because it has the minimum Gini index score among the features.

| Feature | Gini index |
|---------|-----|
| Temperature | 0.466 |
| Humidity | 0.466 |
| Wind | 0 |

Put the wind feature on the rain outlook branch and look at the new sub-datasets.

Sub-datasets for weak and strong wind with rain outlook

As seen, the decision is always "yes" when the wind is weak, and always "no" when the wind is strong. This means the branch is finished.

Final form of the decision tree built by the CART algorithm

So, the decision tree is complete: we have built it by hand. By the way, you might notice that we created exactly the same tree as in the ID3 example. This does not mean that ID3 and CART always produce the same trees; we were just lucky here. Finally, I believe CART is simpler than ID3 and C4.5, don't you think?
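The whole walkthrough follows one greedy recipe: pick the feature with the lowest weighted Gini index, split on it, and recurse until every leaf is pure. That recipe can be sketched as a small recursive builder. This is my own minimal sketch, handling categorical features only, with no pruning and no numeric splits; all names (`build_tree`, `weighted_gini`, etc.) are mine, not a reference CART implementation:

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, feature):
    """Weighted sum of Gini impurities over the groups a feature induces."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r["Decision"])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def build_tree(rows, features):
    labels = [r["Decision"] for r in rows]
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Split on the feature with the lowest weighted Gini index, then recurse.
    best = min(features, key=lambda f: weighted_gini(rows, f))
    remaining = [f for f in features if f != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], remaining)
                   for v in sorted({r[best] for r in rows})}}

cols = ["Outlook", "Temp", "Humidity", "Wind", "Decision"]
raw = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
rows = [dict(zip(cols, r)) for r in raw]

tree = build_tree(rows, cols[:-1])
# The root splits on Outlook; Overcast is a pure "Yes" leaf,
# Sunny splits on Humidity, and Rain splits on Wind — the same tree built by hand above.
print(tree)
```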