Implementing a decision tree classification algorithm in Python: code example

Posted on 2022-06-11 by WalkonNet

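Preliminaries

1. Decision trees

A decision tree is a widely used classification algorithm and a form of supervised learning: we are given a batch of samples, each with a set of attributes and a class label. By learning from these samples, the algorithm builds a decision tree that can then assign a suitable class to new data.

2. Sample data

Suppose we have 14 users. Their personal attributes and whether they bought a certain product are listed below: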

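ID	Age	Income range	Job type	Credit rating	Purchase decision
01	<30	High	Unstable	Poor	No
02	<30	High	Unstable	Good	No
03	30-40	High	Unstable	Poor	Yes
04	>40	Medium	Unstable	Poor	Yes
05	>40	Low	Stable	Poor	Yes
06	>40	Low	Stable	Good	No
07	30-40	Low	Stable	Good	Yes
08	<30	Medium	Unstable	Poor	No
09	<30	Low	Stable	Poor	Yes
10	>40	Medium	Stable	Poor	Yes
11	<30	Medium	Stable	Good	Yes
12	30-40	Medium	Unstable	Good	Yes
13	30-40	High	Stable	Poor	Yes
14	>40	Medium	Unstable	Good	No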

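Decision tree classification algorithm

1. Building the dataset

For easier processing, the sample data is converted into numeric lists using the following rules:

Age: <30 is 0; 30-40 is 1; >40 is 2

Income: low is 0; medium is 1; high is 2

Job type: unstable is 0; stable is 1

Credit rating: poor is 0; good is 1

The purchase decision is kept as a label: 'N' for no, 'Y' for yes.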
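The conversion rules above can be sketched as a set of lookup tables. This is only an illustration: the dictionary names, the English keys, and the helper encode_record are assumptions made here and do not appear in the article's code.

```python
# Hypothetical lookup tables mirroring the conversion rules above.
AGE = {'<30': 0, '30-40': 1, '>40': 2}
INCOME = {'low': 0, 'medium': 1, 'high': 2}
JOB = {'unstable': 0, 'stable': 1}
CREDIT = {'poor': 0, 'good': 1}
DECISION = {'no': 'N', 'yes': 'Y'}


def encode_record(age, income, job, credit, decision):
    """Turn one row of the raw table into the list format used by the dataset."""
    return [AGE[age], INCOME[income], JOB[job], CREDIT[credit], DECISION[decision]]


# Users 01 and 06 from the table
print(encode_record('<30', 'high', 'unstable', 'poor', 'no'))
print(encode_record('>40', 'low', 'stable', 'good', 'no'))
```

Encoding every row this way should reproduce the hand-written lists in createdataset().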

# Create the dataset: columns are age, income, job, credit, and the class label
def createdataset():
    dataSet=[[0,2,0,0,'N'],
            [0,2,0,1,'N'],
            [1,2,0,0,'Y'],
            [2,1,0,0,'Y'],
            [2,0,1,0,'Y'],
            [2,0,1,1,'N'],
            [1,0,1,1,'Y'],
            [0,1,0,0,'N'],
            [0,0,1,0,'Y'],
            [2,1,1,0,'Y'],
            [0,1,1,1,'Y'],
            [1,1,0,1,'Y'],
            [1,2,1,0,'Y'],
            [2,1,0,1,'N'],]
    labels=['age','income','job','credit']
    return dataSet,labels

Calling the function returns the data:

ds1,lab = createdataset()
print(ds1)
print(lab)

[[0, 2, 0, 0, 'N'], [0, 2, 0, 1, 'N'], [1, 2, 0, 0, 'Y'], [2, 1, 0, 0, 'Y'], [2, 0, 1, 0, 'Y'], [2, 0, 1, 1, 'N'], [1, 0, 1, 1, 'Y'], [0, 1, 0, 0, 'N'], [0, 0, 1, 0, 'Y'], [2, 1, 1, 0, 'Y'], [0, 1, 1, 1, 'Y'], [1, 1, 0, 1, 'Y'], [1, 2, 1, 0, 'Y'], [2, 1, 0, 1, 'N']]
['age', 'income', 'job', 'credit']

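2. Information entropy of the dataset

Information entropy, also called Shannon entropy, is the expected value of the information content of a random variable, and it measures how uncertain the information is. The larger the entropy, the harder the information is to pin down. Processing information means making it clearer, which is a process of reducing entropy.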

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Count how many samples fall into each class (the last column)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1

    # H = -sum(p * log2(p)) over the classes
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*log(prob,2)

    return shannonEnt

Information entropy of the sample data:

shan = calcShannonEnt(ds1)
print(shan)

0.9402859586706309
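The value printed above can be checked by hand. The dataset has 9 'Y' and 5 'N' samples out of 14. Writing p_k for the fraction of samples in class k (a notation chosen here, not taken from the article), the calculation is roughly:

```latex
H(X) = -\sum_{k} p_k \log_2 p_k
     = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}
     \approx 0.4098 + 0.5305
     \approx 0.9403
```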

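3. Information gain

Information gain measures how much attribute A contributes to reducing the entropy of the sample set X. The larger the gain, the better suited A is for classifying X.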
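Written as a formula, the definition above takes the usual form below. The symbols are assumptions introduced here for illustration: Values(A) is the set of values attribute A takes, and X_v is the subset of X in which A equals v.

```latex
\mathrm{Gain}(X, A) = H(X) - \sum_{v \in \mathrm{Values}(A)} \frac{|X_v|}{|X|}\, H(X_v)
```

The second term is the size-weighted average entropy of the subsets produced by splitting on A, so the gain is the amount of entropy the split removes.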

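# Helper that the function below relies on (not shown in the original listing):
# keep the rows whose column `axis` equals `value`, and drop that column
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis+1:])
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0])-1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)    # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Weighted entropy of the subsets produced by splitting on feature i
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prop = len(subDataSet)/float(len(dataSet))
            newEntropy += prop * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature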

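The code above implements the attribute-selection step of the ID3 decision tree learning algorithm, which is based on information gain. The core logic is: take each attribute in turn and split the sample set into subsets by that attribute's values; compute the entropy of those subsets, weighted by subset size; the difference between the entropy of the whole sample set and this weighted entropy is the information gain of splitting on that attribute; the attribute with the largest gain is the one used to split the sample set.

Computing the best splitting attribute for the sample data gives column 0, the age attribute: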

col = chooseBestFeatureToSplit(ds1)
col

0
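To see why column 0 is chosen, it can help to print the gain of every attribute. The following is a self-contained sketch, not the article's code: the names DATA, NAMES, entropy and info_gain are introduced here, and the dataset is repeated so the block runs on its own.

```python
from collections import Counter
from math import log2

# Same 14 encoded samples as createdataset(), repeated so this runs standalone.
DATA = [[0, 2, 0, 0, 'N'], [0, 2, 0, 1, 'N'], [1, 2, 0, 0, 'Y'], [2, 1, 0, 0, 'Y'],
        [2, 0, 1, 0, 'Y'], [2, 0, 1, 1, 'N'], [1, 0, 1, 1, 'Y'], [0, 1, 0, 0, 'N'],
        [0, 0, 1, 0, 'Y'], [2, 1, 1, 0, 'Y'], [0, 1, 1, 1, 'Y'], [1, 1, 0, 1, 'Y'],
        [1, 2, 1, 0, 'Y'], [2, 1, 0, 1, 'N']]
NAMES = ['age', 'income', 'job', 'credit']


def entropy(rows):
    """Shannon entropy (base 2) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def info_gain(rows, col):
    """Entropy of rows minus the weighted entropy after splitting on column col."""
    remainder = 0.0
    for value in set(row[col] for row in rows):
        subset = [row for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {name: info_gain(DATA, i) for i, name in enumerate(NAMES)}
for name, gain in gains.items():
    print(f'{name}: {gain:.3f}')
```

On this dataset the gains come out at roughly 0.247 for age, 0.029 for income, 0.152 for job and 0.048 for credit, so age has the largest gain.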

4. Building the decision tree

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

# Function that builds the tree
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # note: this deletes from the caller's list as well
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        
    return myTree

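majorityCnt handles the following situation: in an ideal decision tree, by the time a decision branch reaches the bottom, all remaining samples share the same class. Real samples, however, inevitably contain cases where every attribute is identical but the class differs; majorityCnt then assigns such samples the class that occurs most often.

createTree is the core function. It recursively applies the ID3 information gain calculation to the remaining attributes and finally produces the decision tree.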
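The majority vote described above can also be written with collections.Counter from the standard library. This is an alternative sketch rather than the article's implementation, and majority_label is a name chosen here.

```python
from collections import Counter


def majority_label(class_list):
    """Return the label that occurs most often in class_list."""
    # most_common(1) gives a one-element list holding a (label, count) pair
    return Counter(class_list).most_common(1)[0][0]


print(majority_label(['Y', 'N', 'Y']))
```

When two labels are tied, Counter returns the one it encountered first, so the result then depends on the order of the samples.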

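5. Building the tree from the sample data

Build the decision tree from the sample data:

Tree = createTree(ds1, lab)
print("Decision tree for the sample data:")
print(Tree)

Decision tree for the sample data:
{'age': {0: {'job': {0: 'N', 1: 'Y'}},
1: 'Y',
2: {'credit': {0: 'Y', 1: 'N'}}}}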
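The nested dictionary can be hard to read at a glance. One way to inspect it, sketched here under the assumption that the tree has exactly the shape printed above, is to flatten it into one rule per leaf. TREE and tree_to_rules are names introduced for this illustration.

```python
# The tree printed above, copied so the sketch runs on its own.
TREE = {'age': {0: {'job': {0: 'N', 1: 'Y'}},
                1: 'Y',
                2: {'credit': {0: 'Y', 1: 'N'}}}}


def tree_to_rules(tree, path=()):
    """Return a list of (conditions, label) pairs, one per leaf of the tree."""
    if not isinstance(tree, dict):
        return [(path, tree)]              # reached a leaf: path holds the conditions
    feature = next(iter(tree))             # the single key is the feature tested here
    rules = []
    for value, subtree in tree[feature].items():
        rules.extend(tree_to_rules(subtree, path + ((feature, value),)))
    return rules


rules = tree_to_rules(TREE)
for conditions, label in rules:
    text = ' and '.join(f'{f} == {v}' for f, v in conditions)
    print(f'if {text}: {label}')
```

Read this way, the tree says that users aged 30-40 always buy, users under 30 buy only if their job is stable, and users over 40 buy only if their credit rating is poor.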

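6. Classifying test samples

Given the information of two new users, predict whether each will buy the product:

Age	Income range	Job type	Credit rating
<30	Low	Stable	Good
<30	High	Unstable	Good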

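def classify(inputtree,featlabels,testvec):
    firststr = list(inputtree.keys())[0]      # feature tested at this node
    seconddict = inputtree[firststr]          # branches keyed by feature value
    featindex = featlabels.index(firststr)    # position of that feature in testvec
    classlabel = None                         # stays None if no branch matches the value
    for key in seconddict.keys():
        if testvec[featindex]==key:
            if isinstance(seconddict[key], dict):
                # internal node: keep descending
                classlabel=classify(seconddict[key],featlabels,testvec)
            else:
                # leaf: the predicted class
                classlabel=seconddict[key]
    return classlabel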

# createTree deleted an entry from lab, so the full label list is defined again here
labels=['age','income','job','credit']
tsvec=[0,0,1,1]
print('result:',classify(Tree,labels,tsvec))
tsvec1=[0,2,0,1]
print('result1:',classify(Tree,labels,tsvec1))

result: Y
result1: N
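As a sanity check, the tree can be run against all 14 training samples. The sketch below uses an iterative variant of the lookup that returns a default for attribute values the tree has never seen. The names classify_safe, TREE, DATA and LABELS are assumptions made for this example, and the tree and data are copied in so the block is self-contained.

```python
TREE = {'age': {0: {'job': {0: 'N', 1: 'Y'}},
                1: 'Y',
                2: {'credit': {0: 'Y', 1: 'N'}}}}
LABELS = ['age', 'income', 'job', 'credit']
DATA = [[0, 2, 0, 0, 'N'], [0, 2, 0, 1, 'N'], [1, 2, 0, 0, 'Y'], [2, 1, 0, 0, 'Y'],
        [2, 0, 1, 0, 'Y'], [2, 0, 1, 1, 'N'], [1, 0, 1, 1, 'Y'], [0, 1, 0, 0, 'N'],
        [0, 0, 1, 0, 'Y'], [2, 1, 1, 0, 'Y'], [0, 1, 1, 1, 'Y'], [1, 1, 0, 1, 'Y'],
        [1, 2, 1, 0, 'Y'], [2, 1, 0, 1, 'N']]


def classify_safe(tree, feat_labels, test_vec, default=None):
    """Walk the tree from the root; return default if a value has no branch."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                       # feature tested at this node
        value = test_vec[feat_labels.index(feature)]
        branches = node[feature]
        if value not in branches:
            return default                               # value never seen in training
        node = branches[value]
    return node                                          # a leaf holds the class label


correct = sum(classify_safe(TREE, LABELS, row[:-1]) == row[-1] for row in DATA)
print(f'training accuracy: {correct}/{len(DATA)}')
```

Full accuracy on the training set is expected here because no two training samples share all attributes while differing in class. It says nothing about how well the tree generalizes to new users.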

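Appendix: code for drawing the decision tree

The following code draws the decision tree as a figure. It is not central to the decision tree algorithm; interested readers can use it for reference.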

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

# Count the leaf nodes
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # a dict value is a subtree; anything else is a leaf
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

# Get the depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # a dict value is a subtree; anything else is a leaf
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

# Draw a node, with an arrow coming from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )

# Write the branch value at the midpoint between parent and child
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

# Draw the tree structure
def plotTree(myTree, parentPt, nodeTxt):  # the first key tells you which feature was split on
    numLeafs = getNumLeafs(myTree)  # this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]     # the text label for this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # dictionaries are subtrees; anything else is a leaf
            plotTree(secondDict[key],cntrPt,str(key))        # recursion
        else:   # it's a leaf node, so draw the leaf
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

# Create the decision tree figure
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    # no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5,1.0), '')
    plt.savefig('decision_tree.png',dpi=300,bbox_inches='tight')
    plt.show()
