Decision Tree#

Note

A decision tree splits each node on a single feature at every step.
ID3 splits according to information gain.
C4.5 splits according to information gain ratio.
CART splits according to squared error (regression) and Gini impurity (classification).
Pruning tries to minimize the leaves' entropy and the model's complexity at the same time.

Information Gain ID3#

Suppose we have a classification problem with dataset \(D=\{(x^{(1)}, y^{(1)}),...,(x^{(n)}, y^{(n)})\}\), where \(y^{(i)} \in \left\{1,...,k\right\}\).

The entropy of \(D\), which measures its uncertainty, is:

\[H(D) = E(-\log{p_{i}}) = -\sum_{i=1}^{k}p_{i}\log{p_{i}}\]
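Here \(p_{i}\) is the proportion of samples in \(D\) that belong to class \(i\). A minimal sketch (assuming NumPy) that estimates \(H(D)\) from a label array:

import numpy as np

def entropy(y):
    """Empirical entropy -sum_i p_i log(p_i) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))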

Assume feature \(A\) partitions \(D\) into \(D_{1},...,D_{m}\). The conditional entropy of \(D\) given \(A\) is:

\[H(D|A)=\sum_{i=1}^{m}\frac{\#D_{i}}{\#D}H(D_{i})\]

Information gain is the reduction in uncertainty:

\[g(D,A) = H(D) - H(D|A)\]

At each node, ID3 chooses the feature \(A\) that maximizes \(g(D,A)\) (see the sketch after this list) and keeps splitting until:

  1. the subset is empty

  2. the information gain \(g(D,A)\le\epsilon\)
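A minimal sketch of the ID3 selection step, reusing the entropy helper above and assuming X holds categorical feature columns:

def information_gain(y, feature):
    """g(D, A) = H(D) - H(D|A) for one categorical feature column."""
    h_conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        h_conditional += mask.mean() * entropy(y[mask])  # (#D_i / #D) * H(D_i)
    return entropy(y) - h_conditional

def id3_best_feature(X, y):
    """Index of the feature with the largest information gain."""
    gains = [information_gain(y, X[:, j]) for j in range(X.shape[1])]
    return int(np.argmax(gains))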

Information Gain Ratio C4.5#

Information gain prefers features \(A\) with many distinct values; more precisely, features for which

\[H_{A}(D) =-\sum_{i=1}^{m}\frac{\#D_{i}}{\#D}\log\frac{\#D_{i}}{\#D}\]

is large, where \(D_{1},...,D_{m}\) are the partitions of \(D\) by the values of \(A\).

Information gain ratio fixes this by dividing by \(H_{A}(D)\):

\[g_{R}(D,A)=\frac{g(D,A)}{H_{A}(D)}\]
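A sketch of the gain ratio, again reusing the helpers above (note that the entropy helper applied to the feature column itself is exactly \(H_{A}(D)\)):

def gain_ratio(y, feature):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    h_a = entropy(feature)   # split entropy H_A(D); large when A has many values
    if h_a == 0:             # feature takes a single value, no split possible
        return 0.0
    return information_gain(y, feature) / h_a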

CART: Classification and Regression Tree#

For regression problems, CART looks for the feature \(j\) and cut point \(s\) that minimize the squared error, where \(R_{1}(j,s)\) and \(R_{2}(j,s)\) are the two regions produced by the split and each inner minimum is attained at the region mean:

\[\underset{j,s}{min}\left[\underset{c_{1}}{min}\sum_{x_{i} \in R_{1}(j, s)}(y_{i} - c_{1})^{2} + \underset{c_{2}}{min}\sum_{x_{i} \in R_{2}(j, s)}(y_{i} - c_{2})^{2}\right]\]
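A brute-force sketch of this split search for a single node, assuming numeric features in a NumPy array X and targets y:

import numpy as np

def best_regression_split(X, y):
    """Return the (feature index j, cut point s) minimizing total squared error."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region's optimal constant prediction is its mean
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s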

For classification problems, CART uses the Gini impurity:

\[\mbox{Gini}(D) = E(1 - p_{i}) = \sum_{i=1}^{k}p_{i}(1 - p_{i})\]
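Like entropy, the Gini impurity is easy to compute from label counts; a minimal sketch, reusing NumPy as above:

def gini(y):
    """Gini impurity sum_i p_i (1 - p_i) = 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)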

Pruning#

Let \(\left | T \right |\) be the number of leaves of tree \(T\), and let leaf \(t\) contain \(\left | T_{t} \right |\) samples with entropy \(H(T_{t})\). The total entropy of the leaves is:

\[C(T) = \sum_{t=1}^{\left | T \right |} \left | T_{t} \right |H(T_{t})\]

We want to minimize this entropy while at the same time limiting the model's complexity.

The loss function is:

\[C_{\alpha}(T) = C(T) + \alpha\left | T \right |\]

If pruning a node decreases the loss function, the node is pruned.
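scikit-learn implements the closely related minimal cost-complexity pruning, exposed through the ccp_alpha parameter; its leaf cost is the weighted impurity of the splitting criterion, which matches \(C(T)\) up to normalization when criterion="entropy". A minimal sketch (the index -2 below is just an illustrative choice near the fully pruned end of the path):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# effective alphas along the pruning path: each value is the threshold at which
# collapsing one more subtree becomes worthwhile
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# refit with a chosen alpha; a larger alpha penalizes |T| more and gives a smaller tree
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=path.ccp_alphas[-2]).fit(X, y)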

Examples#

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
"""hyperparameters"""
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(criterion="entropy",  # default gini
                                  # maximum depth of that tree
                                  max_depth=3,
                                  # maximum number of leaf nodes
                                  max_leaf_nodes=15,
                                  # maximum number of features when splitting each node
                                  max_features=2,
                                  # min number of samples of a node before it can split
                                  min_samples_split=8,
                                  # min number of samples of a leaf node
                                  min_samples_leaf=3,
                                  # same as min_samples_leaf, but by weight frac
                                  min_weight_fraction_leaf=0.01)
tree_clf.fit(X, y)
DecisionTreeClassifier(criterion='entropy', max_depth=3, max_features=2,
                       max_leaf_nodes=15, min_samples_leaf=3,
                       min_samples_split=8, min_weight_fraction_leaf=0.01)
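Once fitted, the tree can be queried directly; the petal measurements below are just illustrative values:

tree_clf.predict_proba([[5.0, 1.5]])  # class probabilities at the reached leaf
tree_clf.predict([[5.0, 1.5]])        # predicted class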
"""visualize using graphviz, need 1.pip install graphviz, 2.brew install graphviz"""
from graphviz import Source
from sklearn.tree import export_graphviz

export_graphviz(tree_clf,
                out_file="iris_tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True
               )

Source.from_file("iris_tree.dot")