Entropy, Information gain, and Gini Index: Decision Tree


Introduction to Decision Trees

The decision tree algorithm is one of the most widely used methods for inductive inference. It approximates discrete-valued target functions, handles noisy data robustly, and can learn complex patterns in the data efficiently.

Decision Tree Algorithms Overview

The family of decision tree learning algorithms includes ID3, CART, ASSISTANT, and others. These supervised learning algorithms perform classification and regression tasks. They classify instances by sorting them down the tree, starting at the root node and ending at a leaf node that provides the classification result.

Each node tests an attribute of the instance. Branches descending from that node represent possible values of that attribute. Classification starts at the root, tests an attribute, and moves down the corresponding branch. This process repeats at each subsequent node until reaching a leaf.
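
To make this traversal concrete, here is a minimal sketch in Python. The node representation (a dictionary holding the tested attribute and one branch per value) and the toy weather tree are assumptions made only for illustration:

```python
# Minimal sketch: classify an instance by sorting it down a decision tree.
# The node structure and the toy tree below are illustrative assumptions.

def classify(node, instance):
    """Follow the branch matching the instance's attribute value at each node."""
    while isinstance(node, dict):                     # internal nodes are dicts
        attribute = node["attribute"]                 # attribute tested at this node
        node = node["branches"][instance[attribute]]  # descend the matching branch
    return node                                       # a leaf is just a class label

# Hypothetical toy tree for a "play tennis"-style problem.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "wind",
                 "branches": {"strong": "no", "weak": "yes"}},
    },
}

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # yes
```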

Main Idea of Decision Trees

The main idea of a decision tree is to identify features that contain the most information about the target feature. Then, the algorithm splits the dataset along those features’ values so the target feature values at resulting nodes are as pure as possible. 

The feature that most reduces uncertainty about the target feature is called the most informative feature. The search for the most informative feature is repeated at each split until pure leaf nodes are reached.

Building a Decision Tree

Building a decision tree involves asking a question at every node and splitting the data accordingly. When multiple features influence the target value, deciding which feature to test first is crucial, as is the order of the features at subsequent splits.

This is where measuring the informativeness of features comes in. The data is split on the feature that provides the most information about the target, and this measure is called ‘information gain’. To understand it, we first need to understand entropy.

Entropy: Measuring Impurity

Entropy measures impurity or randomness in a dataset. Imagine choosing a yellow ball from a box filled only with 100 yellow balls. The entropy here is zero, meaning the dataset is pure.

Now, replace 30 yellow balls with red ones and 20 with blue ones. The chance of drawing a yellow ball drops from 1.0 to 0.5. Since impurity increased, entropy rises and purity decreases. Shannon’s entropy formula measures this using the base-2 logarithm of the probability, log2(P(x)). As the probability P(x) of drawing a yellow ball approaches 1, the entropy approaches zero.

When the target feature has multiple possible values, the overall entropy is the sum of each value’s information content (−log2 of its probability) weighted by its probability. This leads to Shannon’s entropy formula, which forms the baseline for the information gain calculation:

Entropy(X) = − Σ P(x=k) · log2(P(x=k)), summed over every value k of the target feature.

Here, P(x=k) is the probability that the target feature takes value k. Because logarithms of probabilities (numbers between 0 and 1) are negative, the formula uses a negative sign to keep entropy positive. Maximum entropy depends on the number of classes:

  • 2 classes: Max entropy = 1
  • 4 classes: Max entropy = 2
  • 8 classes: Max entropy = 3
  • 16 classes: Max entropy = 4
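
To make the formula concrete: for the box with 50 yellow, 30 red, and 20 blue balls, the entropy is −(0.5·log2(0.5) + 0.3·log2(0.3) + 0.2·log2(0.2)) ≈ 1.49 bits, while the all-yellow box gives 0. The short Python sketch below (the helper name `entropy` is illustrative, not a specific library API) computes the same values:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: -sum over k of P(x=k) * log2(P(x=k))."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

print(entropy(["yellow"] * 100))                                # 0.0 (may print -0.0): a pure box
print(entropy(["yellow"] * 50 + ["red"] * 30 + ["blue"] * 20))  # ~1.485 bits
print(entropy(["a", "b", "c", "d"] * 25))                       # 2.0 = log2(4), the 4-class maximum
```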

Information Gain: Choosing the Best Feature

Information gain helps find the best feature for splitting the data at the root. We split the dataset on each descriptive feature in turn and calculate the weighted entropy of the resulting partitions. Then, we subtract this remaining entropy from the original entropy of the dataset. The difference shows how much the feature reduces uncertainty about the target. This value is the information gain:

IG(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv), where Sv is the subset of S for which feature A takes value v.
  • The feature with the highest information gain becomes the root node to start building the tree.

The ID3 algorithm uses information gain to construct decision trees.
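
As a rough sketch of how ID3 would use this to pick the root split (the tiny weather-style dataset and feature names below are hypothetical, invented only for illustration), we can compute the information gain of each candidate feature and take the largest:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of target labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """IG(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv)."""
    before = entropy([row[target] for row in rows])
    remaining = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remaining += (len(subset) / len(rows)) * entropy(subset)
    return before - remaining

# Hypothetical toy dataset: does the weather allow a game to be played?
data = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
]

gains = {f: information_gain(data, f, "play") for f in ("outlook", "wind")}
print(gains)                                     # outlook ~0.571, wind ~0.420
print("root split:", max(gains, key=gains.get))  # outlook has the highest gain
```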

Gini Index: An Alternative Criterion

The Gini index calculates impurity by subtracting the sum of squared class probabilities from one. It tends to favor larger partitions and is easier to compute than information gain, which tends to favor splits into many smaller partitions with distinct values.

Gini = 1 − Σ P(x=k)², summed over every class k.

The feature whose split yields the lowest weighted Gini index is chosen for splitting. The classic CART algorithm uses the Gini index.
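
A minimal sketch of this calculation, assuming the same kind of label lists as in the earlier snippets (the helper names are illustrative): the impurity of a single node and the size-weighted impurity of a candidate split follow directly from the formula.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes k of P(x=k)^2."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(partitions):
    """Size-weighted Gini impurity of the partitions produced by a split."""
    total = sum(len(p) for p in partitions)
    return sum((len(p) / total) * gini(p) for p in partitions)

print(gini(["yes"] * 10))              # 0.0: a pure node
print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5: the two-class maximum
# A split that isolates most of one class has a low weighted Gini (better):
print(weighted_gini([["yes"] * 4, ["yes"] + ["no"] * 5]))  # ~0.167
```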

Conclusion

Entropy measures the uncertainty in a dataset, or equivalently how much information is needed to classify a new instance, and information gain measures how much of that uncertainty a split on a feature removes. These measures form the foundation of decision tree algorithms. Information gain (based on entropy) has a wider range, since entropy can exceed 1 when there are more than two classes, while the Gini index always stays below 1.
