Decision tree
Random Forest is a method for classification, regression, and some kinds of prediction. The method is based on the
decision tree definition as a binary tree-like graph of decisions and possible consequences.
The image above shows a possible decision tree for a training set of items with three features X, Y, and Z. Z is the resolution feature to classify items by. Usually, a resolution feature contains categorical values; for numerical versions of algorithms, categorical values should be replaced by numeric codes. A training set is a number of items with known values of Z. Each terminal node contains a number of items from the training set after the decision tree has been generated, and an average value of Z feature can be calculated as an arithmetic mean and may be considered as a predicted value.
For example, there is an item {X=9, Y=3, Z=?} and prediction of a value of Z feature which is required. After the descent, according to the tree graph and conditions, we could say Z is equal to an arithmetic mean of Z values for training items from terminal node #4.
Generation of decision tree
The algorithm to generate a decision tree is recursive. Every iteration finds the best way to split a current training subset into two parts and get a dividing condition. Which division is the best? There are some criteria.
Gini coefficient is more suitable for the case of categorical values (classification). Since example values are numerical (regression), the
RSS (residual sum of squares) criteria will be applied, but the Gini coefficient is an option as well.
The following steps describe splitting algorithms in general. For each feature (except resolution feature, Z in the example above).
- Calculate RSS for current subset;
- Sort items by feature value;
- For each possible division into two subsets, calculate RSS for left (L) and right (R) subset;
- Find the division with min(RSS(L)+RSS(R)) and return feature name and value.
Criteria to stop recursion
- RSS for current subset is equal to 0;
- Items count in current subset is less than or equal to the acceptable maximum number of items in a terminal node.
Decision tree will be generated and training set will be divided into a number of categories (Terminal nodes) as a result.
Random forest
A forest is a number of trees. And what is a "random" forest? It is a number of decision trees generated based on a random subset of the initial training set. The important thing is that any random subset to generate a decision tree can contain duplicate training items. Each tree of a forest makes its prediction; a prediction of a forest is an arithmetic mean of all trees' predictions.
Library
First, prepare Microsoft Excel file with training data in a format like below.
X | Y | Z |
---|
4.49519 | 4.972543 | 1 |
3.050089 | 2.18187 | 2 |
0.943726 | 1.617163 | 1 |
3.886888 | 4.365553 | 1 |
.. | ... | ... |
Where X, Y, Z – feature the names of training item (values for each feature are known, Z is a resolution feature and contains numeric codes for a categorical description of classes). Please keep in mind mind that the current version of the library does not validate the data file.
- using RandomForest.Lib.Numerical.Interfaces;
-
-
- ForestGrowParameters p = new ForestGrowParameters
- {
- ExportDirectoryPath = [folder to save json tree files],
- ExportToJson = true,
- ResolutionFeatureName = [name of the feature to classify training items by],
- ItemSubsetCountRatio = [ratio to get a number of training items to form subset and generate tree],
- TrainigDataPath = [path to training data MS Excel file],
- MaxItemCountInCategory = [max number of items in a terminal node],
- TreeCount = [number of trees to generate],
- SplitMode = [RSS or Gini]
- };
-
-
- IForest forest = ForestFactory.Create();
- forest.GrowAsync(p);
-
-
- IItemNumerical item = forest.CreateItem();
- item.SetValue("X", 1.23);
- item.SetValue("Y", 2.05);
- double z = forest.Resolve(item);
Application
- Select Microsoft Excel file with training set data
- Select folder to save JSON files
- Input name of resolution feature (Z in training-data-sample.xlsx)
- Set number of trees
- Set max number of items in category/leaf (recommended 5-10)
- Set training subset count ratio (affects on algorithm duration, depends on training set count)
- Click "Generate" button and wait until progress reaches 100 percent
- After generation completes, a list of item feature names appears (X, Y in training-data-sample.xlsx). Input corresponding values and click "Resolve" button to get a prediction for resolution feature (Z).
Example trainig set
Here are some resolution hits. Note that any resolution has a random component and any fractional value should be rounded to the nearest integer (for the given example training set).
- X = 2.5, Y = 1.75, Z => 1
- X = 3, Y = 3, Z => 0
- X = 4, Y = 5, Z => 1.992 (2)
- X = 5, Y = 5, Z => 2
- X = 3.8, Y = 4, Z => 1.54804761904762 (2)
Find sample source code at
GitHub.