API
The main class of the package is the random_forestry.RandomForest class
which holds a model and has various member functions for fitting the model,
getting predictions, saving and loading models, and getting and setting parameters.
- random_forestry.RandomForest(ntree=500, replace=True, sampsize=None, sample_fraction=None, mtry=None, nodesize_spl=5, nodesize_avg=5, nodesize_strict_spl=1, nodesize_strict_avg=1, min_split_gain=0, max_depth=None, interaction_depth=None, splitratio=1.0, oob_honest=False, double_bootstrap=None, seed=140, verbose=False, nthread=0, splitrule='variance', middle_split=False, max_obs=None, linear=False, min_trees_per_fold=0, fold_size=1, monotone_avg=False, overfit_penalty=1, scale=False, double_tree=False, na_direction=False)
The Random Forest Regressor class.
- Parameters:
ntree (int, optional, default=500) – The number of trees to grow in the forest.
replace (bool, optional, default=True) – An indicator of whether sampling of the training data is done with replacement.
sampsize (int, optional) – The size of total samples to draw for the training data. If sampling with replacement, the default value is the length of the training data. If sampling without replacement, the default value is two-thirds of the length of the training data.
sample_fraction (float, optional) – If this is given, then sampsize is ignored and set to be round(len(y) * sample_fraction). It must be a real number between 0 and 1.
mtry (int, optional) – The number of variables randomly selected at each split point. The default value is set to be one-third of the total number of features of the training data.
nodesize_spl (int, optional, default=5) – Minimum observations contained in terminal nodes.
nodesize_avg (int, optional, default=5) – Minimum size of terminal nodes for averaging dataset.
nodesize_strict_spl (int, optional, default=1) – Minimum observations to follow strictly in terminal nodes.
nodesize_strict_avg (int, optional, default=1) – The minimum size of terminal nodes for the averaging data set to follow when predicting. No splits are allowed that result in nodes with fewer observations than this parameter. This parameter enforces overlap of the averaging data set with the splitting set when training. When using honesty, splits that leave fewer than nodesize_strict_avg averaging observations in either child node will be rejected, ensuring every leaf node also has at least nodesize_strict_avg averaging observations.
min_split_gain (float, optional, default=0) – Minimum loss reduction to split a node further in a tree.
max_depth (int, optional, default=99) – Maximum depth of a tree.
interaction_depth (int, optional, default=max_depth) – All splits at or above interaction depth must be on variables that are not weighting variables (as provided by the interaction_variables argument in fit).
splitratio (double, optional, default=1) – Proportion of the training data used as the splitting dataset. It is a ratio between 0 and 1. If the ratio is 1 (the default), then the splitting set uses the entire data, as does the averaging set—i.e., the standard Breiman RF setup. If the ratio is 0, then the splitting data set is empty, and the entire dataset is used for the averaging set (This is not a good usage, however, since there will be no data available for splitting).
oob_honest (bool, optional, default=False) – In this version of honesty, the out-of-bag observations for each tree are used as the honest (averaging) set. This setting also changes how predictions are constructed. When predicting for observations that are out-of-sample (predict(..., aggregation="average")), all the trees in the forest are used to construct predictions. When predicting for an observation that was in-sample (predict(..., aggregation="oob")), only the trees for which that observation was not in the averaging set are used to construct the prediction for that observation. aggregation="oob" (out-of-bag) ensures that the outcome value of an observation is never used to construct its own prediction, even when it is in sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. By default, when oob_honest=True, the out-of-bag observations for each tree are resampled with replacement to form the honest (averaging) set. This results in a third set of observations that are left out of both the splitting and averaging sets; we call these the double out-of-bag (doubleOOB) observations. To get the predictions of only the trees in which each observation fell into this doubleOOB set, run predict(..., aggregation="doubleOOB"). To skip this second bootstrap sample, set the double_bootstrap flag to False.
double_bootstrap (bool, optional, default=oob_honest) – The double_bootstrap flag provides the option to resample with replacement from the out-of-bag observations of each tree to construct the averaging set when using oob_honest. If this is False, the out-of-bag observations are used as the averaging set. By default this option is True when running oob_honest=True. This option increases diversity across trees.
seed (int, optional) – Random number generator seed. The default value is a random integer.
verbose (bool, optional, default=False) – Indicator to train the forest in verbose mode.
nthread (int, optional, default=0) – Number of threads to train and predict the forest. The default number is 0 which represents using all cores.
splitrule (str, optional, default='variance') – Specifies the loss function according to which the splits of the random forest are made. Only 'variance' is implemented at this point.
middle_split (bool, optional, default=False) – Indicator of whether the split value is the average of two feature values. If False, the split value is drawn from a uniform distribution between the two feature values.
max_obs (int, optional) – The max number of observations to split on. The default is the number of observations.
linear (bool, optional, default=False) – Indicator that enables Ridge penalized splits and linear aggregation functions in the leaf nodes. This is recommended for data with linear outcomes. For implementation details, see: https://arxiv.org/abs/1906.06463.
min_trees_per_fold (int, optional, default=0) – The number of trees which we make sure have been created leaving out each fold (each fold is a set of randomly selected groups). This is 0 by default, so no special treatment is given to the groups when sampling observations. If this is set to a positive integer, the bootstrap sampling scheme is modified to ensure that exactly that many trees have each group left out. We do this by, for each fold, creating min_trees_per_fold trees which are built on observations sampled from the set of training observations which are not in a group in the current fold. The folds form a random partition of all of the possible groups, each of size fold_size. This means we create at least # folds * min_trees_per_fold trees for the forest. If ntree > # folds * min_trees_per_fold, we create max(# folds * min_trees_per_fold, ntree) total trees, in which at least min_trees_per_fold are created leaving out each fold.
fold_size (int, optional, default=1) – The number of groups that are selected randomly for each fold to be left out when using min_trees_per_fold. When min_trees_per_fold is set and fold_size is set, all possible groups will be partitioned into folds, each containing fold_size unique groups (if fold_size doesn't evenly divide the number of groups, a single fold will be smaller, as it will contain the remaining groups). Then min_trees_per_fold trees are grown with each entire fold of groups left out.
monotone_avg (bool, optional, default=False) – This is a flag that indicates whether or not monotonic constraints should be enforced on the averaging set in addition to the splitting set. This flag is meaningless unless both honesty and monotonic constraints are in use.
overfit_penalty (float, optional, default=1) – Value to determine how much to penalize the magnitude of coefficients in ridge regression when using linear splits.
scale (bool, optional, default=False) – Indicates whether to scale and center the covariates and outcome before doing the regression. This can help with stability.
na_direction (bool, optional, default=False) – Sets a default direction for missing values in each split node during training. It tests placing all missing values to the left and to the right, then selects the direction that minimizes loss. If no missing values exist, then a default direction is randomly selected in proportion to the distribution of observations on the left and right.
double_tree (bool, optional, default=False) – Indicator of whether the number of trees is doubled as averaging and splitting data can be exchanged to create decorrelated trees.
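The defaults described for sampsize, sample_fraction, fold_size, and min_trees_per_fold involve a little arithmetic. The following is a minimal sketch of that arithmetic only, not the library's code: the helper names effective_sampsize and total_trees are hypothetical, and the rounding of the two-thirds default is an assumption.

```python
import math


def effective_sampsize(n_obs, replace=True, sampsize=None, sample_fraction=None):
    """Resolve the per-tree sample size from the documented defaults."""
    if sample_fraction is not None:
        if not 0 < sample_fraction <= 1:
            raise ValueError("sample_fraction must be between 0 and 1")
        # Documented rule: sampsize is ignored and recomputed from the fraction.
        return round(n_obs * sample_fraction)
    if sampsize is not None:
        return sampsize
    # Documented defaults: all rows with replacement, two-thirds without.
    # Assumption: the two-thirds value is rounded up; the library may differ.
    return n_obs if replace else math.ceil(2 * n_obs / 3)


def total_trees(ntree, n_groups, fold_size=1, min_trees_per_fold=0):
    """Number of trees grown once every fold has its guaranteed trees."""
    # The last fold may be smaller when fold_size does not divide n_groups.
    n_folds = math.ceil(n_groups / fold_size)
    return max(n_folds * min_trees_per_fold, ntree)
```

For instance, with 10 groups, fold_size=3, and min_trees_per_fold=200, there are four folds, so the forest would hold 800 trees even if ntree=500 was requested.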
- Variables:
processed_dta () –
A data structure containing information about the data after it has been preprocessed. processed_dta has the following entries:
processed_x (pandas.DataFrame) - The processed feature matrix.
y (numpy.array of shape[nrows,]) - The processed target values.
categorical_feature_cols (numpy.array) - An array of the indices of the categorical features in the feature matrix.
Note
In order for the program to recognize a feature as categorical, it must be converted into a Pandas categorical data type. The simplest way to do it is to use:
df['categorical'] = df['categorical'].astype('category')
Check out the Handling Categorical Data section for an example of how to use categorical features.
categorical_feature_mapping (list[dict]) - For each categorical feature, the data is encoded into a numeric representation. Those encodings are saved in categorical_feature_mapping. categorical_feature_mapping[i] has the following entries:
categorical_feature_col (int) - The index of the current categorical feature column.
unique_feature_values (list) - The categories of the current categorical feature.
numeric_feature_values (numpy.array) - The categories of the current categorical feature encoded into numeric representation.
feature_weights (numpy.array of shape[ncols]) - An array of sampling probabilities/weights for each feature used when subsampling mtry features at each node. Check out fit() for more details.
feature_weights_variables (numpy.array) - Indices of the features which weight more than max(feature_weights)*0.001.
deep_feature_weights (numpy.array of shape[ncols]) - Used in place of feature_weights for splits below interaction_depth. Check out fit() for more details.
deep_feature_weights_variables (numpy.array) - Indices of the features which weight more than max(deep_feature_weights)*0.001.
observation_weights (numpy.array of shape[nrows]) - Denotes the weights for each training observation that determine how likely the observation is to be selected in each bootstrap sample. Check out fit() for more details.
monotonic_constraints (numpy.array of shape[ncols]) - An array of size ncol specifying monotonic relationships between the continuous features and the outcome. Its entries are in -1, 0, 1, in which 1 indicates an increasing monotonic relationship, -1 indicates a decreasing monotonic relationship, and 0 indicates no constraint. Check out fit() for more details.
linear_feature_cols (numpy.array) - An array containing the indices of which features to split linearly on when using linear penalized splits. Check out fit() for more details.
groups_mapping (dict) - Contains information about the groups of the training observations. Has the following entries:
group_value (pandas.Index) - The categories of the groups.
group_numeric_value (numpy.array) - The categories of the groups encoded into numeric representation.
groups (pandas.Series(…, dtype='category')) - Specifies the group membership of each training observation. Check out fit() for more details.
col_means (numpy.array of shape[ncols]) - The mean value of each column.
col_sd (numpy.array of shape[ncols]) - The standard deviation of each column.
has_nas (bool) - Specifies whether the feature matrix contains missing observations or not.
na_direction (bool) - Sets a default direction for missing values in each split node during training
n_observations (int) - The number of observations in the training data.
num_columns (int) - The number of features in the training data.
feat_names (numpy.array of shape[ncols]) - The names of the features used for training.
Note that all of the entries in processed_dta are set to None or empty containers during initialization. They are only assigned a value after fit() is called.
saved_forest (list[dict]) –
For any tree i in the forest, saved_forest[i] is a dictionary which gives access to the underlying structure of that tree. saved_forest[i] has the following entries:
children_right (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, children_right[id] gives the id of the right child of that node. If leaf node, children_right[id] is -1.
children_left (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, children_left[id] gives the id of the left child of that node. If leaf node, children_left[id] is -1.
feature (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, feature[id] gives the index of the splitting feature in that node. If leaf node, feature[id] is the negative number of observations in the averaging set of that node.
n_node_samples (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, n_node_samples[id] gives the number of observations in the averaging set of that node.
threshold (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, threshold[id] gives the splitting point (threshold) of the split in that node. If leaf node, threshold[id] is 0.0.
values (numpy.array of shape[number of nodes in the tree,]) - For a node with a given id, if that node is a leaf node, values[id] gives the prediction made by that node. Otherwise, values[id] is 0.0.
Note
When a RandomForest is initialized, saved_forest is set to a list of ntree empty dictionaries. In order to populate those dictionaries, one must use the translate_tree() method.
forest (ctypes.c_void_p) – A ctypes pointer to the forestry object in C++. It is initially set to None and updated only after fit() is called.
dataframe (ctypes.c_void_p) – A ctypes pointer to the DataFrame object in C++. It is initially set to None and updated only after fit() is called.
- Return type:
None
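The rule behind feature_weights_variables and deep_feature_weights_variables (keep the indices whose weight exceeds 0.001 times the largest weight) can be sketched in a few lines. This is an illustration of the documented threshold only; the function name weighted_feature_indices is hypothetical and plain lists stand in for numpy arrays.

```python
def weighted_feature_indices(feature_weights, rel_threshold=0.001):
    """Indices of features whose weight exceeds rel_threshold * max weight."""
    cutoff = max(feature_weights) * rel_threshold
    return [i for i, w in enumerate(feature_weights) if w > cutoff]


# Features 1 and 2 carry (almost) no weight, so only 0 and 3 are kept.
kept = weighted_feature_indices([0.5, 0.0, 0.0004, 0.4996])
```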
- random_forestry.RandomForest.fit(self, x, y, *, interaction_variables=None, feature_weights=None, deep_feature_weights=None, observation_weights=None, lin_feats=None, monotonic_constraints=None, groups=None, seed=None)
Trains all the trees in the forest.
- Parameters:
x (pandas.DataFrame, pandas.Series, numpy.ndarray, 2d list of shape [nrows, ncols]) – The feature matrix.
y (array_like of shape [nrows,]) – The target values.
interaction_variables (array_like, optional, default=[]) – Indices of weighting variables.
feature_weights (array_like of shape [ncols,], optional) – A list of sampling probabilities/weights for each feature used when subsampling mtry features at each node above or at interaction_depth. The default is to use uniform probabilities.
deep_feature_weights (array_like of shape [ncols,], optional) – Used in place of feature_weights for splits below interaction_depth. The default is to use uniform probabilities.
observation_weights (array_like of shape [nrows,], optional) – Denotes the weights for each training observation that determine how likely the observation is to be selected in each bootstrap sample. The default is to use uniform probabilities. This option is not allowed when sampling is done without replacement.
lin_feats (array_like, optional) – A list containing the indices of which features to split linearly on when using linear penalized splits (defaults to use all numerical features).
monotonic_constraints (array_like of shape [ncols,], optional) – Specifies monotonic relationships between the continuous features and the outcome. Supplied as a list of length ncol with entries in 1, 0, -1, with 1 indicating an increasing monotonic relationship, -1 indicating a decreasing monotonic relationship, and 0 indicating no constraint. Constraints supplied for categorical variables will be ignored. Defaults to all 0-s (no constraints).
groups (pandas.Categorical(…), pandas.Series(…, dtype="category"), or other pandas categorical dtypes, optional, default=None) – A pandas categorical Series specifying the group membership of each training observation. These groups are used in the aggregation when doing out-of-bag predictions in order to predict with only trees where the entire group was not used for aggregation. This allows the user to specify custom subgroups, so that the prediction for any observation in a group never uses data from that same group. This can be used to create general custom resampling schemes, and provide predictions consistent with the Out-of-Group set.
seed (int, optional) – Random number generator seed. The default value is the RandomForest seed.
- Return type:
None
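To make the role of observation_weights concrete, here is a minimal sketch of a weighted bootstrap draw using only the standard library. It illustrates the idea that a row's weight controls how likely it is to enter each bootstrap sample; it is not the sampler the package uses, and weighted_bootstrap is a hypothetical name.

```python
import random


def weighted_bootstrap(observation_weights, sampsize, seed=None):
    """Draw row indices with replacement, proportionally to their weights."""
    if any(w < 0 for w in observation_weights) or sum(observation_weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    rng = random.Random(seed)
    rows = range(len(observation_weights))
    return rng.choices(rows, weights=observation_weights, k=sampsize)


# Row 1 has weight 0, so it can never be drawn; row 2 is three times
# as likely as row 0 on each draw.
sample = weighted_bootstrap([1, 0, 3], sampsize=50, seed=1)
```

Sampling with replacement is what makes per-row probabilities meaningful here, which is consistent with the note that observation weights are not allowed when replace=False.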
- random_forestry.RandomForest.predict(self, newdata=None, *, aggregation='average', seed=None, nthread=None, exact=None, trees=None, training_idx=None, return_weight_matrix=False)
Return the prediction from the forest.
- Parameters:
newdata (pandas.DataFrame, pandas.Series, numpy.ndarray, 2d list of shape [nsamples, ncols], default=None) – Testing predictors.
aggregation (str, optional, default='average') – How the individual tree predictions are aggregated:
'average' returns the mean of all trees in the forest.
'terminalNodes' also returns the weight matrix, as well as "terminalNodes", a matrix where the i-th entry of the j-th column is the index of the leaf node to which the i-th observation is assigned in the j-th tree, and "sparse", a matrix where the i-th entry in the j-th column is 1 if the i-th observation in newdata is assigned to the j-th leaf and 0 otherwise. In each tree the leaves are indexed using a depth-first ordering, and, in the "sparse" representation, the first leaf in the second tree has column index one more than the number of leaves in the first tree and so on. So, for example, if the first tree has 5 leaves, the sixth column of the "sparse" matrix corresponds to the first leaf in the second tree.
'oob' returns the out-of-bag predictions for the forest. We assume that the ordering of the observations in newdata has not changed from training. If the ordering has changed, the OOB indices will be wrong.
'doubleOOB' is an experimental flag, which can only be used when oob_honest=True and double_bootstrap=True. When both of these settings are on, the splitting set is selected as a bootstrap sample of observations and the averaging set is selected as a bootstrap sample of the observations which were left out of bag during the splitting set selection. This leaves a third set which is the observations which were not selected in either bootstrap sample. For each observation, this predict flag gives the predictions using only the trees in which the observation fell into this third set (so was neither a splitting nor averaging example).
'coefs' is an aggregation option which works only when linear aggregation functions have been used. This returns the linear coefficients for each linear feature which were used in the leaf node regression of each predicted point.
seed (int, optional) – Random number generator seed. The default value is the RandomForest seed.
nthread (int, optional) – The number of threads with which to run the predictions. This defaults to the number of threads with which the forest was trained.
exact (bool, optional) – This specifies whether the forest predictions should be aggregated in a reproducible ordering. Due to the non-associativity of floating point addition, when we predict in parallel, predictions will be aggregated in varied orders as different threads finish at different times. By default, exact is True unless N>100,000 or a custom aggregation function is used.
trees (array_like, optional) –
A list of indices in the range [0, ntree), which tells predict which trees in the forest to use for the prediction. Predict will by default take the average of all trees in the forest, although this flag can be used to get single tree predictions, or averages of different trees with different weightings.
Note
Duplicate entries are allowed, so if trees = [0,1,1] this will predict the weighted average prediction of only trees 0 and 1 weighted by:
predict(..., trees = [0,1,1]) = (predict(..., trees = [0]) + 2*predict(..., trees = [1])) / 3
We must have exact = True and aggregation = "average" to use tree indices.
Defaults to using all trees equally weighted.
training_idx (array_like, optional) – When doing OOB predictions with a data set that is of a different size than the training data, training_idx holds the indices of the training observations that should be used for determining the out-of-bag set for each observation in newdata. Entries must be between 1 and the number of training observations, and the length must be equal to the number of observations in newdata.
return_weight_matrix (bool, optional, default=False) – An indicator of whether or not we should also return a matrix of the weights given to each training observation when making each prediction. When getting the weight matrix, aggregation must be one of 'average', 'oob', and 'doubleOOB'.
- Returns:
An array of predicted responses.
- Return type:
numpy.array
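Two pieces of arithmetic described for predict can be checked by hand: the weighting implied by repeated entries in trees, and the column layout of the "sparse" matrix. The sketch below works on plain lists of per-tree predictions and leaf counts. Both helper names are hypothetical, and sparse_column returns a zero-based index, which is an assumption about how one would index the matrix in Python.

```python
from collections import Counter


def average_selected_trees(per_tree_predictions, trees):
    """Average per-tree predictions, counting repeated tree indices as weights."""
    counts = Counter(trees)
    n_obs = len(per_tree_predictions[0])
    return [
        sum(per_tree_predictions[t][i] * k for t, k in counts.items()) / len(trees)
        for i in range(n_obs)
    ]


def sparse_column(leaves_per_tree, tree, leaf):
    """Zero-based column of a leaf: leaves of earlier trees come first."""
    return sum(leaves_per_tree[:tree]) + leaf


# per_tree[t][i] is the prediction of tree t for observation i.
per_tree = [[1.0, 4.0], [4.0, 1.0]]
weighted = average_selected_trees(per_tree, trees=[0, 1, 1])

# With 5 leaves in the first tree, the first leaf of the second tree
# sits at zero-based index 5, i.e. the sixth column.
col = sparse_column([5, 3], tree=1, leaf=0)
```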
- random_forestry.RandomForest.get_oob(self, no_warning=False)
Calculate the out-of-bag error of a given forest. This is done by using the out-of-bag predictions for each observation, and calculating the MSE over the entire forest.
- Parameters:
no_warning (bool, optional, default=False) – A flag to not display warnings.
- Returns:
The OOB error of the forest.
- Return type:
float
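As a rough illustration of what an out-of-bag MSE measures, the sketch below averages, for each training row, only the trees that did not use that row, and then takes the mean squared error. It assumes the per-tree predictions and an in-bag indicator are available as nested lists. That is a simplification for illustration, not how the package stores them, and oob_mse is a hypothetical name.

```python
def oob_mse(y, per_tree_predictions, in_bag):
    """MSE of predictions that average only trees not trained on each row.

    per_tree_predictions[t][i] is tree t's prediction for row i, and
    in_bag[t][i] is True when row i was used to build tree t.
    """
    squared_errors = []
    for i, target in enumerate(y):
        oob = [preds[i] for preds, bag in zip(per_tree_predictions, in_bag) if not bag[i]]
        if not oob:
            continue  # Row was used by every tree; no OOB prediction exists.
        squared_errors.append((sum(oob) / len(oob) - target) ** 2)
    if not squared_errors:
        raise ValueError("no observation is out-of-bag for any tree")
    return sum(squared_errors) / len(squared_errors)


y_train = [1.0, 2.0, 3.0]
tree_preds = [[2.0, 0.0, 3.0], [5.0, 2.0, 5.0]]
bags = [[False, True, True], [True, False, True]]
error = oob_mse(y_train, tree_preds, bags)
```

Row 2 is in-bag for both trees, so it contributes nothing; the error is computed from rows 0 and 1 only.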
- random_forestry.RandomForest.get_vi(self, no_warning=False)
Calculate the percentage increase in OOB error of the forest when each feature is shuffled.
- Parameters:
no_warning (bool, optional, default=False) – A flag to not display warnings.
- Returns:
The variable importance of the forest.
- Return type:
ndarray | None
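The idea of measuring importance as the percentage increase in error after shuffling a feature can be sketched generically. The example below is an assumption-laden illustration: it uses plain MSE on the supplied rows rather than the forest's OOB error, a toy model in place of a forest, and hypothetical function names.

```python
import random


def mse(model, rows, y):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(model, rows, y, seed=0):
    """Percentage increase in MSE after shuffling each feature column."""
    rng = random.Random(seed)
    base = mse(model, rows, y)  # must be non-zero for a percentage to exist
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(100 * (mse(model, shuffled, y) - base) / base)
    return importances


# Toy data: the outcome depends on feature 0 only, plus a small alternating error.
rows = [[float(i), float(i % 3)] for i in range(20)]
y = [2 * r[0] + (0.5 if i % 2 else -0.5) for i, r in enumerate(rows)]
vi = permutation_importance(lambda r: 2 * r[0], rows, y)
```

Shuffling feature 1 leaves the toy model's error unchanged, so its importance is zero, while shuffling feature 0 increases the error.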
- random_forestry.RandomForest.score(self, X, y, sample_weight=None)
Gets the coefficient of determination (R²).
- Parameters:
X (pandas.DataFrame, pandas.Series, numpy.ndarray, 2d list of shape [nsamples, ncols]) – Testing samples.
y (array_like of shape [nsamples,]) – True outcome values of X.
sample_weight (array_like of shape [nsamples,], optional, default=None) – Sample weights. Uses equal weights by default.
- Returns:
The value of R².
- Return type:
float
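For reference, the usual weighted definition of R² is one minus the ratio of the weighted residual sum of squares to the weighted total sum of squares. The sketch below implements that textbook formula on plain lists. It is assumed, not confirmed by this page, that score uses the same definition.

```python
def r2_score(y_true, y_pred, sample_weight=None):
    """Coefficient of determination, optionally weighted per observation."""
    w = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    mean = sum(wi * yi for wi, yi in zip(w, y_true)) / sum(w)
    ss_res = sum(wi * (yi - pi) ** 2 for wi, yi, pi in zip(w, y_true, y_pred))
    ss_tot = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y_true))
    return 1 - ss_res / ss_tot
```

Under this definition a perfect prediction scores 1, and always predicting the (weighted) mean of y scores 0.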
- random_forestry.RandomForest.translate_tree(self, tree_ids=None)
Given a trained forest, translates the selected trees by allowing access to their underlying structure. After translating tree i, its structure will be stored as a dictionary in saved_forest and can be accessed by [RandomForest object].saved_forest[i]. Check out the saved_forest attribute for more details about its structure.
- Parameters:
tree_ids (int/array_like, optional) – The indices of the trees to be translated. By default, all the trees in the forest are translated.
- Return type:
None
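Once a tree has been translated, the arrays described under saved_forest are enough to walk it by hand. The sketch below follows one numeric-only tree from the root to a leaf. It rests on assumptions not stated on this page: the root has id 0, rows with a value below the threshold go left, and categorical splits and missing values are ignored. The stump dictionary is hand-built in the documented layout, not output from the library.

```python
def predict_one(tree, x):
    """Follow one translated tree from the root down to a leaf."""
    node = 0  # Assumption: the root has id 0.
    while tree["children_left"][node] != -1:  # -1 marks a leaf node.
        # Assumption: rows with a value below the threshold go left.
        if x[tree["feature"][node]] < tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    return tree["values"][node]


# Hand-built stump: split on feature 1 at 2.5.
stump = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [1, -4, -6],  # leaves hold minus their averaging-set size
    "n_node_samples": [10, 4, 6],
    "threshold": [2.5, 0.0, 0.0],
    "values": [0.0, 1.2, 3.4],
}

left_pred = predict_one(stump, [9.0, 1.0])
right_pred = predict_one(stump, [0.0, 7.0])
```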
- random_forestry.RandomForest.get_parameters(self)
Get the parameters of RandomForest.
- Returns:
A dictionary mapping parameter names of the RandomForest to their values.
- Return type:
dict
- random_forestry.RandomForest.set_parameters(self, **new_parameters)
Set the parameters of the RandomForest.
- Parameters:
**new_parameters (dict) – Forestry parameters.
- Returns:
A new RandomForest object with the given parameters. Note: this reinitializes the RandomForest object, so fit must be called on the new estimator.
- Return type:
RandomForest
- random_forestry.RandomForest.save_forestry(self, filename)
Given a trained forest, saves the forest using pickle in the file given by filename. This can be used to save a model for future analysis or share a model after training.
- Parameters:
filename (Path) – The name of the file to save the forest model to
- Return type:
None
- random_forestry.RandomForest.load_forestry(filename)
Loads a forest that has been saved using save_forestry. Since the forest contains a pointer to the C++ object, it is necessary to rebuild this object and relink the pointer before the forest can be used to make predictions etc.
- Parameters:
filename (Path) – The name of the file to load the forest model from.
- Return type:
None
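The reason a saved forest must be rebuilt on load is that a raw pointer to a native object has no meaning in another process. The sketch below shows the general pattern with a hypothetical ToyForest class: only the Python-side state is pickled, and a fresh handle is created after loading. None of these names come from the package, and the real rebuild step is of course more involved.

```python
import pickle
import tempfile
from pathlib import Path


class ToyForest:
    """Hypothetical stand-in for a model that wraps a native handle."""

    def __init__(self, params):
        self.params = params
        self.handle = None  # would be a ctypes pointer after training

    def build(self):
        # Placeholder for reconstructing the native object from saved state.
        self.handle = object()


def save(model, filename):
    state = {"params": model.params}  # the handle is deliberately left out
    with open(filename, "wb") as fh:
        pickle.dump(state, fh)


def load(filename):
    with open(filename, "rb") as fh:
        state = pickle.load(fh)
    model = ToyForest(state["params"])
    model.build()  # relink a fresh native object before predicting
    return model


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "forest.pkl"
    trained = ToyForest({"ntree": 500})
    trained.build()
    save(trained, path)
    restored = load(path)
```

As with any pickle file, a saved model should only be loaded from a source you trust.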