title: Random Forest
A Random Forest is a group of decision trees that make better decisions as a whole than individually.
Decision trees by themselves are prone to overfitting. This means that the tree becomes so used to the training data that it has difficulty making decisions for data it has never seen before.
Solution with Random Forests
Random Forests belong in the category of ensemble learning algorithms. This class of algorithms use many estimators to yield better results. This makes Random Forests usually more accurate than plain decision trees. In Random Forests, a bunch of decision trees are created. Each tree is trained on a random subset of the data and a random subset of the features of that data. This way the possibility of the estimators getting used to the data (overfitting) is greatly reduced, because each of them work on the different data and features than the others. This method of creating a bunch of estimators and training them on random subsets of data is a technique in ensemble learning called bagging or Bootstrap AGGregatING. To get the prediction, the each of the decision trees vote on the correct prediction (classification) or they get the mean of their results (regression).
Example of Boosting in Python
In this competition, we are given a list of collision events and their properties. We will then predict whether a τ → 3μ decay happened in this collision. This τ → 3μ is currently assumed by scientists not to happen, and the goal of this competition was to discover τ → 3μ happening more frequently than scientists currently can understand.
The challenge here was to design a machine learning problem for something no one has ever observed before. Scientists at CERN developed the following designs to achieve the goal.
#Data Cleaning import pandas as pd data_test = pd.read_csv("test.csv") data_train = pd.read_csv("training.csv") data_train = data_train.drop('min_ANNmuon',1) data_train = data_train.drop('production',1) data_train = data_train.drop('mass',1) #Cleaned data Y = data_train['signal'] X = data_train.drop('signal',1) #adaboost from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier seed = 9001 #this ones over 9000!!! boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=50, random_state = seed) model = boosted_tree.fit(X, Y) predictions = model.predict(data_test) print(predictions) #Note we can't really validate this data since we don't have an array of "right answers" #stochastic gradient boosting from sklearn.ensemble import GradientBoostingClassifier gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50, random_state=seed) model2 = gradient_boosted_tree.fit(X,Y) predictions2 = model2.predict(data_test) print(predictions2)