---
title: Random Forest
---

## Random Forest

A Random Forest is a group of decision trees that make better decisions as a whole than individually.

### Problem

On their own, decision trees are prone to **overfitting**: the tree fits the training data so closely that it struggles to make good predictions on data it has never seen before.
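To see overfitting in action, we can compare a fully grown tree's training accuracy with its accuracy on held-out data. This is a minimal sketch using a synthetic dataset (`make_classification` with some label noise), not the competition data used later:

```python
# A single, unconstrained decision tree memorizes its training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so perfect generalization is impossible
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A fully grown tree memorizes the training set...
print(tree.score(X_train, y_train))  # 1.0
# ...but does noticeably worse on data it has never seen
print(tree.score(X_test, y_test))
```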

### Solution with Random Forests

Random Forests belong to the category of **ensemble learning** algorithms, which combine many estimators to produce better results than any single one. This usually makes Random Forests **more accurate** than plain decision trees. A Random Forest builds a collection of decision trees, each **trained on a random subset of the data and a random subset of the features of that data**. Because **each tree works on different data and features** than the others, the chance of the ensemble overfitting is greatly reduced. This technique of creating many estimators and training each on a random subset of the data is called **bagging**, short for *Bootstrap AGGregatING*. To produce a prediction, the trees either vote on the answer (classification) or average their individual results (regression).
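A minimal sketch of this idea with scikit-learn's `RandomForestClassifier`, using a synthetic dataset in place of real data:

```python
# Bagging with scikit-learn: 100 trees, each fit on a bootstrap sample of
# the rows and considering a random subset of features at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_features="sqrt" means each split sees only sqrt(10) ~ 3 features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X_train, y_train)

# The forest's prediction is a majority vote over its trees
print(forest.score(X_test, y_test))
```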

### Example of Boosting in Python

Boosting is another *ensemble learning* technique: instead of training trees independently on random subsets as bagging does, it trains them sequentially, with each new tree concentrating on the examples the previous trees got wrong.

In this Kaggle competition, we are given a list of collision events and their properties, and must predict whether a τ → 3μ decay happened in each collision. Scientists currently assume that τ → 3μ does not happen, and the goal of the competition was to detect it occurring more frequently than current physics can explain.

The challenge here was to design a machine learning problem for something no one has ever observed before. Scientists at CERN prepared the following dataset to make that possible.

https://www.kaggle.com/c/flavours-of-physics/data

```python
# Data cleaning
import pandas as pd

data_test = pd.read_csv("test.csv")
data_train = pd.read_csv("training.csv")

# Drop the columns that do not appear in the test set
data_train = data_train.drop('min_ANNmuon', axis=1)
data_train = data_train.drop('production', axis=1)
data_train = data_train.drop('mass', axis=1)

# Cleaned data: 'signal' is the target, everything else is a feature
Y = data_train['signal']
X = data_train.drop('signal', axis=1)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

seed = 9001  # this one's over 9000!!!
boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                  algorithm="SAMME",
                                  n_estimators=50, random_state=seed)
model = boosted_tree.fit(X, Y)

predictions = model.predict(data_test)
print(predictions)
# Note: we can't really validate these predictions, since we don't have
# an array of "right answers" for the test set

# Stochastic gradient boosting
from sklearn.ensemble import GradientBoostingClassifier

gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50,
                                                   random_state=seed)
model2 = gradient_boosted_tree.fit(X, Y)

predictions2 = model2.predict(data_test)
print(predictions2)
```

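Since the competition's test set has no labels, one way to estimate how well the boosted model performs is cross-validation on the training data. The sketch below uses synthetic data standing in for `X` and `Y`, so it runs without the competition files:

```python
# Cross-validation sketch: estimate out-of-sample accuracy without a
# labeled test set. Synthetic data stands in for the competition's X and Y.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=400, n_features=8, random_state=9001)

boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                  n_estimators=50, random_state=9001)

# Mean accuracy across 5 folds approximates performance on unseen data
scores = cross_val_score(boosted_tree, X, Y, cv=5)
print(scores.mean())
```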
#### More Information:

- <a href='https://www.wikiwand.com/en/Random_forest' target='_blank' rel='nofollow'>Random Forests (Wikipedia)</a>
- <a href='https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/' target='_blank' rel='nofollow'>Introduction to Random Forests - Simplified</a>
- <a href='https://www.youtube.com/watch?v=loNcrMjYh64' target='_blank' rel='nofollow'>How Random Forest algorithm works (Video)</a>