Random forest is a flexible, easy-to-use machine learning algorithm that can handle both classification and regression tasks. A previous article walked through random forest prediction in R; this post shows how to build a basic random forest classifier in Python.

The basic principle of a random forest is illustrated in the figure below:

We will mainly use Scikit-learn (sklearn), which provides ready-made implementations of the most common machine learning methods, including regression, dimensionality reduction, classification, and clustering. Here we use it to demonstrate a random forest:


# Modules we will need
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.datasets import load_iris
from sklearn import metrics

import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

Load the iris data:


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)    

Split the data into training and test sets:


# Random ~75/25 split: each row gets a uniform draw and joins the
# training set when the draw is <= 0.75
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train']], df[~df['is_train']]

Define the feature columns:


features = df.columns[0:4]

Build the RF model (the RandomForestClassifier class):


forest = RFC(n_jobs=2,
             n_estimators=500, # number of trees to grow before taking the majority vote
             criterion='gini')

# Encode the species labels as integer codes
y, _ = pd.factorize(train['species'])
forest.fit(train[features], y)
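A caveat worth knowing about the `pd.factorize` call: it assigns integer codes in order of first appearance, so the codes here line up with `iris.target_names` only because the iris rows are sorted by species. A tiny self-contained illustration:

```python
import pandas as pd

# pd.factorize assigns codes by order of first appearance,
# and returns the unique labels in that same order.
codes, uniques = pd.factorize(
    pd.Series(['setosa', 'setosa', 'versicolor', 'virginica']))
print(list(codes))    # [0, 0, 1, 2]
print(list(uniques))  # ['setosa', 'versicolor', 'virginica']
```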

Predict on the test set:


# Map the predicted integer codes back to species names
preds = iris.target_names[forest.predict(test[features])]

# Print the confusion matrix
confusion_matrix = pd.crosstab(index=test['species'],
                                   columns=preds,
                                   rownames=['actual'],
                                   colnames=['preds'])
print(confusion_matrix)

preds       setosa  versicolor  virginica
actual
setosa          13           0          0
versicolor       0          13          0
virginica        0           0         12
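As a side note, sklearn's own `metrics.confusion_matrix` (from the `metrics` module already imported above) produces the same counts as the `pd.crosstab` call, as a plain NumPy array. A sketch with a tiny made-up label sample, just to illustrate the call:

```python
from sklearn import metrics

# Rows of the result are actual classes, columns are predictions.
actual = ['setosa', 'setosa', 'versicolor', 'virginica']
preds = ['setosa', 'versicolor', 'versicolor', 'virginica']
cm = metrics.confusion_matrix(actual, preds,
                              labels=['setosa', 'versicolor', 'virginica'])
print(cm)
```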

Plot the confusion matrix as a heatmap:


sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ', metrics.accuracy_score(test['species'], preds))
plt.show()

Compute the feature importances:


# Compute the feature importances
importances = forest.feature_importances_
indices = np.argsort(importances)

# Plot the feature importances
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
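To inspect the same importances numerically, one option (not in the original post) is to wrap them in a pandas Series so the feature names stay attached while sorting; a minimal self-contained sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)
# A Series keeps names attached while sorting; sklearn
# normalizes the importances so they sum to 1.
imp = pd.Series(forest.feature_importances_,
                index=iris.feature_names).sort_values(ascending=False)
print(imp)
```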

That completes a simple random forest prediction model. Keep in mind that random forests can still overfit; common remedies include gathering more data, using cross-validation, and validating on several different test sets.
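The cross-validation idea mentioned above can be sketched with sklearn's `cross_val_score` (a standard utility, though not used in this post):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# 5-fold cross-validation: each fold is held out once for testing,
# so the mean accuracy is less optimistic than a single lucky split.
scores = cross_val_score(forest, iris.data, iris.target, cv=5)
print(scores, scores.mean())
```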

The complete code for this post:


from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.datasets import load_iris
from sklearn import metrics

import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt


if __name__ == "__main__":

    # Load the iris data
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)

    # Split the data into training and test sets
    df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
    df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

    train, test = df[df['is_train']], df[~df['is_train']]

    # Define the feature columns
    features = df.columns[0:4]

    # Build the RF model
    forest = RFC(n_jobs=2,
                 n_estimators=500,
                 criterion='gini')
    y, _ = pd.factorize(train['species'])
    forest.fit(train[features], y)

    # Predict on the test set
    preds = iris.target_names[forest.predict(test[features])]

    # Print the confusion matrix
    confusion_matrix = pd.crosstab(index=test['species'],
                                   columns=preds,
                                   rownames=['actual'],
                                   colnames=['preds'])
    print(confusion_matrix)

    # Plot the confusion matrix as a heatmap
    sn.heatmap(confusion_matrix, annot=True)
    print('Accuracy: ', metrics.accuracy_score(test['species'], preds))
    plt.show()

    # Compute the feature importances
    importances = forest.feature_importances_
    indices = np.argsort(importances)

    # Plot the feature importances
    plt.figure(1)
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b', align='center')
    plt.yticks(range(len(indices)), features[indices])
    plt.xlabel('Relative Importance')
    plt.show()

References:

1. https://www.datacamp.com/community/tutorials/random-forests-classifier-python