Selecting Features with Recursive Feature Elimination

2015-04-01

Recursive feature elimination (RFE) is a method for feature selection.

The scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html summarizes the basic idea of RFE:

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

Chapter 11 of Building Machine Learning Systems with Python also covers RFE and illustrates it with a flowchart.

As the flowchart shows, RFE is an iterative process (not actually recursive, despite the name).
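In code, that iteration can be sketched in a few lines (a toy sketch: `weight_fn` is a hypothetical stand-in for training an estimator on the current subset and reading off per-feature weights such as coef_):

```python
def rfe(features, n_select, step, weight_fn):
    """Toy RFE loop: repeatedly drop the `step` lowest-weight features
    until only `n_select` remain."""
    features = list(features)
    while len(features) > n_select:
        weights = weight_fn(features)            # "retrain" on the current subset
        k = min(step, len(features) - n_select)  # never eliminate past the target
        # drop the k features with the smallest absolute weight
        drop = set(sorted(features, key=lambda f: abs(weights[f]))[:k])
        features = [f for f in features if f not in drop]
    return features
```

The real implementation (shown at the end of this post) works the same way, only with boolean masks over a feature matrix instead of lists.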

The next sections show how to use the RFE tool in scikit-learn; the final section walks through its implementation.

Loading the dataset


The dataset consists of text documents in 8 categories. Each document was tokenized with jieba (结巴分词); each word was treated as a feature, bigrams were added as extra features, then stop words and features occurring too often or too rarely were removed, leaving 19,630 features. 1,998 samples are used for training and 509 for testing. Following the bag-of-words model, each document is converted into a vector; the training and test sets thus become matrices, saved in npy format with Python's numpy module.
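As a rough sketch of that preprocessing (toy English tokens standing in for jieba output; stop-word and frequency filtering omitted):

```python
from collections import Counter

def to_counts(tokens):
    """Unigram plus bigram counts for one tokenized document."""
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

docs = [["machine", "learning", "rocks"],
        ["machine", "learning", "works"]]
# vocabulary: every unigram and bigram seen in any document
vocab = sorted({f for d in docs for f in to_counts(d)})
# bag-of-words matrix: one row per document, one column per feature
matrix = [[to_counts(d)[f] for f in vocab] for d in docs]
```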

Download documents.7z from https://github.com/letiantian/dataset-for-classifying, extract it, and load the data:

$ ls
test_data.npy  test_labels.npy  training_data.npy  training_labels.npy  
$ ipython
>>> import numpy as np
>>> training_data = np.load("training_data.npy")
>>> training_data.shape
(1998, 19630)
>>> training_labels = np.load("training_labels.npy")
>>> training_labels
array([6, 6, 6, ..., 2, 2, 2])  
>>> training_labels.shape
(1998,)
>>> test_data = np.load("test_data.npy")
>>> test_data.shape
(509, 19630)
>>> test_labels = np.load("test_labels.npy")
>>> test_labels.shape
(509,)

Using RFE


RFE is a framework: a method that can assign importances to features must be plugged into it. This post uses logistic regression (introduced in Li Hang's Statistical Learning Methods, 《统计学习方法》).

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_selection import RFE
>>> estimator = LogisticRegression()  # logistic regression
>>> selector = RFE(estimator, 3000, step=200)
>>> selector = selector.fit(training_data, training_labels)
>>> selector.support_   # features marked True are the selected ones
array([False, False,  True, ..., False, False, False], dtype=bool)
>>> selector.ranking_  # smaller rank means more important
array([26, 64,  1, ..., 26, 56, 26])
>>> new_train_data = training_data[:, selector.support_]
>>> new_test_data = test_data[:, selector.support_]
>>> new_train_data.shape
(1998, 3000)
>>> new_test_data.shape
(509, 3000)

Here, selector = RFE(estimator, 3000, step=200) means that each iteration removes 200 features, until 3000 features remain.
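The number of elimination rounds implied by these parameters can be checked directly:

```python
n_features, n_select, step = 19630, 3000, 200
remaining, rounds = n_features, 0
while remaining > n_select:
    # each round removes at most `step` features, never past the target
    remaining -= min(step, remaining - n_select)
    rounds += 1
print(rounds)  # 84 rounds; the last round removes only 30 features
```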

The selector can also predict sample labels with the logistic regression model it trained internally:

>>> lr_predict_labels = selector.predict(test_data)
>>> sum(lr_predict_labels == test_labels)   # 448 samples predicted correctly
448
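Out of 509 test samples, 448 correct works out to:

```python
correct, total = 448, 509
accuracy = correct / total
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.880
```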

For comparison, logistic regression on the original feature set:

>>> lr = LogisticRegression()
>>> lr.fit(training_data, training_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,  
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
>>> lr_predict_labels = lr.predict(test_data)
>>> sum(lr_predict_labels == test_labels)   # 446 samples predicted correctly
446  

Multinomial naive Bayes on the new data


On the original feature set:

>>> from sklearn.naive_bayes import MultinomialNB
>>> bayes = MultinomialNB() 
>>> bayes.fit(training_data, training_labels)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> bayes_predict_labels = bayes.predict(test_data)
>>> sum(bayes_predict_labels == test_labels)  # 454 samples predicted correctly
454

On the new feature set:

>>> bayes = MultinomialNB() 
>>> bayes.fit(new_train_data, training_labels)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> bayes_predict_labels = bayes.predict(new_test_data)
>>> sum(bayes_predict_labels == test_labels)   # 456 samples predicted correctly
456

Bernoulli naive Bayes on the new data


On the original feature set:

>>> from sklearn.naive_bayes import BernoulliNB
>>> bayes = BernoulliNB() 
>>> bayes.fit(training_data, training_labels)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> bayes_predict_labels = bayes.predict(test_data)
>>> sum(bayes_predict_labels == test_labels) 
387

On the new feature set:

>>> bayes = BernoulliNB()
>>> bayes.fit(new_train_data, training_labels)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> bayes_predict_labels = bayes.predict(new_test_data)
>>> sum(bayes_predict_labels == test_labels)  # a large improvement over 387
433

Random forest on the new data


On the original feature set:

>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier(n_estimators=100)
>>> rf.fit(training_data, training_labels)
RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=None,
            max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
>>> rf_predict_labels = rf.predict(test_data)
>>> sum(rf_predict_labels == test_labels)
422

On the new feature set:

>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier(n_estimators=100)
>>> rf.fit(new_train_data, training_labels)
RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=None,
            max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
>>> rf_predict_labels = rf.predict(new_test_data)
>>> sum(rf_predict_labels == test_labels)   # little change
421

How RFE is implemented


As the results above show, after RFE removes features, most classifiers do no worse, and some even improve. So how is RFE implemented?

In the scikit-learn source, the implementation of RFE lives in sklearn/feature_selection/rfe.py. Below is an excerpt with comments added:

class RFE(BaseEstimator, MetaEstimatorMixin, SelectorMixin):

    def __init__(self, estimator, n_features_to_select=None, step=1,
                 estimator_params={}, verbose=0):
        self.estimator = estimator
        self.n_features_to_select = n_features_to_select
        self.step = step
        self.estimator_params = estimator_params
        self.verbose = verbose

    def fit(self, X, y):
        """Fit the RFE model and then the underlying estimator on the selected
           features.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            The training input samples.

        y : array-like, shape = [n_samples]
            The target values.
        """
        X, y = check_X_y(X, y, "csc")
        # Initialization
        n_features = X.shape[1]
        if self.n_features_to_select is None:
            n_features_to_select = n_features / 2
        else:
            n_features_to_select = self.n_features_to_select

        if 0.0 < self.step < 1.0:
            step = int(self.step * n_features)
        else:
            step = int(self.step)
        if step <= 0:
            raise ValueError("Step must be >0")

        support_ = np.ones(n_features, dtype=np.bool)   # True means the feature is kept
        ranking_ = np.ones(n_features, dtype=np.int)    # smaller rank is better
        # Elimination
        while np.sum(support_) > n_features_to_select:
            # Remaining features
            ''' What np.arange does:
            np.arange(3) -> array([0, 1, 2])
            np.arange(3)[np.array([True, False, True])] -> array([0, 2])
            '''
            features = np.arange(n_features)[support_]  # original indices of the remaining features; e.g. if the 3rd value is 9, the 3rd remaining feature is original feature 9

            # Rank the remaining features
            estimator = clone(self.estimator)
            estimator.set_params(**self.estimator_params)
            if self.verbose > 0:
                print("Fitting estimator with %d features." % np.sum(support_))

            estimator.fit(X[:, features], y)  # train on the current feature subset

            ''' What np.argsort does:
            >>> x = np.array([3, 1, 2])
            >>> np.argsort(x)
            array([1, 2, 0])
            '''
            if estimator.coef_.ndim > 1:
                ranks = np.argsort(safe_sqr(estimator.coef_).sum(axis=0))  # sort features by importance, smallest first
            else:
                ranks = np.argsort(safe_sqr(estimator.coef_))

            # for sparse case ranks is matrix
            ''' What np.ravel does:
            >>> x = np.array([[1, 2, 3], [4, 5, 6]])
            >>> np.ravel(x)
            array([1, 2, 3, 4, 5, 6])
            '''
            ranks = np.ravel(ranks)

            # Eliminate the worse features
            ''' Fancy indexing and np.logical_not:
            >>> np.array([89, 20, 100])[np.array([2, 0, 1])]
            array([100,  89,  20])
            >>> np.logical_not(3)
            False
            >>> np.logical_not([True, False, 0, 1])
            array([False,  True,  True, False], dtype=bool)
            '''
            threshold = min(step, np.sum(support_) - n_features_to_select)  # how many features to remove this round
            support_[features[ranks][:threshold]] = False 
            ranking_[np.logical_not(support_)] += 1

        # Set final attributes
        self.estimator_ = clone(self.estimator)
        self.estimator_.set_params(**self.estimator_params)
        self.estimator_.fit(X[:, support_], y)
        self.n_features_ = support_.sum()
        self.support_ = support_
        self.ranking_ = ranking_

        return self

The instance attribute estimator is an estimator such as a logistic regression classifier, and it must expose a coef_ attribute. Once trained, coef_ serves as a per-feature weight. In each iteration, RFE refits the estimator to obtain a fresh coef_ and prunes the lowest-weighted features, repeating until only the requested number of features remains.
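In the multiclass case coef_ has shape (n_classes, n_features), so the line `safe_sqr(estimator.coef_).sum(axis=0)` squares the weights and sums them across classes to get one score per feature. A small made-up example of that ranking step:

```python
import numpy as np

# hypothetical coef_ for 3 classes and 5 features
coef = np.array([[ 0.1, -2.0,  0.3, 0.0,  1.0],
                 [ 0.2,  1.5, -0.1, 0.0, -1.0],
                 [-0.1,  0.5,  0.2, 0.1,  0.5]])
# square, then sum over classes -> one importance score per feature,
# which is what safe_sqr(estimator.coef_).sum(axis=0) computes
importance = (coef ** 2).sum(axis=0)
ranks = np.argsort(importance)  # least important feature first
print(ranks)  # [3 0 2 4 1]: feature 3 is weakest, feature 1 strongest
```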

A small caveat: judging from the implementation above, the estimator must expose a coef_ attribute to indicate each feature's importance. A decision tree ought to be usable here too, but a decision tree only has a feature_importances_ attribute (see sklearn.tree.DecisionTreeClassifier), not coef_, so in practice a decision tree cannot be plugged into the RFE implementation above. My guess at the rationale: in a decision tree, the more important a feature, the closer it sits to the root. After removing the features the tree deems unimportant, the importance ordering of the features in the newly built tree remains the same as before. In other words, with a decision tree you could simply select the important features in one shot via feature_importances_; selecting repeatedly would give the same result as selecting once.
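That argument can be checked on a toy example: if per-feature importance really were a fixed global ordering, unchanged as features are removed, then eliminating the lowest-ranked features pass by pass selects exactly the global top-k (the importance values below are made up; the experiment at the end of this post suggests real tree importances do shift between passes):

```python
# made-up, fixed importance scores for 10 features
importance = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]

def select_iterative(n_select, step):
    """Drop the `step` least-important remaining features per pass."""
    feats = list(range(len(importance)))
    while len(feats) > n_select:
        k = min(step, len(feats) - n_select)
        feats = sorted(feats, key=lambda f: importance[f])[k:]
    return sorted(feats)

# one-shot selection: just take the global top 4 by importance
one_shot = sorted(sorted(range(10), key=lambda f: importance[f])[-4:])
print(select_iterative(4, 2) == one_shot)  # True: same selection
```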

Since the stock RFE insists on coef_, let's adapt it for trees ourselves (mytools.py):

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.utils import check_X_y, safe_sqr
from sklearn.base import clone

class RFE4Tree(RFE):

    def __init__(self, estimator, n_features_to_select=None, step=1,
                 estimator_params={}, verbose=0):
        self.estimator = estimator
        self.n_features_to_select = n_features_to_select
        self.step = step
        self.estimator_params = estimator_params
        self.verbose = verbose

    def fit(self, X, y):

        X, y = check_X_y(X, y, "csc")
        # Initialization
        n_features = X.shape[1]
        if self.n_features_to_select is None:
            n_features_to_select = n_features / 2
        else:
            n_features_to_select = self.n_features_to_select

        if 0.0 < self.step < 1.0:
            step = int(self.step * n_features)
        else:
            step = int(self.step)
        if step <= 0:
            raise ValueError("Step must be >0")

        support_ = np.ones(n_features, dtype=np.bool)   # True means the feature is kept
        ranking_ = np.ones(n_features, dtype=np.int)    # smaller rank is better
        # Elimination
        while np.sum(support_) > n_features_to_select:
            print('>once')  # debug: marks one elimination round
            # Remaining features
            features = np.arange(n_features)[support_]  # original indices of the remaining features; e.g. if the 3rd value is 9, the 3rd remaining feature is original feature 9

            # Rank the remaining features
            estimator = clone(self.estimator)
            estimator.set_params(**self.estimator_params)
            if self.verbose > 0:
                print("Fitting estimator with %d features." % np.sum(support_))

            estimator.fit(X[:, features], y)  # train on the current feature subset

            ranks = np.argsort(estimator.feature_importances_)  # use the tree's feature_importances_ attribute

            # for sparse case ranks is matrix
            ranks = np.ravel(ranks)

            # Eliminate the worse features
            threshold = min(step, np.sum(support_) - n_features_to_select)  # how many features to remove this round
            support_[features[ranks][:threshold]] = False 
            ranking_[np.logical_not(support_)] += 1

        # Set final attributes
        self.estimator_ = clone(self.estimator)
        self.estimator_.set_params(**self.estimator_params)
        self.estimator_.fit(X[:, support_], y)
        self.n_features_ = support_.sum()
        self.support_ = support_
        self.ranking_ = ranking_

        return self

Selecting the same number of features in one pass vs. many passes


Using scikit-learn's RFE:

>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> estimator = LogisticRegression()  # logistic regression
>>> selector1 = RFE(estimator, 3000, step=200)  # needs many elimination rounds
>>> selector1 = selector1.fit(training_data, training_labels)
>>> training_data.shape
(1998, 19630)
>>> selector2 = RFE(estimator, 3000, step=19630-3000)  # a single round suffices
>>> selector2 = selector2.fit(training_data, training_labels)
>>> sum(selector1.support_)
3000
>>> sum(selector2.support_)
3000
>>> sum(selector2.support_ == selector2.support_)  # sanity check: a mask agrees with itself everywhere
19630
>>> sum(selector1.support_ == selector2.support_)  # the two approaches select different features
19208
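The 19208 agreeing positions can be turned into a count of differing features: since both masks contain exactly 3000 True entries, the mismatches split evenly between the two selections:

```python
n_features, agree = 19630, 19208
mismatch = n_features - agree   # positions where the two masks differ
unique_each = mismatch // 2     # features picked by one selector but not the other
print(mismatch, unique_each)    # 422 211
```

So each selector keeps 211 features the other discards, about 7% of the 3000 selected.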

# build the new feature matrices
>>> new_train_data2 = training_data[:, selector2.support_]
>>> new_test_data2 = test_data[:, selector2.support_]

# with BernoulliNB
>>> from sklearn.naive_bayes import BernoulliNB
>>> bayes = BernoulliNB() 
>>> bayes.fit(new_train_data2, training_labels)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> bayes_predict_labels = bayes.predict(new_test_data2)
>>> sum(bayes_predict_labels == test_labels)  # about the same as before (433)
431

# with the LogisticRegression trained inside selector2
>>> lr_predict_labels = selector2.predict(test_data)
>>> sum(lr_predict_labels == test_labels)     # about the same
449

What about decision trees?

>>> from sklearn.tree import DecisionTreeClassifier
>>> import mytools
>>> estimator = DecisionTreeClassifier(criterion='entropy')
>>> selector1 = mytools.RFE4Tree(estimator, 3000, step=200)  # many rounds
>>> selector2 = mytools.RFE4Tree(estimator, 3000, step=19630-3000)  # one round
>>> selector1.fit(training_data, training_labels)
>>> selector2.fit(training_data, training_labels)
>>> sum(selector1.support_)
3000
>>> sum(selector2.support_)
3000
>>> sum(selector1.support_ == selector2.support_)  # the two selections differ, unlike the earlier intuition suggested
13954
>>> tree_predict_labels = selector1.predict(test_data)  # classify with the features chosen by selector1
>>> sum(tree_predict_labels == test_labels)  # correct predictions for selector1
347
>>> tree_predict_labels = selector2.predict(test_data)  # classify with the features chosen by selector2
>>> sum(tree_predict_labels == test_labels)  # correct predictions for selector2
340
>>> estimator.fit(training_data, training_labels)  # train on all features
>>> tree_predict_labels = estimator.predict(test_data)  # predict
>>> sum(tree_predict_labels == test_labels)        
341

(End)