Huawei Cloud AI Development Platform ModelArts: Chi-Square Selection

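Overview

This component performs feature selection using the chi-squared test.

The basic idea of the chi-squared test (χ² test) is to select the feature variables most strongly associated with the target variable by measuring how far the observed data deviate from what independence would predict. The test first assumes that the two variables are independent, then measures how far the observed values deviate from the theoretical (expected) values; this deviation indicates the strength of the association between the two variables. The larger the deviation between a feature variable and the target variable, the stronger their association. Finally, the feature variables are ranked by this association, and those most strongly associated with the target variable are selected. If the theoretical value is E and the observed value of the i-th sample is x_i, the deviation is computed as follows:

$$\chi^2 = \sum_{i} \frac{(x_i - E)^2}{E}$$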
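As a rough illustration of the deviation described above, the following sketch computes the Pearson chi-squared statistic between one categorical feature column and a label column, then ranks two toy features by it. The function name `chi_square_statistic` and the sample data are hypothetical and are not part of ModelArts; the expected count for each cell is derived from the independence assumption (row total × column total / sample count).

```python
from collections import Counter


def chi_square_statistic(feature_values, labels):
    """Pearson chi-squared statistic for two categorical columns (illustrative helper)."""
    if len(feature_values) != len(labels):
        raise ValueError("feature_values and labels must have the same length")
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # observed joint counts
    feature_totals = Counter(feature_values)         # marginal counts per feature value
    label_totals = Counter(labels)                   # marginal counts per label value

    statistic = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_count * y_count / n
            statistic += (observed[(f, y)] - expected) ** 2 / expected
    return statistic


# Toy data (made up for illustration).
labels = [1, 1, 0, 0, 1, 1, 0, 0]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]  # follows the label exactly
noise = ["x", "y", "x", "y", "x", "y", "x", "y"]        # unrelated to the label

scores = {
    "informative": chi_square_statistic(informative, labels),
    "noise": chi_square_statistic(noise, labels),
}
# Features with a larger deviation from independence rank first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A feature that tracks the label closely yields a large statistic, while a feature distributed independently of the label yields a statistic near zero, which is why the ranking can be used for selection.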

Input

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| inputs | dataframe | inputs is a dictionary; dataframe is a PySpark DataFrame object |

Output

Dataset

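Parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| input_features_str | | Comma-separated string of input column names, for example "column_a" or "column_a,column_b" |
| label_col | | Target column; the chi-squared test is run against this column |
| chi_label_index_col | | Name of the new column produced by label-encoding the target column; default is label_index |
| chi_features_col | | Name of the input feature vector column required by Spark's chi-square selector; default is input_features |
| chi_output_col | | Name of the output feature vector column produced by Spark's chi-square selector; default is output_features |
| selector_type | | Selection method used by the chi-square selector; supported values are numTopFeatures, percentile, fpr, fdr, and fwe |
| num_top_features | | Number of features to select; default is 50 |
| percentile | | Fraction of the original features to select; default is 0.1 |
| fpr | | Highest p-value for a feature to be kept; default is 0.05 |
| fdr | | Upper bound on the expected false discovery rate; default is 0.05 |
| fwe | | Upper bound on the expected family-wise error rate; default is 0.05 |
| max_categories | | Maximum number of categories for a feature; default is 1000 |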
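To make the five selector_type values more concrete, here is a minimal pure-Python sketch of how each one could turn per-feature p-values into a set of kept feature indices. It assumes the semantics of Spark ML's chi-square selector (top-k by p-value, top fraction, p-value threshold, Benjamini-Hochberg for fdr, and a threshold divided by the feature count for fwe); the page does not confirm that this component behaves identically. The function `select_features` and the p-values are hypothetical.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by the chosen strategy."""
    n = len(p_values)
    # Rank features from most to least significant (smallest p-value first).
    order = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        kept = order[:num_top_features]
    elif selector_type == "percentile":
        kept = order[:int(n * percentile)]
    elif selector_type == "fpr":
        kept = [i for i in order if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: keep everything up to the largest rank k
        # whose p-value satisfies p_(k) <= fdr * k / n.
        passing = [rank for rank, i in enumerate(order, start=1)
                   if p_values[i] <= fdr * rank / n]
        kept = order[:max(passing)] if passing else []
    elif selector_type == "fwe":
        # Threshold scaled by the number of features (assumed Spark behaviour).
        kept = [i for i in order if p_values[i] < fwe / n]
    else:
        raise ValueError(f"unsupported selector_type: {selector_type}")
    return sorted(kept)


# Made-up p-values for six features.
p_values = [0.001, 0.20, 0.012, 0.04, 0.03, 0.60]

results = {
    "numTopFeatures": select_features(p_values, "numTopFeatures", num_top_features=2),
    "percentile": select_features(p_values, "percentile", percentile=0.5),
    "fpr": select_features(p_values, "fpr"),
    "fdr": select_features(p_values, "fdr"),
    "fwe": select_features(p_values, "fwe"),
}
for name, kept in results.items():
    print(name, kept)
```

Only the threshold parameter that matches selector_type has any effect in this sketch; the others are ignored, which is also how the defaults in the table can coexist.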

Example

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "chi_label_index_col": "label_index",  # @param {"label":"chi_label_index_col","type":"string","required":"true","helpTip":""}
    "chi_features_col": "input_features",  # @param {"label":"chi_features_col","type":"string","required":"true","helpTip":""}
    "chi_output_col": "output_features",  # @param {"label":"chi_output_col","type":"string","required":"true","helpTip":""}
    "selector_type": "numTopFeatures",  # @param {"label":"selector_type","type":"enum","required":"true","options":"numTopFeatures,percentile,fpr,fdr,fwe","helpTip":""}
    "num_top_features": 50,  # @param {"label":"num_top_features","type":"integer","required":"true","range":"(0,2147483647]","helpTip":""}
    "percentile": 0.1,  # @param {"label":"percentile","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fpr": 0.05,  # @param {"label":"fpr","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fdr": 0.05,  # @param {"label":"fdr","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fwe": 0.05,  # @param {"label":"fwe","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "max_categories": 1000  # @param {"label":"max_categories","type":"number","required":"true","range":"(0,2147483647]","helpTip":""}
}
chi_square_selector____id___ = MLSChiSquareSelector(**params)
chi_square_selector____id___.run()
# @output {"label":"dataframe","name":"chi_square_selector____id___.get_outputs()['output_port_1']","type":"DataFrame"}
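
For readers who want to relate these parameters to plain PySpark, the sketch below shows one plausible way the column-name and selector parameters could map onto `StringIndexer`, `VectorAssembler`, and `ChiSqSelector` from `pyspark.ml.feature`. This is an assumption about the correspondence, not the actual implementation of MLSChiSquareSelector; the helper names `to_spark_selector_kwargs` and `build_pipeline` and the column names in `example_params` are hypothetical. Parameters such as max_categories, b_output_action, and b_use_default_encoder are left unmapped because the page does not say how they are used.

```python
def to_spark_selector_kwargs(params):
    """Translate this page's parameter names into ChiSqSelector keyword arguments."""
    return {
        "selectorType": params["selector_type"],
        "numTopFeatures": params["num_top_features"],
        "percentile": params["percentile"],
        "fpr": params["fpr"],
        "fdr": params["fdr"],
        "fwe": params["fwe"],
        "featuresCol": params["chi_features_col"],
        "outputCol": params["chi_output_col"],
        "labelCol": params["chi_label_index_col"],
    }


def build_pipeline(params):
    """Assemble label encoding, vector assembly, and chi-square selection.

    Requires pyspark and an active SparkSession; imports are local so the
    mapping helper above can be used without Spark installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import ChiSqSelector, StringIndexer, VectorAssembler

    feature_cols = [c.strip() for c in params["input_features_str"].split(",") if c.strip()]
    # Label-encode the target column into chi_label_index_col.
    indexer = StringIndexer(inputCol=params["label_col"],
                            outputCol=params["chi_label_index_col"])
    # Pack the input columns into a single feature vector column.
    assembler = VectorAssembler(inputCols=feature_cols,
                                outputCol=params["chi_features_col"])
    selector = ChiSqSelector(**to_spark_selector_kwargs(params))
    return Pipeline(stages=[indexer, assembler, selector])


# Hypothetical parameter values; column names are made up for illustration.
example_params = {
    "input_features_str": "column_a,column_b",
    "label_col": "label",
    "chi_label_index_col": "label_index",
    "chi_features_col": "input_features",
    "chi_output_col": "output_features",
    "selector_type": "numTopFeatures",
    "num_top_features": 50,
    "percentile": 0.1,
    "fpr": 0.05,
    "fdr": 0.05,
    "fwe": 0.05,
}
selector_kwargs = to_spark_selector_kwargs(example_params)
print(selector_kwargs)
```

With an active SparkSession and a DataFrame `df` containing the listed columns, `build_pipeline(example_params).fit(df).transform(df)` would add the label index, the assembled feature vector, and the selected feature vector as new columns.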
