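Huawei Cloud AI Development Platform ModelArts: Chi-Square Selection
Overview
This component performs feature selection using the chi-squared test.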
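The basic idea of the chi-squared test (χ² test) is to select the feature variables most strongly related to the target variable by measuring the deviation between each feature variable and the target variable. The test first assumes that the two variables are independent, then observes how far the actual values deviate from the theoretical values expected under that assumption. This deviation represents the correlation between the two variables: the larger the deviation between a feature variable and the target variable, the stronger their correlation. Finally, the feature variables are ranked by correlation, and those most correlated with the target variable are selected. If the theoretical value is E and the actual value of the i-th sample is x_i, the deviation is calculated as follows:
$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E}$$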
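To make the deviation measure concrete, the following is a minimal pure-Python sketch, not the component's actual implementation. It assumes both the feature and the target are categorical and uses the standard contingency-table form of the statistic, in which each (feature value, label value) cell has its own expected count under independence. The function name `chi_square_statistic` and the toy data are hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature_values, label_values):
    """Pearson chi-square statistic for two categorical columns."""
    if len(feature_values) != len(label_values):
        raise ValueError("columns must have the same length")
    n = len(feature_values)
    # Observed count for every (feature value, label value) pair.
    observed = Counter(zip(feature_values, label_values))
    feature_totals = Counter(feature_values)
    label_totals = Counter(label_values)

    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = f_total * l_total / n
            actual = observed.get((f, l), 0)
            statistic += (actual - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: one feature mirrors the label, the other ignores it.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b"]

score_informative = chi_square_statistic(informative, labels)
score_uninformative = chi_square_statistic(uninformative, labels)
print(score_informative, score_uninformative)
```

The feature that tracks the label receives a large statistic, while the feature that is evenly spread across both label values scores zero, so ranking by this statistic puts the informative feature first.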
Input
Parameter | Sub-parameter | Description |
---|---|---|
inputs | dataframe | inputs is a dictionary; dataframe is a PySpark DataFrame object |
Output
Dataset
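Parameter description
Parameter | Sub-parameter | Description |
---|---|---|
input_features_str | – | Formatted string of input column names, for example "column_a" or "column_a,column_b" |
label_col | – | Target column; the chi-squared test is performed against this column |
chi_label_index_col | – | Name of the new column produced by label-encoding the target column; default: label_index |
chi_features_col | – | Name of the input feature vector column required by Spark's chi-square selector; default: input_features |
chi_output_col | – | Name of the output feature vector column produced by Spark's chi-square selector; default: output_features |
selector_type | – | Selection method of the chi-square selector; supported values: numTopFeatures, percentile, fpr, fdr, fwe |
num_top_features | – | Number of features to select; default: 50 |
percentile | – | Fraction of the original features to select; default: 0.1 |
fpr | – | Highest p-value for a feature to be kept; default: 0.05 |
fdr | – | Upper bound of the expected false discovery rate; default: 0.05 |
fwe | – | Upper bound of the expected family-wise error rate; default: 0.05 |
max_categories | – | Maximum number of categories for a feature; default: 1000 |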
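The five selector_type values differ only in how they turn per-feature p-values into a set of kept features. The sketch below illustrates this in pure Python. It is an assumption-laden illustration that follows the behaviour documented for Spark's ChiSqSelector (Benjamini-Hochberg for fdr, a threshold scaled by 1/numFeatures for fwe), not the component's own code; the function `select_features` and the sample p-values are hypothetical, and tie-breaking or boundary handling in the real component may differ.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by each selection mode."""
    n = len(p_values)
    # Rank feature indices from most to least significant (ascending p-value).
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        chosen = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        # Keep every feature whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: keep the top k features, where k is the
        # largest rank whose p-value satisfies p <= fdr * rank / n.
        cutoff = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                cutoff = rank
        chosen = ranked[:cutoff]
    elif selector_type == "fwe":
        # Threshold scaled by 1/n to bound the family-wise error rate.
        chosen = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError(f"unsupported selector_type: {selector_type}")
    return sorted(chosen)


# Hypothetical p-values for five features.
p_values = [0.001, 0.20, 0.012, 0.04, 0.60]

top2 = select_features(p_values, "numTopFeatures", num_top_features=2)
by_fpr = select_features(p_values, "fpr", fpr=0.05)
by_fdr = select_features(p_values, "fdr", fdr=0.05)
by_fwe = select_features(p_values, "fwe", fwe=0.05)
print(top2, by_fpr, by_fdr, by_fwe)
```

With the same 0.05 threshold, fpr keeps the most features, fdr fewer, and fwe the fewest, which reflects how strictly each mode controls false positives.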
Example
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,
        "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
        "chi_label_index_col": "label_index",  # @param {"label":"chi_label_index_col","type":"string","required":"true","helpTip":""}
        "chi_features_col": "input_features",  # @param {"label":"chi_features_col","type":"string","required":"true","helpTip":""}
        "chi_output_col": "output_features",  # @param {"label":"chi_output_col","type":"string","required":"true","helpTip":""}
        "selector_type": "numTopFeatures",  # @param {"label":"selector_type","type":"enum","required":"true","options":"numTopFeatures,percentile,fpr,fdr,fwe","helpTip":""}
        "num_top_features": 50,  # @param {"label":"num_top_features","type":"integer","required":"true","range":"(0,2147483647]","helpTip":""}
        "percentile": 0.1,  # @param {"label":"percentile","type":"number","required":"true","range":"[0,1]","helpTip":""}
        "fpr": 0.05,  # @param {"label":"fpr","type":"number","required":"true","range":"[0,1]","helpTip":""}
        "fdr": 0.05,  # @param {"label":"fdr","type":"number","required":"true","range":"[0,1]","helpTip":""}
        "fwe": 0.05,  # @param {"label":"fwe","type":"number","required":"true","range":"[0,1]","helpTip":""}
        "max_categories": 1000  # @param {"label":"max_categories","type":"number","required":"true","range":"(0,2147483647]","helpTip":""}
    }
    chi_square_selector____id___ = MLSChiSquareSelector(**params)
    chi_square_selector____id___.run()
    # @output {"label":"dataframe","name":"chi_square_selector____id___.get_outputs()['output_port_1']","type":"DataFrame"}