Huawei Cloud AI Development Platform ModelArts: Missing Value Filling
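Overview
Given a configuration table of missing values, this component replaces missing values or specified fixed values in the input table with defined values.
Null values in numeric columns can be replaced with the column maximum, minimum, or mean, or with a custom value.
Null values or a specified fixed value in string or date columns can be replaced with a custom value.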
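The numeric replacement options described above can be sketched in plain Python, without Spark. The helper name `fill_numeric` and the strategy keywords are illustrative assumptions and not part of the component's API; `None` stands in for a null value, and the sketch assumes the column has at least one non-null value.

```python
def fill_numeric(values, strategy):
    """Replace None entries in a list of numbers.

    strategy is "max", "min", "mean", or a number used as a custom fill value.
    Hypothetical helper for illustration only.
    """
    present = [v for v in values if v is not None]
    if strategy == "max":
        fill = max(present)
    elif strategy == "min":
        fill = min(present)
    elif strategy == "mean":
        fill = sum(present) / len(present)
    else:
        fill = strategy  # any other value is treated as a custom fill value
    return [fill if v is None else v for v in values]


ages = [50, None, 40, 60]
print(fill_numeric(ages, "max"))   # [50, 60, 40, 60]
print(fill_numeric(ages, "mean"))  # [50, 50.0, 40, 60]
print(fill_numeric(ages, 45))      # [50, 45, 40, 60]
```

The statistic is computed from the non-null values only, so the nulls being filled do not influence the fill value.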
Component configuration methods
Method 1: Determine the fill strategy with a configuration table
Input
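Parameter | Sub-parameter | Description |
---|---|---|
inputs | dataDF | inputs is a dictionary; dataDF is a pyspark DataFrame object holding the input data |
inputs | paramDF | inputs is a dictionary; paramDF is a pyspark DataFrame object holding the configuration of the fields to be modified |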
Output
Parameter | Description |
---|---|
dataDF | A pyspark DataFrame object holding the data after filling |
paramDF | A pyspark DataFrame object holding the configuration of the fields that were modified |
Example
dataDF:
+---+----+-----------+--------+---------+-------+----+
|id |age |job        |marital |education|housing|loan|
+---+----+-----------+--------+---------+-------+----+
|0  |59  |admin.     |married |secondary|yes    |no  |
|1  |56  |admin.     |married |secondary|no     |no  |
|2  |41  |technician |married |secondary|yes    |no  |
|3  |55  |services   |married |secondary|yes    |no  |
|4  |54  |admin.     |married |tertiary |no     |no  |
|5  |null|management |single  |tertiary |yes    |yes |
|6  |56  |management |married |tertiary |yes    |yes |
|7  |60  |retired    |divorced|secondary|yes    |no  |
|8  |39  |technician |single  |unknown  |yes    |no  |
|9  |37  |technician |married |secondary|yes    |no  |
|10 |34  |admin.     |married |secondary|no     |no  |
|11 |null|null       |divorced|secondary|yes    |no  |
|12 |28  |services   |single  |secondary|yes    |no  |
|13 |30  |technician |married |secondary|yes    |no  |
|14 |36  |technician |married |secondary|yes    |yes |
|15 |37  |admin.     |single  |secondary|yes    |yes |
|16 |null|blue-collar|married |secondary|yes    |no  |
|17 |53  |services   |divorced|primary  |yes    |yes |
+---+----+-----------+--------+---------+-------+----+
paramDF:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
Result:
Data output:
+---+---+-----------+--------+---------+-------+----+
|id |age|job        |marital |education|housing|loan|
+---+---+-----------+--------+---------+-------+----+
|0  |59 |admin.     |married |secondary|yes    |no  |
|1  |56 |admin.     |married |secondary|no     |no  |
|2  |41 |technician |married |secondary|yes    |no  |
|3  |55 |services   |married |secondary|yes    |no  |
|4  |54 |admin.     |married |tertiary |no     |no  |
|5  |45 |management |single  |tertiary |yes    |yes |
|6  |56 |management |married |tertiary |yes    |yes |
|7  |60 |retired    |divorced|secondary|yes    |no  |
|8  |39 |technician |single  |primary  |yes    |no  |
|9  |37 |technician |married |secondary|yes    |no  |
|10 |34 |admin.     |married |secondary|no     |no  |
|11 |45 |blue-collar|divorced|secondary|yes    |no  |
|12 |28 |services   |single  |secondary|yes    |no  |
|13 |30 |technician |married |secondary|yes    |no  |
|14 |36 |technician |married |secondary|yes    |yes |
|15 |37 |admin.     |single  |secondary|yes    |yes |
|16 |45 |blue-collar|married |secondary|yes    |no  |
|17 |53 |services   |divorced|primary  |yes    |yes |
+---+---+-----------+--------+---------+-------+----+
Configuration output:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
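How the JSON rules in paramDF map onto the data output can be illustrated with a small plain-Python sketch. It is not the component's implementation: the helper names `cast_fill` and `apply_rules`, the use of dictionaries instead of DataFrame rows, and the rule that an IntegerType fill value of "45.0" is cast to the integer 45 are all assumptions inferred from the sample above.

```python
import json

# Rules copied from the paramDF sample: column name -> JSON rule text.
RULES = {
    "age": '{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}',
    "job": '{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}',
    "education": '{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}}',
}


def cast_fill(value, spark_type):
    """Convert the string replaced_value to a Python value (assumed mapping)."""
    if spark_type == "IntegerType":
        return int(float(value))
    return value


def apply_rules(rows, rules):
    """Return copies of rows with every matching value replaced."""
    specs = {col: json.loads(text) for col, text in rules.items()}
    result = []
    for row in rows:
        new_row = dict(row)
        for col, spec in specs.items():
            target = spec["paras"]["missing_value_type"]
            current = new_row.get(col)
            # "null" targets real nulls (None); any other target is a literal match.
            matches = (current is None) if target == "null" else (current == target)
            if matches:
                new_row[col] = cast_fill(spec["paras"]["replaced_value"], spec["type"])
        result.append(new_row)
    return result


# Three rows taken from the sample dataDF (ids 5, 8, and 11), reduced to the affected columns.
rows = [
    {"id": 5, "age": None, "job": "management", "education": "tertiary"},
    {"id": 8, "age": 39, "job": "technician", "education": "unknown"},
    {"id": 11, "age": None, "job": None, "education": "secondary"},
]
for row in apply_rules(rows, RULES):
    print(row)
```

The three rows come out with age 45 for ids 5 and 11, job blue-collar for id 11, and education primary for id 8, matching the data output in the sample.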
Method 2: Determine the fill strategy with the configs parameter
Input
Parameter | Sub-parameter | Description |
---|---|---|
inputs | dataDF | inputs is a dictionary; dataDF is a pyspark DataFrame object holding the input data |
Output
Parameter | Description |
---|---|
dataDF | A pyspark DataFrame object holding the data after filling |
paramDF | A pyspark DataFrame object holding the configuration of the fields that were modified |
Parameter description
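Parameter | Description |
---|---|
configs | Semicolon-separated rules, each written as column name, value to be replaced, fill value: column1,value_to_replace1,fill_value1;column2,value_to_replace2,fill_value2. Example: col_double,null,mean;col_string,null-empty,str_type_empty |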
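The configs format can be read as a list of three-field rules. A minimal parsing sketch in plain Python follows; `FillRule` and `parse_configs` are hypothetical names introduced for illustration, and each field is kept as an opaque string because the component's own interpretation of values such as null-empty is not described here.

```python
from typing import List, NamedTuple


class FillRule(NamedTuple):
    column: str  # column name
    target: str  # value to be replaced, for example "null"
    fill: str    # fill value or statistic keyword, for example "mean"


def parse_configs(configs: str) -> List[FillRule]:
    """Split a configs string into rules; raise on malformed entries."""
    rules = []
    for item in configs.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        parts = [part.strip() for part in item.split(",")]
        if len(parts) != 3:
            raise ValueError("expected 'column,target,fill' but got: " + repr(item))
        rules.append(FillRule(*parts))
    return rules


for rule in parse_configs("col_double,null,mean;col_string,null-empty,str_type_empty"):
    print(rule)
```

Because commas and semicolons act as separators, this sketch cannot express a fill value that itself contains either character.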
Example
dataDF:
+---+----+-----------+--------+---------+-------+----+
|id |age |job        |marital |education|housing|loan|
+---+----+-----------+--------+---------+-------+----+
|0  |59  |admin.     |married |secondary|yes    |no  |
|1  |56  |admin.     |married |secondary|no     |no  |
|2  |41  |technician |married |secondary|yes    |no  |
|3  |55  |services   |married |secondary|yes    |no  |
|4  |54  |admin.     |married |tertiary |no     |no  |
|5  |null|management |single  |tertiary |yes    |yes |
|6  |56  |management |married |tertiary |yes    |yes |
|7  |60  |retired    |divorced|secondary|yes    |no  |
|8  |39  |technician |single  |unknown  |yes    |no  |
|9  |37  |technician |married |secondary|yes    |no  |
|10 |34  |admin.     |married |secondary|no     |no  |
|11 |null|null       |divorced|secondary|yes    |no  |
|12 |28  |services   |single  |secondary|yes    |no  |
|13 |30  |technician |married |secondary|yes    |no  |
|14 |36  |technician |married |secondary|yes    |yes |
|15 |37  |admin.     |single  |secondary|yes    |yes |
|16 |null|blue-collar|married |secondary|yes    |no  |
|17 |53  |services   |divorced|primary  |yes    |yes |
+---+----+-----------+--------+---------+-------+----+
configs:
age,null,mean;job,null,blue-collar;education,unknown,primary
Result:
Data output:
+---+---+-----------+--------+---------+-------+----+
|id |age|job        |marital |education|housing|loan|
+---+---+-----------+--------+---------+-------+----+
|0  |59 |admin.     |married |secondary|yes    |no  |
|1  |56 |admin.     |married |secondary|no     |no  |
|2  |41 |technician |married |secondary|yes    |no  |
|3  |55 |services   |married |secondary|yes    |no  |
|4  |54 |admin.     |married |tertiary |no     |no  |
|5  |45 |management |single  |tertiary |yes    |yes |
|6  |56 |management |married |tertiary |yes    |yes |
|7  |60 |retired    |divorced|secondary|yes    |no  |
|8  |39 |technician |single  |primary  |yes    |no  |
|9  |37 |technician |married |secondary|yes    |no  |
|10 |34 |admin.     |married |secondary|no     |no  |
|11 |45 |blue-collar|divorced|secondary|yes    |no  |
|12 |28 |services   |single  |secondary|yes    |no  |
|13 |30 |technician |married |secondary|yes    |no  |
|14 |36 |technician |married |secondary|yes    |yes |
|15 |37 |admin.     |single  |secondary|yes    |yes |
|16 |45 |blue-collar|married |secondary|yes    |no  |
|17 |53 |services   |divorced|primary  |yes    |yes |
+---+---+-----------+--------+---------+-------+----+
Configuration output:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
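The replaced_value of "45.0" recorded for age in the configuration output is consistent with the mean of the non-null ages in the sample. The check below is plain Python; that the component resolves the mean keyword in exactly this way and serializes the entry like this is an assumption inferred from the sample, not a documented behaviour.

```python
import json

# The age column of the sample dataDF, with None standing in for null.
ages = [59, 56, 41, 55, 54, None, 56, 60, 39, 37, 34, None, 28, 30, 36, 37, None, 53]

# Resolve the "mean" keyword from the non-null values only.
present = [a for a in ages if a is not None]
mean_age = sum(present) / len(present)  # 675 / 15

# Build an entry shaped like the age row of the configuration output.
entry = {
    "name": "fillMissingValues",
    "type": "IntegerType",
    "paras": {"missing_value_type": "null", "replaced_value": str(mean_age)},
}
config_json = json.dumps(entry, separators=(",", ":"))

print(mean_age)     # 45.0
print(config_json)
```

The 15 non-null ages sum to 675, so the mean is exactly 45.0, and the integer column shows it as 45 in the data output.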