Huawei Cloud AI Development Platform ModelArts: Missing Value Filling

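Overview

Given a configuration table of missing-value rules, this component replaces missing values or specified fixed values in the input table with the defined values.

Null values in numeric columns can be replaced with the column's maximum, minimum, or mean, or with a custom value.
Null values, or a specified fixed value, in string and date columns can be replaced with a custom value.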
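
The numeric fill options described above can be illustrated with a minimal sketch in plain Python. It is not the ModelArts component's implementation: the helper name `fill_numeric_nulls` and its `strategy` argument are hypothetical and exist only for this illustration, and a Python list with `None` stands in for a numeric column with nulls.

```python
from statistics import mean


def fill_numeric_nulls(values, strategy):
    """Replace None entries in a list of numbers.

    strategy is "max", "min", "mean", or a number used as a custom fill value.
    Hypothetical helper for illustration; not part of ModelArts.
    """
    present = [v for v in values if v is not None]
    if strategy == "max":
        fill = max(present)
    elif strategy == "min":
        fill = min(present)
    elif strategy == "mean":
        fill = mean(present)
    else:
        fill = strategy  # custom value supplied by the caller
    return [fill if v is None else v for v in values]


ages = [59, None, 41, 56]
filled_mean = fill_numeric_nulls(ages, "mean")   # null replaced by the mean of 59, 41, 56
filled_max = fill_numeric_nulls(ages, "max")     # null replaced by the largest value
filled_custom = fill_numeric_nulls(ages, 45)     # null replaced by a custom value
```

The statistic is computed only over the non-null entries, which is the usual convention for this kind of fill; whether the component does the same is an assumption here.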

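Component configuration methods

Method 1: Determine the fill strategy with a configuration table

Input

+-------------+-------------+-------------------------------------------------------------------+
|Parameter    |Sub-parameter|Description                                                        |
+-------------+-------------+-------------------------------------------------------------------+
|inputs (dict)|dataDF       |pyspark DataFrame holding the input data                           |
|inputs (dict)|paramDF      |pyspark DataFrame holding the configuration of the fields to modify|
+-------------+-------------+-------------------------------------------------------------------+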

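Output

+---------+------------------------------------------------------------------+
|Parameter|Description                                                       |
+---------+------------------------------------------------------------------+
|dataDF   |pyspark DataFrame holding the data after filling                  |
|paramDF  |pyspark DataFrame holding the configuration of the modified fields|
+---------+------------------------------------------------------------------+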

Example

dataDF:
+---+----+-----------+--------+---------+-------+----+
|id |age |job        |marital |education|housing|loan|
+---+----+-----------+--------+---------+-------+----+
|0  |59  |admin.     |married |secondary|yes    |no  |
|1  |56  |admin.     |married |secondary|no     |no  |
|2  |41  |technician |married |secondary|yes    |no  |
|3  |55  |services   |married |secondary|yes    |no  |
|4  |54  |admin.     |married |tertiary |no     |no  |
|5  |null|management |single  |tertiary |yes    |yes |
|6  |56  |management |married |tertiary |yes    |yes |
|7  |60  |retired    |divorced|secondary|yes    |no  |
|8  |39  |technician |single  |unknown  |yes    |no  |
|9  |37  |technician |married |secondary|yes    |no  |
|10 |34  |admin.     |married |secondary|no     |no  |
|11 |null|null       |divorced|secondary|yes    |no  |
|12 |28  |services   |single  |secondary|yes    |no  |
|13 |30  |technician |married |secondary|yes    |no  |
|14 |36  |technician |married |secondary|yes    |yes |
|15 |37  |admin.     |single  |secondary|yes    |yes |
|16 |null|blue-collar|married |secondary|yes    |no  |
|17 |53  |services   |divorced|primary  |yes    |yes |
+---+----+-----------+--------+---------+-------+----+
paramDF:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
Result:
Data output:
+---+---+-----------+--------+---------+-------+----+
|id |age|job        |marital |education|housing|loan|
+---+---+-----------+--------+---------+-------+----+
|0  |59 |admin.     |married |secondary|yes    |no  |
|1  |56 |admin.     |married |secondary|no     |no  |
|2  |41 |technician |married |secondary|yes    |no  |
|3  |55 |services   |married |secondary|yes    |no  |
|4  |54 |admin.     |married |tertiary |no     |no  |
|5  |45 |management |single  |tertiary |yes    |yes |
|6  |56 |management |married |tertiary |yes    |yes |
|7  |60 |retired    |divorced|secondary|yes    |no  |
|8  |39 |technician |single  |primary  |yes    |no  |
|9  |37 |technician |married |secondary|yes    |no  |
|10 |34 |admin.     |married |secondary|no     |no  |
|11 |45 |blue-collar|divorced|secondary|yes    |no  |
|12 |28 |services   |single  |secondary|yes    |no  |
|13 |30 |technician |married |secondary|yes    |no  |
|14 |36 |technician |married |secondary|yes    |yes |
|15 |37 |admin.     |single  |secondary|yes    |yes |
|16 |45 |blue-collar|married |secondary|yes    |no  |
|17 |53 |services   |divorced|primary  |yes    |yes |
+---+---+-----------+--------+---------+-------+----+
Configuration output:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
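
To make the Method 1 sample easier to follow, here is a hedged sketch in plain Python of how rules shaped like the paramDF rows could be applied to a single record. It does not use pyspark and is not the component's code. The helpers `cast_fill` and `apply_rules` are hypothetical, and two behaviours are assumptions read off the sample: the literal "null" in `missing_value_type` marks a null cell, and an `IntegerType` fill value such as "45.0" ends up as the integer 45.

```python
import json

# (feature, json) pairs copied from the paramDF sample.
param_rows = [
    ("age", '{"name":"fillMissingValues","type":"IntegerType",'
            '"paras":{"missing_value_type":"null","replaced_value":"45.0"}}'),
    ("job", '{"name":"fillMissingValues","type":"StringType",'
            '"paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}'),
    ("education", '{"name":"fillMissingValues","type":"StringType",'
                  '"paras":{"missing_value_type":"unknown","replaced_value":"primary"}}'),
]


def cast_fill(value, type_name):
    # Assumption: IntegerType fill values are cast to int (the sample shows 45, not 45.0).
    if type_name == "IntegerType":
        return int(float(value))
    return value


def apply_rules(record, rows):
    """Return a copy of record with every matching rule applied (hypothetical helper)."""
    out = dict(record)
    for feature, raw in rows:
        rule = json.loads(raw)
        paras = rule["paras"]
        missing = paras["missing_value_type"]
        current = out.get(feature)
        # Assumption: "null" targets null cells; any other string targets that fixed value.
        is_target = current is None if missing == "null" else current == missing
        if is_target:
            out[feature] = cast_fill(paras["replaced_value"], rule["type"])
    return out


row11 = {"id": 11, "age": None, "job": None, "education": "secondary"}
row8 = {"id": 8, "age": 39, "job": "technician", "education": "unknown"}
filled11 = apply_rules(row11, param_rows)
filled8 = apply_rules(row8, param_rows)
```

Under these assumptions, record 11 gets age 45 and job blue-collar, and record 8 gets education primary, which matches the data output shown above.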

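Method 2: Determine the fill strategy with the configs parameter

Input

+-------------+-------------+----------------------------------------+
|Parameter    |Sub-parameter|Description                             |
+-------------+-------------+----------------------------------------+
|inputs (dict)|dataDF       |pyspark DataFrame holding the input data|
+-------------+-------------+----------------------------------------+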

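Output

+---------+------------------------------------------------------------------+
|Parameter|Description                                                       |
+---------+------------------------------------------------------------------+
|dataDF   |pyspark DataFrame holding the data after filling                  |
|paramDF  |pyspark DataFrame holding the configuration of the modified fields|
+---------+------------------------------------------------------------------+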

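Parameter description

+---------+---------------------------------------------------------------------------------+
|Parameter|Description                                                                      |
+---------+---------------------------------------------------------------------------------+
|configs  |Format: column1,value_to_replace1,fill_value1;column2,value_to_replace2,fill_value2|
|         |Example: col_double,null,mean;col_string,null-empty,str_type_empty               |
+---------+---------------------------------------------------------------------------------+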
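
The configs format can be read as a list of three-field rules. The sketch below, in plain Python, splits such a string into its parts. The function name `parse_configs` and the dictionary keys are hypothetical. It assumes that no field contains a comma or a semicolon, and it does not interpret keywords such as mean or null-empty.

```python
def parse_configs(configs):
    """Split 'col,target,fill;col,target,fill' into a list of rule dicts.

    Hypothetical helper for illustration; raises ValueError if a rule
    does not have exactly three comma-separated fields.
    """
    rules = []
    for item in configs.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        column, target, fill = (part.strip() for part in item.split(","))
        rules.append({"column": column, "target": target, "fill": fill})
    return rules


rules = parse_configs("col_double,null,mean;col_string,null-empty,str_type_empty")
```

Each rule keeps its fields as raw strings, so deciding what mean or null-empty means is left to whatever consumes the rules.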

Example

dataDF:
+---+----+-----------+--------+---------+-------+----+
|id |age |job        |marital |education|housing|loan|
+---+----+-----------+--------+---------+-------+----+
|0  |59  |admin.     |married |secondary|yes    |no  |
|1  |56  |admin.     |married |secondary|no     |no  |
|2  |41  |technician |married |secondary|yes    |no  |
|3  |55  |services   |married |secondary|yes    |no  |
|4  |54  |admin.     |married |tertiary |no     |no  |
|5  |null|management |single  |tertiary |yes    |yes |
|6  |56  |management |married |tertiary |yes    |yes |
|7  |60  |retired    |divorced|secondary|yes    |no  |
|8  |39  |technician |single  |unknown  |yes    |no  |
|9  |37  |technician |married |secondary|yes    |no  |
|10 |34  |admin.     |married |secondary|no     |no  |
|11 |null|null       |divorced|secondary|yes    |no  |
|12 |28  |services   |single  |secondary|yes    |no  |
|13 |30  |technician |married |secondary|yes    |no  |
|14 |36  |technician |married |secondary|yes    |yes |
|15 |37  |admin.     |single  |secondary|yes    |yes |
|16 |null|blue-collar|married |secondary|yes    |no  |
|17 |53  |services   |divorced|primary  |yes    |yes |
+---+----+-----------+--------+---------+-------+----+
configs:
age,null,mean;job,null,blue-collar;education,unknown,primary

Result:
Data output:
+---+---+-----------+--------+---------+-------+----+
|id |age|job        |marital |education|housing|loan|
+---+---+-----------+--------+---------+-------+----+
|0  |59 |admin.     |married |secondary|yes    |no  |
|1  |56 |admin.     |married |secondary|no     |no  |
|2  |41 |technician |married |secondary|yes    |no  |
|3  |55 |services   |married |secondary|yes    |no  |
|4  |54 |admin.     |married |tertiary |no     |no  |
|5  |45 |management |single  |tertiary |yes    |yes |
|6  |56 |management |married |tertiary |yes    |yes |
|7  |60 |retired    |divorced|secondary|yes    |no  |
|8  |39 |technician |single  |primary  |yes    |no  |
|9  |37 |technician |married |secondary|yes    |no  |
|10 |34 |admin.     |married |secondary|no     |no  |
|11 |45 |blue-collar|divorced|secondary|yes    |no  |
|12 |28 |services   |single  |secondary|yes    |no  |
|13 |30 |technician |married |secondary|yes    |no  |
|14 |36 |technician |married |secondary|yes    |yes |
|15 |37 |admin.     |single  |secondary|yes    |yes |
|16 |45 |blue-collar|married |secondary|yes    |no  |
|17 |53 |services   |divorced|primary  |yes    |yes |
+---+---+-----------+--------+---------+-------+----+
Configuration output:
+---------+---------------------------------------------------------------------------------------------------------------------+
|feature  |json                                                                                                                 |
+---------+---------------------------------------------------------------------------------------------------------------------+
|age      |{"name":"fillMissingValues","type":"IntegerType","paras":{"missing_value_type":"null","replaced_value":"45.0"}}      |
|job      |{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"null","replaced_value":"blue-collar"}}|
|education|{"name":"fillMissingValues","type":"StringType","paras":{"missing_value_type":"unknown","replaced_value":"primary"}} |
+---------+---------------------------------------------------------------------------------------------------------------------+
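
As a consistency check on the Method 2 sample, this plain-Python sketch recomputes the mean of the non-null age values from the input table and rebuilds the corresponding configuration row. It only verifies the sample's arithmetic. The variable names are hypothetical, and how the component rounds a non-integer mean for an IntegerType column is not stated in the source.

```python
import json
from statistics import mean

# The age column of the sample dataDF, with None for null.
ages = [59, 56, 41, 55, 54, None, 56, 60, 39, 37, 34, None,
        28, 30, 36, 37, None, 53]

# Mean over the 15 non-null values.
age_mean = float(mean(a for a in ages if a is not None))

# Rebuild the JSON shown in the configuration output for the age column.
config_row = json.dumps(
    {
        "name": "fillMissingValues",
        "type": "IntegerType",
        "paras": {"missing_value_type": "null", "replaced_value": str(age_mean)},
    },
    separators=(",", ":"),
)

# Fill the nulls; the mean is a whole number here, so int() loses nothing.
filled_ages = [int(age_mean) if a is None else a for a in ages]
```

The non-null ages sum to 675 over 15 rows, so the mean is 45.0, which agrees with both the filled age values and the replaced_value recorded in the configuration output.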

