Huawei Cloud AI Development Platform ModelArts: Text TF-IDF

May 16, 2023

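Overview

Text TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, and decreases as the word becomes more frequent across the corpus. The Text TF-IDF operator takes the output of a word-frequency count and returns the TF-IDF-weighted result.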
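
As a rough sketch of the weighting described above, one common formulation is shown below. The symbols are introduced here for illustration only: n_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents containing t. This page does not state the exact formula the operator uses.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Some implementations use a smoothed variant, \ln\frac{N + 1}{\mathrm{df}(t) + 1}, so that a term appearing in every document gets an IDF of exactly 0 and no division by zero can occur.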

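Input

Parameter	Sub-parameter	Description
inputs	dataframe	inputs is a dictionary; dataframe is a pyspark DataFrame object, typically the result of a word-frequency count.

Output

Parameter	Sub-parameter	Description
output	output_port_1	output is a dictionary; output_port_1 is a pyspark DataFrame object containing the TF-IDF result.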

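Parameters

Parameter	Required	Description	Default
id_col	Yes	Name of the column that identifies the document ID; only one column may be specified	"id"
word_col	Yes	Name of the word column; only one column may be specified	"word"
count_col	Yes	Name of the count column; only one column may be specified	"count"
doc_count_col	No	Name of the doc_count column (number of documents containing the word)	"doc_count"
total_word_count_col	No	Name of the total_word_count column (total number of words in the document)	"total_word_count"
total_doc_count_col	No	Name of the total_doc_count column (total number of documents)	"total_doc_count"
tf_col	No	Name of the TF column	"tf"
idf_col	No	Name of the IDF column	"idf"
tfidf_col	No	Name of the TF-IDF column	"tfidf"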
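
To make the defaults easier to work with, the sketch below collects them in a dictionary and merges in user overrides. The helper name resolve_columns and its checks are hypothetical and are not part of ModelArts. It only mirrors the parameter table, and the requirement that column names be distinct is an assumption.

```python
# Default column names copied from the parameter table above.
DEFAULT_COLUMNS = {
    "id_col": "id",
    "word_col": "word",
    "count_col": "count",
    "doc_count_col": "doc_count",
    "total_word_count_col": "total_word_count",
    "total_doc_count_col": "total_doc_count",
    "tf_col": "tf",
    "idf_col": "idf",
    "tfidf_col": "tfidf",
}


def resolve_columns(overrides=None):
    """Hypothetical helper: merge user overrides into the default column names."""
    overrides = overrides or {}
    # Reject parameter names that do not appear in the table.
    unknown = sorted(set(overrides) - set(DEFAULT_COLUMNS))
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    resolved = {**DEFAULT_COLUMNS, **overrides}
    # Assumption: two parameters should not map to the same column name.
    names = list(resolved.values())
    if len(set(names)) != len(names):
        raise ValueError("column names must be distinct")
    return resolved


columns = resolve_columns({"tfidf_col": "weight"})
print(columns["tfidf_col"])
```

Only the overridden name changes; every other parameter keeps the default listed in the table.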

Example

Sample data

id	sentence
1	ball ball fun planet galaxy
2	referendum referendum fun planet planet
3	planet planet planet galaxy ball
4	planet galaxy planet referendum ball
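
The operator expects word-frequency output rather than raw sentences. As a minimal sketch in plain Python, the sample sentences can be turned into (id, word, count) rows as follows. This stands in for whatever word-count step precedes the operator in a real workflow, and the function name word_counts is an assumption made for illustration.

```python
from collections import Counter

# Sample documents copied from the table above: (id, sentence)
docs = [
    (1, "ball ball fun planet galaxy"),
    (2, "referendum referendum fun planet planet"),
    (3, "planet planet planet galaxy ball"),
    (4, "planet galaxy planet referendum ball"),
]


def word_counts(documents):
    """Return (id, word, count) rows, one per distinct word in each document."""
    rows = []
    for doc_id, sentence in documents:
        # Whitespace tokenization is enough for this sample data.
        counts = Counter(sentence.split())
        for word, count in counts.items():
            rows.append((doc_id, word, count))
    return rows


rows = word_counts(docs)
for row in rows:
    print(row)
```

Each document contributes one row per distinct word, which gives 14 rows for the four sample sentences.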

Configure the workflow

Run the workflow

Parameter settings

View results

id	word	count	doc_count	total_word_count	total_doc_count	tf	idf	tfidf
1	galaxy	1	3	5	4	0.2	0.223144	0.044629
1	fun	1	2	5	4	0.2	0.510826	0.102165
1	ball	2	3	5	4	0.4	0.223144	0.089257
1	planet	1	4	5	4	0.2	0	0
2	fun	1	2	5	4	0.2	0.510826	0.102165
2	planet	2	4	5	4	0.4	0	0
2	referendum	2	2	5	4	0.4	0.510826	0.20433
3	ball	1	3	5	4	0.2	0.223144	0.044629
3	planet	3	4	5	4	0.6	0	0
3	galaxy	1	3	5	4	0.2	0.223144	0.044629
4	ball	1	3	5	4	0.2	0.223144	0.044629
4	planet	2	4	5	4	0.4	0	0
4	galaxy	1	3	5	4	0.2	0.223144	0.044629
4	referendum	1	2	5	4	0.2	0.510826	0.102165
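
The values in this table are consistent with a smoothed IDF, ln((total_doc_count + 1) / (doc_count + 1)). That is an inference from the numbers, not something this page states. The plain-Python sketch below recomputes the derived columns under that assumption. The function name tfidf_rows is hypothetical, and the code does not use the ModelArts operator itself.

```python
import math
from collections import defaultdict

# (id, word, count) rows taken from the first three columns of the table above
rows = [
    (1, "galaxy", 1), (1, "fun", 1), (1, "ball", 2), (1, "planet", 1),
    (2, "fun", 1), (2, "planet", 2), (2, "referendum", 2),
    (3, "ball", 1), (3, "planet", 3), (3, "galaxy", 1),
    (4, "ball", 1), (4, "planet", 2), (4, "galaxy", 1), (4, "referendum", 1),
]


def tfidf_rows(rows):
    """Compute the derived columns for each (id, word, count) row."""
    total_words = defaultdict(int)  # id -> total_word_count
    doc_count = defaultdict(int)    # word -> number of documents containing it
    for doc_id, word, count in rows:
        total_words[doc_id] += count
        doc_count[word] += 1        # assumes each (id, word) pair appears once
    total_docs = len(total_words)

    result = []
    for doc_id, word, count in rows:
        tf = count / total_words[doc_id]
        # Assumption: smoothed IDF, inferred from the sample output
        idf = math.log((total_docs + 1) / (doc_count[word] + 1))
        result.append({
            "id": doc_id,
            "word": word,
            "count": count,
            "doc_count": doc_count[word],
            "total_word_count": total_words[doc_id],
            "total_doc_count": total_docs,
            "tf": tf,
            "idf": idf,
            "tfidf": tf * idf,
        })
    return result


result = tfidf_rows(rows)
for r in result:
    print(r["id"], r["word"], r["tf"], round(r["idf"], 6), round(r["tfidf"], 6))
```

Because planet appears in all four documents, its smoothed IDF is ln(5/5) = 0, which is why every planet row has a TF-IDF of 0 no matter how often the word occurs.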
