Huawei Cloud AI Development Platform ModelArts: PMI
Overview
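Takes the output of word segmentation and computes the pointwise mutual information (PMI) between every pair of words in a document. PMI is calculated as follows:
$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$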
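Related concepts:
Co-occurrence pair: if the distance between two words in a sentence is less than or equal to the configured sliding window size, the two words co-occur and form a co-occurrence pair.
P(x, y): x and y are two words; P(x, y) is their co-occurrence probability, equal to the number of times the two words co-occur, N(x, y), divided by the total number of co-occurrence pairs, N.
P(x): the probability that x co-occurs with other words, equal to the number of times x appears across all co-occurrence pairs, N(x), divided by the total number of co-occurrence pairs, N.
P(y) is defined in the same way.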
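The definitions above reduce to simple arithmetic on three counts plus the total N. The following is a minimal sketch rather than the operator's code: the function name `pmi` is hypothetical, and the natural logarithm is an assumption (it is the base that matches the values in the sample output of this page).

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """PMI from raw counts: log( P(x, y) / (P(x) * P(y)) ).

    n_xy    -- N(x, y), number of co-occurrence pairs made of x and y
    n_x     -- N(x), number of times x appears across all pairs
    n_y     -- N(y), number of times y appears across all pairs
    n_total -- N, total number of co-occurrence pairs
    Natural log is assumed here.
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log(p_xy / (p_x * p_y))

# Toy numbers: P(x, y) = 0.1, P(x) = 0.2, P(y) = 0.25, ratio = 2
print(round(pmi(2, 4, 5, 20), 4))
```

A positive value means the two words co-occur more often than their individual frequencies would suggest; a negative value means less often.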
Input
Parameter | Sub-parameter | Description |
---|---|---|
inputs | input_table | Input data table containing the tokenized sentences; required |
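Input parameter description
Parameter name | Description | Requirements |
---|---|---|
doc_col_name | Tokenized text column | string; required; when multiple columns are specified, each column is treated as a separate sentence |
doc_sep | Word separator used in the tokenized column | string; required; default is " " (a single space) |
min_count | Minimum word frequency | integer; optional; default is 5; words whose frequency is below this value are filtered out; if left empty it is treated as 0; value range [0, 2147483647] |
window_size | Sliding window size | integer; optional; defaults to the whole line; value range [1, 2147483647] |
partitions | Number of partitions used when repartitioning the data | integer; optional; value range [1, 5000] |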
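The ranges in the table can be checked before a job is submitted. The sketch below is illustrative only: `check_params` and the plain-dict parameter format are hypothetical and not part of the ModelArts API; only the parameter names and ranges come from the table.

```python
INT_MAX = 2147483647

def check_params(params):
    """Validate a hypothetical parameter dict against the documented ranges.

    Returns a list of error messages; an empty list means the values fit the table.
    """
    errors = []
    # Required string parameters
    for name in ("doc_col_name", "doc_sep"):
        value = params.get(name)
        if not isinstance(value, str) or value == "":
            errors.append(f"{name} is required and must be a non-empty string")
    # Optional integer parameters with inclusive ranges
    ranges = {
        "min_count": (0, INT_MAX),
        "window_size": (1, INT_MAX),
        "partitions": (1, 5000),
    }
    for name, (low, high) in ranges.items():
        value = params.get(name)
        if value is None:
            continue  # optional, not set
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name} must be an integer in [{low}, {high}]")
    return errors

print(check_params({"doc_col_name": "input", "doc_sep": " ", "min_count": 2}))
print(check_params({"doc_col_name": "input", "doc_sep": " ", "partitions": 6000}))
```

The first call returns an empty list; the second reports that partitions is outside [1, 5000].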
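partitions
For large data volumes, a larger partitions value is recommended: about 1000 for 1 million long-text records and about 2000 for 5 million long-text records. If, in either of these two scenarios, the user-defined partitions value is smaller than the required value, the system automatically replaces it with the required value (that is, the 1000 or 2000 mentioned above).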
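One possible reading of this replacement rule is sketched below. The text only gives two data points (1 million and 5 million records), so treating them as tier thresholds is an assumption, as is leaving the user value untouched for smaller data sets. The function name `effective_partitions` is hypothetical.

```python
def effective_partitions(num_rows, user_partitions=None):
    """Return the partitions value that would be used under the assumed rule.

    Assumption: 1,000,000 rows and 5,000,000 rows act as lower bounds of two
    tiers requiring at least 1000 and 2000 partitions respectively.
    """
    if num_rows >= 5_000_000:
        required = 2000
    elif num_rows >= 1_000_000:
        required = 1000
    else:
        return user_partitions  # no documented minimum for smaller data
    if user_partitions is None or user_partitions < required:
        return required  # replaced by the required value
    return user_partitions

print(effective_partitions(1_000_000, 200))    # raised to the required value
print(effective_partitions(5_000_000, 3000))   # user value already large enough
```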
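Resource configuration
For large data volumes, a larger GPU resource configuration is recommended, and the executor memory can be increased. Reference configurations:
cluster 32 configuration:
--executor-memory 8G \
--executor-cores 2 \
--num-executors 14 \
--driver-cores 4 \
--driver-memory 15G \
cluster 64 configuration:
--executor-memory 24G \
--executor-cores 6 \
--num-executors 10 \
--driver-cores 4 \
--driver-memory 15G \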
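Parameter configuration
If the job runs too slowly, consider increasing the resource configuration, or adjust min_count and window_size: a larger min_count and a smaller window_size both reduce the amount of work.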
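To see why a smaller window_size helps, it is enough to count how many pairs a single sentence produces. The sketch below assumes that "distance" means the difference between token positions, which the text does not state explicitly; `pairs_in_sentence` is a hypothetical helper.

```python
def pairs_in_sentence(n_tokens, window_size=None):
    """Number of token pairs (i < j) with j - i <= window_size.

    window_size=None stands for the default (the whole line).
    """
    if window_size is None:
        max_dist = n_tokens - 1
    else:
        max_dist = min(window_size, n_tokens - 1)
    # For each distance d there are n_tokens - d pairs
    return sum(n_tokens - d for d in range(1, max_dist + 1))

for w in (None, 5, 2):
    print(w, pairs_in_sentence(50, w))
```

For a 50-token sentence this gives 1225 pairs for the whole line, 235 for a window of 5, and 97 for a window of 2. Raising min_count has a similar effect because it removes tokens before pairs are formed.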
Output
Parameter | Sub-parameter | Description |
---|---|---|
output | output_port_1 | Name of the output table; the label is dataframe |
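Output table description
Column name | Description |
---|---|
word1 | First word of the co-occurrence pair |
word2 | Second word of the co-occurrence pair |
word1_count | Number of times word1 appears across all co-occurrence pairs |
word2_count | Number of times word2 appears across all co-occurrence pairs |
co_occurrence_count | Number of (word1, word2) co-occurrence pairs |
pmi | PMI value of word1 and word2 |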
Sample
Data input
input_table
input |
---|
Try to try it how to try it |
Need to try it |
You try to do do something |
How can you these days still not try it not do anything |
It is a good chance to try also you can do it |
You are right that it is a good chance to try |
Configuring the workflow
Running the workflow
Input parameters
Output results
word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
---|---|---|---|---|---|
You | a | 11 | 16 | 1 | -0.36646 |
You | chance | 11 | 16 | 1 | -0.36646 |
You | do | 11 | 23 | 2 | -0.03622 |
You | good | 11 | 16 | 1 | -0.36646 |
You | is | 11 | 16 | 1 | -0.36646 |
You | it | 11 | 34 | 1 | -1.12023 |
You | to | 11 | 32 | 2 | -0.36646 |
You | try | 11 | 38 | 2 | -0.53831 |
a | can | 16 | 15 | 1 | -0.67662 |
a | chance | 16 | 16 | 2 | -0.04801 |
a | do | 16 | 23 | 1 | -1.10406 |
a | good | 16 | 16 | 2 | -0.04801 |
a | is | 16 | 16 | 2 | -0.04801 |
a | it | 16 | 34 | 2 | -0.80178 |
a | to | 16 | 32 | 2 | -0.74116 |
a | try | 16 | 38 | 2 | -0.91301 |
a | you | 16 | 15 | 1 | -0.67662 |
can | chance | 15 | 16 | 1 | -0.67662 |
can | do | 15 | 23 | 2 | -0.34638 |
can | good | 15 | 16 | 1 | -0.67662 |
can | is | 15 | 16 | 1 | -0.67662 |
can | it | 15 | 34 | 2 | -0.73724 |
can | not | 15 | 12 | 2 | 0.304211 |
can | to | 15 | 32 | 1 | -1.36977 |
can | try | 15 | 38 | 2 | -0.84847 |
can | you | 15 | 15 | 2 | 0.081068 |
chance | do | 16 | 23 | 1 | -1.10406 |
chance | good | 16 | 16 | 2 | -0.04801 |
chance | is | 16 | 16 | 2 | -0.04801 |
chance | it | 16 | 34 | 2 | -0.80178 |
chance | to | 16 | 32 | 2 | -0.74116 |
chance | try | 16 | 38 | 2 | -0.91301 |
chance | you | 16 | 15 | 1 | -0.67662 |
do | do | 23 | 23 | 1 | -1.46697 |
do | good | 23 | 16 | 1 | -1.10406 |
do | is | 23 | 16 | 1 | -1.10406 |
do | it | 23 | 34 | 2 | -1.16469 |
do | not | 23 | 12 | 2 | -0.12323 |
do | to | 23 | 32 | 3 | -0.6986 |
do | try | 23 | 38 | 4 | -0.58276 |
do | you | 23 | 15 | 2 | -0.34638 |
good | is | 16 | 16 | 2 | -0.04801 |
good | it | 16 | 34 | 2 | -0.80178 |
good | to | 16 | 32 | 2 | -0.74116 |
good | try | 16 | 38 | 2 | -0.91301 |
good | you | 16 | 15 | 1 | -0.67662 |
is | it | 16 | 34 | 2 | -0.80178 |
is | to | 16 | 32 | 2 | -0.74116 |
is | try | 16 | 38 | 2 | -0.91301 |
is | you | 16 | 15 | 1 | -0.67662 |
it | it | 34 | 34 | 1 | -2.2487 |
it | not | 34 | 12 | 2 | -0.5141 |
it | to | 34 | 32 | 7 | -0.24217 |
it | try | 34 | 38 | 8 | -0.28048 |
it | you | 34 | 15 | 2 | -0.73724 |
not | not | 12 | 12 | 1 | -0.16579 |
not | try | 12 | 38 | 2 | -0.62532 |
not | you | 12 | 15 | 2 | 0.304211 |
to | to | 32 | 32 | 1 | -2.12745 |
to | try | 32 | 38 | 8 | -0.21986 |
to | you | 32 | 15 | 1 | -1.36977 |
try | try | 38 | 38 | 1 | -2.47115 |
try | you | 38 | 15 | 2 | -0.84847 |
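As a cross-check, the sample output can be recomputed from the six input sentences with a short pure-Python sketch. This is not the operator's implementation. It rests on several assumptions: min_count is taken to be 2 (the parameter values used for the sample are not given in the text; 2 is inferred from which words survive), window_size is left at the whole line, words are case-sensitive, the two words of a pair are ordered by plain string comparison, and the logarithm is natural. The names `SAMPLE` and `pmi_table` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

SAMPLE = [
    "Try to try it how to try it",
    "Need to try it",
    "You try to do do something",
    "How can you these days still not try it not do anything",
    "It is a good chance to try also you can do it",
    "You are right that it is a good chance to try",
]

def pmi_table(sentences, doc_sep=" ", min_count=2, window_size=None):
    """Return {(word1, word2): (word1_count, word2_count, co_count, pmi)}."""
    docs = [s.split(doc_sep) for s in sentences]
    # Word frequency over the whole input, used for the min_count filter
    freq = Counter(w for doc in docs for w in doc)
    pair_counts, word_counts = Counter(), Counter()
    for doc in docs:
        kept = [w for w in doc if freq[w] >= min_count]
        for i, j in combinations(range(len(kept)), 2):
            if window_size is not None and j - i > window_size:
                continue  # outside the sliding window
            w1, w2 = sorted((kept[i], kept[j]))
            pair_counts[(w1, w2)] += 1
            word_counts[w1] += 1  # each pair counts once for each of its words
            word_counts[w2] += 1
    total = sum(pair_counts.values())  # N, total number of co-occurrence pairs
    return {
        (w1, w2): (word_counts[w1], word_counts[w2], n,
                   math.log(n * total / (word_counts[w1] * word_counts[w2])))
        for (w1, w2), n in pair_counts.items()
    }

table = pmi_table(SAMPLE)
for pair in [("You", "a"), ("it", "try"), ("can", "not")]:
    c1, c2, n, value = table[pair]
    print(pair, c1, c2, n, round(value, 5))
```

Under these assumptions the sketch yields 63 pairs out of N = 122 co-occurrences, and the three printed rows agree with the table: (You, a) gives 11, 16, 1, -0.36646; (it, try) gives 34, 38, 8, -0.28048; (can, not) gives 15, 12, 2, 0.30421.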