Journal of Jishou University (Natural Sciences Edition) ›› 2021, Vol. 42 ›› Issue (4): 38-43. DOI: 10.13438/j.cnki.jdzk.2021.04.008

• Computer and Communication •

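Method of Obtaining Part-of-Speech Tagging Rules Utilizing FP-Growth Algorithm

MO Liping, HUANG Yongkun

  1. (College of Information Science & Engineering, Jishou University, Jishou 416000, Hunan, China)
  • Online: 2021-07-25 Published: 2021-11-17
  • About the author: MO Liping (born 1972), female, from Anhua, Hunan; professor at the College of Information Science & Engineering, Jishou University; master's degree; research interests include natural language processing, intelligent computing, and their applications.
  • Supported by:
    Special Project for Applied Research on Language and Script of the Hunan Provincial Language Commission (XYJ2019GB09); Natural Science Foundation of Hunan Province (2019JJ40234); Key Scientific Research Project of the Education Department of Hunan Province (19A414)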

Abstract: To improve the quality of the training corpus used by part-of-speech (POS) tagging models, a method is proposed that uses the FP-Growth algorithm to extract POS tagging rules automatically from the training corpus. The method was compared experimentally with an existing Apriori-based method for obtaining POS tagging rules. For small training corpora of 1 000, 2 000, and 10 000 words, both methods obtained the same number of rules, but the FP-Growth-based method took only 0.013 866%, 0.010 399%, and 0.003 132% of the time taken by the Apriori-based method, respectively. For training corpora of 100 000 and 1 000 000 words, the Apriori-based method could not obtain any rules, whereas the FP-Growth-based method still obtained effective rules within a reasonable time. These results indicate that the proposed method is feasible and efficient, and that it meets the practical need to obtain POS tagging rules automatically from corpora of different sizes when optimizing a training corpus.

Key words: part-of-speech tagging rule, corpus, association rule mining, Apriori algorithm, FP-Growth algorithm
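
The abstract names the technique but gives neither the rule format nor the mining details. As a rough illustration only, the Python sketch below mines frequent itemsets with a compact FP-Growth and keeps those containing exactly one tag item as candidate tagging rules. The feature encoding (`word=`, `prev=`, `tag=` items), the toy corpus, the thresholds, and every function name are assumptions made for this example, not the authors' implementation.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted, min_support):
    """Build an FP-tree from (items, count) pairs; return (header, support)."""
    support = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            support[item] += count
    support = {i: s for i, s in support.items() if s >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # item -> every tree node carrying that item
    for items, count in weighted:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in support),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, support


def fp_growth(weighted, min_support, suffix=frozenset()):
    """Return {frozenset(itemset): support} without generating candidates."""
    header, support = build_tree(weighted, min_support)
    found = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        found[itemset] = support[item]
        # Conditional pattern base: the prefix path above each node of `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        found.update(fp_growth(base, min_support, itemset))
    return found


def to_transactions(tagged_sentences):
    """Assumed encoding: one transaction per token (word, previous tag, tag)."""
    transactions = []
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transactions.append({f"word={word}", f"prev={prev}", f"tag={tag}"})
            prev = tag
    return transactions


def tagging_rules(itemsets, min_confidence):
    """Turn itemsets with one tag item into {(context, tag): confidence}."""
    rules = {}
    for itemset, support in itemsets.items():
        tags = [i for i in itemset if i.startswith("tag=")]
        context = itemset - set(tags)
        if len(tags) != 1 or not context:
            continue
        confidence = support / itemsets[context]  # context is frequent too
        if confidence >= min_confidence:
            rules[(context, tags[0].split("=", 1)[1])] = confidence
    return rules


# Hypothetical toy corpus; the paper's corpora are not reproduced here.
corpus = [
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
    [("we", "PRP"), ("can", "MD"), ("go", "VB")],
    [("the", "DT"), ("can", "NN"), ("broke", "VBD")],
    [("they", "PRP"), ("can", "MD"), ("run", "VB")],
]
itemsets = fp_growth([(t, 1) for t in to_transactions(corpus)], min_support=2)
for (context, tag), conf in sorted(tagging_rules(itemsets, 0.8).items(),
                                   key=lambda kv: sorted(kv[0][0])):
    print(" & ".join(sorted(context)), "->", tag, f"(confidence {conf:.2f})")
```

FP-Growth scans the data twice and then works on a compressed prefix tree, whereas Apriori repeatedly generates and counts candidate itemsets. That difference is the usual explanation for a running-time gap of the kind the abstract reports.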
