Research on News Keyword Extraction Based on TF-IDF and Chinese Features
DOI: 10.23977/fmess.2019.058
Author(s)
Jiapeng Song, Rui Hu, Bingyu Sun, Yin Gu, Wenlin Xiong and Jianqi Zhu
Corresponding Author
Jiapeng Song
ABSTRACT
Keyword extraction is fundamental to corpus construction, text analysis, and information retrieval. For Chinese news text, the traditional TF-IDF algorithm depends too heavily on word frequency and does not handle the particularities of Chinese grammar accurately. This paper describes the characteristics of keywords in Chinese news text. Building on the TF-IDF algorithm, it incorporates Chinese-specific features such as part of speech, word length and lexical features, and constructs an improved TF-IDF weighting formula that takes these text features into account. A scoring method for keyword matching is also proposed, so that keywords "cut off" by Chinese word segmentation are recombined into complete keywords. Cross-comparison experiments show that the improved algorithm outperforms the traditional algorithm in accuracy, recall and F value, and can extract Chinese keywords correctly and effectively.
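The abstract does not state the improved weighting formula, so the following is only a minimal sketch of the general idea: compute baseline TF-IDF, then scale each score by feature-based factors. The multipliers in `POS_WEIGHT`, the `length_bonus` parameter, and the function names are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical part-of-speech multipliers (illustrative, not from the paper):
# nouns are favoured over verbs and adjectives.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}


def tf_idf(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def feature_weighted(doc_scores, pos_tags, length_bonus=0.1):
    """Scale tf-idf scores by an assumed POS multiplier and word-length factor."""
    return {
        term: score
        * POS_WEIGHT.get(pos_tags.get(term, ""), 0.2)  # 0.2 for unknown tags
        * (1 + length_bonus * len(term))
        for term, score in doc_scores.items()
    }
```

With already-segmented documents and a term-to-tag mapping as input, `feature_weighted(tf_idf(docs)[0], pos_tags)` returns adjusted scores for the first document. Under these assumed weights a frequent noun can outrank a rarer verb that plain TF-IDF would place first.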
KEYWORDS
TF-IDF, Chinese news, keyword composition, weight calculation