Open Access

Research on News Keyword Extraction Based on TF-IDF and Chinese Features

DOI: 10.23977/fmess.2019.058

Author(s)

Jiapeng Song, Rui Hu, Bingyu Sun, Yin Gu, Wenlin Xiong and Jianqi Zhu

Corresponding Author

Jiapeng Song

ABSTRACT

Keyword extraction underpins corpus construction, text analysis, and information retrieval. For Chinese news text, the traditional TF-IDF algorithm relies too heavily on word frequency and does not account for the particularities of Chinese grammar. This paper describes the characteristics of keywords in Chinese news text. Building on the TF-IDF algorithm, it incorporates Chinese-specific features such as part of speech, word length, and lexical features, and constructs an improved TF-IDF weighting formula that considers these text features together. It also proposes a scoring method for keyword matching, in which keywords that are "cut off" by Chinese word segmentation are recombined into complete keywords. Cross-comparison experiments show that the improved algorithm outperforms the traditional algorithm in accuracy, recall, and F-value, and can extract Chinese keywords correctly and effectively.
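The abstract does not give the improved weighting formula, so the following is only a minimal sketch of the general idea: a baseline TF-IDF score scaled by part-of-speech and word-length factors. The multiplicative form, the names POS_WEIGHT, DEFAULT_POS_WEIGHT and length_factor, all numeric values, and the smoothed IDF variant are assumptions made for illustration and are not taken from the paper. Input documents are assumed to be already segmented into (word, part-of-speech tag) pairs.

```python
import math
from collections import Counter

# Hypothetical part-of-speech multipliers (n = noun, v = verb, a = adjective).
# These values are illustrative assumptions, not the paper's.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}
DEFAULT_POS_WEIGHT = 0.2  # assumed weight for any other tag


def length_factor(length):
    """Hypothetical word-length factor: longer words score higher, capped at 4 characters."""
    return 1.0 + 0.25 * (min(length, 4) - 1)


def tf_idf(words, corpus_vocab):
    """Baseline TF-IDF for each distinct word in `words`.

    `corpus_vocab` is a list of sets, one set of words per corpus document.
    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, so scores stay positive.
    """
    counts = Counter(words)
    n_docs = len(corpus_vocab)
    scores = {}
    for word, count in counts.items():
        tf = count / len(words)
        df = sum(1 for vocab in corpus_vocab if word in vocab)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return scores


def improved_scores(doc, corpus):
    """Scale baseline TF-IDF by the assumed POS and word-length factors.

    `doc` and every element of `corpus` are lists of (word, pos) pairs.
    """
    words = [word for word, _ in doc]
    corpus_vocab = [{word for word, _ in d} for d in corpus]
    base = tf_idf(words, corpus_vocab)
    pos_of = dict(doc)  # if a word appears with several tags, the last one wins
    return {
        word: score
        * POS_WEIGHT.get(pos_of[word], DEFAULT_POS_WEIGHT)
        * length_factor(len(word))
        for word, score in base.items()
    }


# Tiny hand-tagged example corpus (illustrative data only).
example_corpus = [
    [("人工智能", "n"), ("发展", "v"), ("迅速", "a"), ("人工智能", "n")],
    [("经济", "n"), ("发展", "v"), ("稳定", "a")],
    [("体育", "n"), ("比赛", "n"), ("精彩", "a")],
]
example_scores = improved_scores(example_corpus[0], example_corpus)
for word, score in sorted(example_scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

In this sketch the frequent, long noun in the first document ranks above the verb and the adjective, which is the kind of effect the abstract attributes to combining word frequency with part of speech and word length. Recombining keywords split by segmentation is a separate step and is not covered here.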

KEYWORDS

TF-IDF, Chinese news, keyword composition, weight calculation

All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.