NBCE: A Neo4j-Based Content Extraction Algorithm in Threat Intelligence Web Pages
DOI: 10.23977/CNCI2020040
Author(s)
Xiaoyang Li, Mengming Li, Rongfeng Zheng, Anmin Zhou and Liang Liu
Corresponding Author
Liang Liu
ABSTRACT
Main content extraction is widely used in web crawlers, search engines, and similar systems to extract the main content of web pages while discarding supplementary and decorative components. Extracting only the main content lets irrelevant and redundant information be ignored, which reduces the complexity of data processing and improves the efficiency of further analysis. Existing methods for this problem are designed to satisfy the differing requirements of various groups. For instance, companies specializing in content extraction tend to focus on efficiency and accuracy, while others may concentrate more on practicality. We propose a Neo4j-based content extraction algorithm (NBCE) for threat intelligence websites. NBCE first transforms the HTML source code into a tree structure. Triples extracted from the HTML tree are then used to construct a graph in a Neo4j database. Finally, by deciding whether each node is a main content node, the main content of the given web page is extracted. The effectiveness of the proposed method is validated through a set of experiments conducted on a threat-intelligence-related database.
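The first two steps described above (HTML to tree, tree to triples) can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TripleExtractor`, the helper `html_to_triples`, the relation name `HAS_CHILD`, and the integer node ids are all assumptions made for illustration, and only the Python standard library is used. Loading the triples into Neo4j and classifying nodes as main content are left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TripleExtractor(HTMLParser):
    """Hypothetical helper: walks HTML and records parent-child triples."""

    def __init__(self):
        super().__init__()
        self.stack = []     # ids of the currently open elements
        self.counter = 0    # next node id to assign
        self.labels = {}    # node id -> tag name
        self.triples = []   # (parent_id, relation, child_id)

    def handle_starttag(self, tag, attrs):
        node_id = self.counter
        self.counter += 1
        self.labels[node_id] = tag
        if self.stack:
            # The innermost open element is the parent of the new node.
            self.triples.append((self.stack[-1], "HAS_CHILD", node_id))
        if tag not in VOID_TAGS:
            self.stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, along with
        # any unclosed descendants left above it on the stack.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.labels[self.stack[i]] == tag:
                del self.stack[i:]
                break


def html_to_triples(html):
    """Return (labels, triples) for an HTML string."""
    parser = TripleExtractor()
    parser.feed(html)
    parser.close()
    return parser.labels, parser.triples


labels, triples = html_to_triples(
    "<html><body><div><p>Report</p><p>IOC</p></div><br></body></html>"
)
```

Each triple could then be written to a graph database as one relationship between two element nodes. The relation name and the choice of what to store on each node (tag, text length, link density, and so on) are design decisions the abstract does not specify.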
KEYWORDS
Main content extraction; threat intelligence; Neo4j; machine learning