NBCE: A Neo4j-Based Content Extraction Algorithm in Threat Intelligence Web Pages

Xiaoyang Li; Mengming Li; Rongfeng Zheng; Anmin Zhou and Liang Liu

doi:10.23977/CNCI2020040

Corresponding Author

Liang Liu

ABSTRACT

Main content extraction is a technique widely used in web crawlers, search engines and similar systems to extract the main content of a web page while discarding complementary and decorative components. Extracting the main content allows irrelevant and redundant information to be ignored, which reduces the complexity of data processing and improves the efficiency of further analysis. Existing solutions to this problem are designed to meet the differing requirements of various groups. For instance, companies that specialize in content extraction typically focus more on efficiency and accuracy, while others may concentrate more on practicality. In this paper, we present a Neo4j-based content extraction algorithm (NBCE) for threat intelligence websites. The NBCE algorithm first transforms the HTML source code into a tree structure. Triples extracted from the HTML tree are then used to construct a graph in a Neo4j database. Finally, by deciding whether or not each node is a main content node, the main content of the given web page is extracted. The effectiveness of the proposed method is validated through a set of experiments conducted on a threat-intelligence-related database.
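The first two steps described in the abstract (HTML to tree, tree to triples for a graph) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the triple format, so the `HAS_CHILD` relation, the `Element` label, the `text_len` property, and the names `TripleExtractor` and `to_cypher` are all assumptions made for illustration. The sketch only builds Cypher statement strings and does not connect to a Neo4j instance; the final step, deciding which nodes are main content nodes, is not covered because the abstract gives no details on it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TripleExtractor(HTMLParser):
    """Walks HTML and records (parent_id, 'HAS_CHILD', child_id) triples.

    The triple shape is a hypothetical choice, not taken from the paper.
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # ids of the currently open elements
        self.nodes = {}     # node id -> {"tag": str, "text": str}
        self.triples = []   # (parent_id, relation, child_id)
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"tag": tag, "text": ""}
        if self.stack:
            self.triples.append((self.stack[-1], "HAS_CHILD", node_id))
        if tag not in VOID_TAGS:
            self.stack.append(node_id)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag name,
        # which tolerates unclosed inner elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach non-whitespace text to the innermost open element.
        if self.stack and data.strip():
            self.nodes[self.stack[-1]]["text"] += data.strip() + " "


def to_cypher(nodes, triples):
    """Builds Cypher statements (as strings) for the nodes and edges."""
    stmts = [
        f"CREATE (:Element {{id: {i}, tag: '{n['tag']}', "
        f"text_len: {len(n['text'])}}})"
        for i, n in nodes.items()
    ]
    stmts += [
        f"MATCH (a:Element {{id: {s}}}), (b:Element {{id: {o}}}) "
        f"CREATE (a)-[:{rel}]->(b)"
        for s, rel, o in triples
    ]
    return stmts


if __name__ == "__main__":
    parser = TripleExtractor()
    parser.feed("<html><body><div><p>Hello</p><br></div></body></html>")
    for statement in to_cypher(parser.nodes, parser.triples):
        print(statement)
```

Storing only the tag name and text length keeps the generated statements free of quoting problems; a real loader would more likely pass node properties as query parameters through a Neo4j driver rather than format them into strings.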

KEYWORDS

Main content extraction; threat intelligence; Neo4j; machine learning
