Education, Science, Technology, Innovation and Life
Open Access

NBCE: A Neo4j-Based Content Extraction Algorithm in Threat Intelligence Web Pages

DOI: 10.23977/CNCI2020040

Author(s)

Xiaoyang Li, Mengming Li, Rongfeng Zheng, Anmin Zhou and Liang Liu

Corresponding Author

Liang Liu

ABSTRACT

Main content extraction is a technique widely used in web crawlers, search engines and similar systems to extract the main content of a web page while discarding complementary and decorative components. Extracting only the main content allows irrelevant and redundant information to be ignored, which reduces the complexity of data processing and improves the efficiency of further analysis. Existing methods are designed to satisfy the different requirements of various groups. For instance, companies specializing in content extraction tend to focus on efficiency and accuracy, while others may concentrate more on practicality. We propose a Neo4j-based content extraction algorithm (NBCE) for threat intelligence websites. NBCE first transforms the HTML source code into a tree structure. Triples extracted from the HTML tree are then used to construct a graph in a Neo4j database. Finally, by deciding whether each node is a main content node, the main content of the given web page is extracted. The effectiveness of the proposed method is validated through a set of experiments conducted on a threat-intelligence-related database.
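The three stages named in the abstract (HTML to tree, tree to triples, triples to a Neo4j graph, followed by a per-node decision) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `HAS_CHILD` relationship name, the `Element` label, the `extract`, `to_cypher` and `main_content_ids` helpers, and the `min_chars` threshold are all assumptions made for illustration. In particular, the text-length rule is only a stand-in for whatever node classifier NBCE actually uses, and the Cypher text is generated but not sent to a database.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TreeBuilder(HTMLParser):
    """Builds a node table and parent->child triples from HTML source."""

    def __init__(self):
        super().__init__()
        self.nodes = {0: {"tag": "#root", "text": ""}}  # node id -> properties
        self.triples = []  # (parent_id, "HAS_CHILD", child_id); name is hypothetical
        self._stack = [0]  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"tag": tag, "text": ""}
        self.triples.append((self._stack[-1], "HAS_CHILD", node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self.nodes[self._stack[-1]]
            node["text"] = (node["text"] + " " + text).strip()


def extract(html):
    """Stage 1 and 2: parse HTML into a tree and return (nodes, triples)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.triples


def to_cypher(nodes, triples):
    """Stage 3: render the graph as one Cypher query string (not executed here)."""
    lines = [
        f"CREATE (n{i}:Element {{id: {i}, tag: {p['tag']!r}, text_len: {len(p['text'])}}})"
        for i, p in nodes.items()
    ]
    lines += [f"CREATE (n{a})-[:{rel}]->(n{b})" for a, rel, b in triples]
    return "\n".join(lines)


def main_content_ids(nodes, min_chars=40):
    """Stand-in decision rule: a node is 'main content' if it holds enough text."""
    return [i for i, p in nodes.items() if len(p["text"]) >= min_chars]


SAMPLE = (
    "<html><body><nav>Home</nav>"
    "<div><p>A long paragraph describing an indicator of compromise "
    "and the observed attacker behaviour.</p></div>"
    "<footer>Contact</footer></body></html>"
)

nodes, triples = extract(SAMPLE)
query = to_cypher(nodes, triples)
selected = main_content_ids(nodes)
```

On the sample page, the navigation and footer nodes carry only a few characters of text and fall below the threshold, while the paragraph node is selected. A real system would replace `main_content_ids` with a trained classifier over node features and would send `query` to a Neo4j instance through a driver.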

KEYWORDS

Main content extraction; threat intelligence; Neo4j; machine learning

All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.