Education, Science, Technology, Innovation and Life
Open Access
Research on Hive Integration Application of Data Warehouse Based on Big Data Platform

DOI: 10.23977/ICAMCS2022.033

Author(s)

Haixia Wang, Shuguang Cui

Corresponding Author

Shuguang Cui

ABSTRACT

Hive queries have high latency for two reasons: there is no index, so the whole table must be scanned, and each HQL statement is converted into a MapReduce program before it runs, which delays execution further. A conventional database has lower latency by comparison, but when the data scale is very large, Hive's parallel computing shows its advantages. This paper studies the integration of Hive into a data warehouse built on a big data platform. Based on the analysis and design of a Hive-based online learning data warehouse, it puts forward a concrete implementation scheme that provides high scalability by combining the virtualization technology of a university cloud platform. Because data arrive from different sources and in different formats, it is generally necessary to customize data extraction and conversion tools, write data cleaning programs, check the consistency of the data, and load the complete data into the data warehouse environment on a regular basis. Alongside the ETL implementation, and because all tables in Hive are stored in date partitions, this paper also implements a script that deletes fixed partitions and executes it through Shell commands.
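The partition-deletion step can be illustrated with a minimal sketch. It is not the authors' script: the table name `ods_learning_log`, the partition column `dt`, and the `YYYY-MM-DD` partition value format are all assumptions made for illustration. The sketch only builds the HQL statements and the `hive -e` command line that a shell would run, without executing anything.

```python
from datetime import date, timedelta


def drop_partition_statements(tables, start, end, partition_col="dt"):
    """Build one ALTER TABLE ... DROP PARTITION statement per table and day.

    The date range is inclusive. The partition column name and the ISO date
    format are assumptions; adjust them to the real table layout.
    """
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    days = (end - start).days + 1
    statements = []
    for table in tables:
        for offset in range(days):
            day = (start + timedelta(days=offset)).isoformat()
            statements.append(
                f"ALTER TABLE {table} DROP IF EXISTS "
                f"PARTITION ({partition_col}='{day}');"
            )
    return statements


def hive_command(statements):
    """Return the argument list for running the statements with `hive -e`."""
    return ["hive", "-e", " ".join(statements)]


if __name__ == "__main__":
    # Hypothetical table name, used only to show the generated HQL.
    stmts = drop_partition_statements(
        ["ods_learning_log"], date(2022, 1, 1), date(2022, 1, 2)
    )
    print("\n".join(stmts))
```

On a machine with the Hive CLI installed, the list returned by `hive_command` could be passed to `subprocess.run(..., check=True)`. Using `DROP IF EXISTS` keeps the script safe to rerun when a partition has already been removed.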

KEYWORDS

Big data; Data warehouse; Hive

All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.