Application of Data Mining Technology in the Recall of Defective Automobile Products in China: A Typical Case of the Construction of Digital China

Key quality safety factors of defective automobile products are extracted from multi-source quality safety data, a defect information indicator system for automobile products is systematically constructed, and a correlation graph between quality safety factors is established. Building on the optimized and correlated indicator system, big data technology is used to design a data structure for clustering multi-source quality safety information and to develop a data platform for analyzing defect information of automobile products. The platform performs information clustering and correlation analysis on multi-source quality safety data, providing technical support for the recall management of defective automobile products.


I. INTRODUCTION
The American quality management specialist Joseph Juran proposed that the 21st century would be the "century of quality", and that the basic responsibilities of government in quality management cover measurement; standards; products and activities involving the safety and health of citizens; and activities involving national security, such as currency, exports and government procurement. In terms of management, government departments should mainly regulate the products and activities involving the safety and health of citizens. The automobile industry is an important economic sector in China, and automobiles are closely tied to the daily life of ordinary people, so ensuring the quality safety of automobile products, and detecting and eliminating defects in a timely and effective manner, has been an important function of government regulation. The recall of defective automobile products is an important means by which the government exercises post-market regulation of product quality. China began recalling automobiles in 2004, and recalls have since involved 56.7377 million vehicles, including 307 recalls prompted by defect investigations that involved 33.0214 million vehicles, accounting for 38% of the total. These recalls have effectively safeguarded the personal and property safety of consumers and maintained public safety.

II. MANAGEMENT PLATFORM FOR DEFECTIVE AUTOMOBILE PRODUCTS
The National Defective Automobile Product Recall Integrated Management Information Platform has been under design and development since 2004. From 2012 to 2016, the platform was comprehensively optimized and improved. It is composed of multiple subsystems, including an automobile enterprise filing information management system, an automobile defect information acquisition system, an automobile recall reporting and data management system, an automobile product defect investigation and management system, an overseas automobile recall system, and a public opinion acquisition and information monitoring system. The platform also provides public services such as the release of news and warning information on defective product recalls, consumer reporting of product defects, and queries on whether a vehicle falls within the scope of a recall. In 2017, website visits reached 6.64 million, and WeChat users totaled more than 180,000.
With the rapid growth of recall data for defective automobile products in China, data resources such as defect information collection data, filing data, recall data and overseas warning information are increasing rapidly. The data may come from enterprises, consumers, local quality inspection departments, technical institutions, research institutes and other sources, and arrives in various formats, including formatted documents, video and audio. Owing to limitations of the original system architecture, multi-source data could not be interconnected. Meanwhile, interface calls between multi-source data and business systems have increased rapidly with the growing demand for user data queries and services. In the context of "Internet+", how to serve the government and consumers with automobile recall data presents both new challenges and new opportunities. The National Defective Automobile Product Recall Integrated Management Information Platform described in this paper integrates filing and acquisition information from the business systems, product recall information, accident investigation information and injury surveillance information into a defective product data warehouse, interconnecting multi-source data and eliminating the information silos between business systems.
The business demands for data integration and data mining of automobile products center on building a big data platform on top of the data resources of these business systems, strengthening the development, utilization and in-depth application of those resources, and setting up a data warehouse using big data concepts, technology and resources. The goals are to interconnect multi-source, diverse data; reduce data calls between business systems; provide a standardized mode of information analysis and processing; label consultation case data intelligently and normalize the handling of cases; provide visualized services for data query and statistics; and offer secure interfaces for information release and sharing, all while safeguarding the secure use of data resources. These demands mainly include:
(1) Connect the existing databases and unstructured data, fully associate and link the data, establish a data warehouse by means of big data processing, and support correlated query, statistics, and visual, dynamic presentation of commonly used data.
(2) Using the defect label dictionary, intelligently recommend and attach dictionary labels to defect reports, technical service announcements, domestic and overseas recalls and online public opinion on the basis of their defect descriptions, so as to standardize the data and process it intelligently.
(3) Correlate and link automobile defect data in the data warehouse and automatically integrate data on automobile products through specific associations, so as to provide data support for the automatic processing of defect complaint and consultation data.
(4) According to the disclosure needs for automobile product defect information and recall information, automatically compile statistics on defect complaints, complaint handling, sales data and recall data, so that key data can be displayed visually.
(5) Replace the cross-calling of data between existing business systems, establish universal and personalized data interfaces that meet the needs of social services, and bring the latency of data interaction down to the millisecond level, providing technical support for the comprehensive development of social data services.
(6) Develop big data analysis functions such as vehicle and owner portraits for the automobile product recall business, supporting the recall business and building a reserve for future social services.
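As a rough illustration of demand (4), the following minimal sketch compiles per-model statistics from complaint, sales and recall records. The record layout, the field names (`model`, `handled`, `vehicles`) and the "complaints per 10,000 vehicles sold" indicator are assumptions made for this example; they are not the platform's actual schema or metrics.

```python
from collections import Counter

# Hypothetical records; field names are illustrative, not the platform's schema.
complaints = [
    {"model": "A1", "handled": True},
    {"model": "A1", "handled": False},
    {"model": "B2", "handled": True},
]
sales = {"A1": 20000, "B2": 5000}  # units sold per model
recalls = [{"model": "A1", "vehicles": 1200}]


def summarize(complaints, sales, recalls):
    """Return per-model complaint, handling, sales and recall figures."""
    n_complaints = Counter(c["model"] for c in complaints)
    n_handled = Counter(c["model"] for c in complaints if c["handled"])
    recalled = Counter()
    for r in recalls:
        recalled[r["model"]] += r["vehicles"]

    summary = {}
    for model in sorted(set(n_complaints) | set(sales) | set(recalled)):
        sold = sales.get(model, 0)
        summary[model] = {
            "complaints": n_complaints[model],
            "handled": n_handled[model],
            "sold": sold,
            "recalled": recalled[model],
            # Complaints per 10,000 vehicles sold; None when sales are unknown.
            "complaints_per_10k": (
                round(n_complaints[model] / sold * 10000, 2) if sold else None
            ),
        }
    return summary


summary = summarize(complaints, sales, recalls)
```

A table like `summary` is the kind of pre-aggregated result that a dashboard could render directly, instead of each business system being queried separately at display time.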

IV. SOLUTION AND OVERALL ARCHITECTURE
The National Defective Product Recall Data Application Platform is built on Baifendian's BDOS system and is an integrated solution for collecting, integrating, processing, storing and consuming data for the automobile defective product recall business. By encapsulating the underlying components, the platform provides a visual operation interface that greatly lowers the barrier to using big data technology, and it provides association, query, statistics and display services for data across business systems.

A. BDOS Technology
The Big Data Operating System (BDOS) is an open architecture that integrates open source components such as Hadoop, Storm, Spark, HDFS and HBase, encapsulating and enhancing them to provide stable storage, processing and analysis of massive data. BDOS is compatible with mainstream Hadoop distributions (Cloudera CDH, Hortonworks HDP and the Apache community version), which ensures broad availability. BDOS provides multi-type, multi-mode data acquisition modules, distributed real-time and offline computing frameworks, a built-in script development IDE and standardized open interfaces. A wide range of applications can be built on top of BDOS, such as data warehouses, user portraits, knowledge graphs, deep learning and text analysis. Relying on BDOS's distributed, highly available coordination services, the platform can analyze massive data effectively, release the potential value of core data and achieve end-to-end, closed-loop data applications.
The National Defective Product Recall Data Application Platform relies mainly on BDOS's multi-source heterogeneous data integration, data factory and related functions. The platform uses the collection function provided by BDOS to access multi-source heterogeneous data from the various business systems in a uniform way and save it to HDFS. The data types include structured data from SQL Server, Oracle and other business databases, as well as semi-structured and unstructured data such as Excel, Word and PDF files and images. The platform can apply batch processing to terabytes and even petabytes of data using MapReduce, Hive and other technologies. After access, data is scrubbed, linked and integrated in BDOS's data factory. The data factory provides a complete big data solution that covers data development, management and analysis, so users can process massive raw data into valuable market data in a short time. The system schedules and manages cluster resources automatically, so users need not attend to the implementation, operation and maintenance details of the cluster. Once the data processing task nodes are established, users can arrange them into workflows through a visual drag-and-drop interface, with unified scheduling and monitoring. Users can independently manage job deployment, job priority, and production monitoring and operation. The platform adopts multi-level data storage and access security mechanisms such as triple backup, read-write request authentication, data desensitization and encryption, and provides simple, user-friendly and professional data security services for the system's data development and data applications, ensuring the security and privacy of user data during storage, transmission and use.
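The idea of arranging task nodes into a workflow with unified scheduling can be pictured as a dependency graph executed in topological order. The sketch below is only an illustration using Python's standard `graphlib` module; the task names and the `run_workflow` helper are hypothetical, and BDOS's own scheduler is not described in enough detail here to reproduce.

```python
from graphlib import TopologicalSorter

# Hypothetical task nodes: each name maps to the set of tasks it depends on.
workflow = {
    "collect": set(),
    "scrub": {"collect"},
    "integrate": {"scrub"},
    "label_defects": {"integrate"},
    "build_statistics": {"integrate"},
    "publish": {"label_defects", "build_statistics"},
}


def run_workflow(graph, actions):
    """Run each task only after all of its dependencies have finished."""
    executed = []
    for task in TopologicalSorter(graph).static_order():
        actions[task]()
        executed.append(task)
    return executed


# Stand-in actions that just record their own name when they run.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in workflow}
order = run_workflow(workflow, actions)
```

A real scheduler would additionally run independent nodes (here `label_defects` and `build_statistics`) in parallel, retry failures and report status; this sketch shows only the ordering constraint.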

B. System Implementation
The logical architecture of the National Defective Automobile Product Recall Integrated Management Information Platform includes data sources, data access, data distribution, data storage and processing, and data services. The platform implements functions including correlation and integration of multi-source heterogeneous data across business systems; standardized, centralized processing of data for query and statistics; secure interfaces for information release and data sharing; and, for the consultation of automobile product defect information, correlation and integration of multi-source data, reconstruction of the consultation database, labeling of historical consultation cases, retrieval from the consultation case library, and visual display. The business architecture of the platform is divided into five layers: the SRC, ODS, DW, DM and APP layers. The data generation layer is the data source of the platform, which accesses data from the defect information acquisition system, the information filing system, the recall reporting and data management system, the public opinion monitoring system, overseas recalls and other sources. The data exchange layer acquires data from the data generation layer; the data currently available includes structured data as well as PDF, Word and image files. The data storage and computing layer is decomposed into four segments: the historical archived data segment, the total data segment, the theme data segment, and the management analysis and application data segment. Each data segment is paired with a suitable data processing mode to meet the data application requirements of the core business. The data application layer implements business applications on top of the different data segments.
The application layer is divided into three parts: historical query applications, big data analysis applications and management analysis applications. The user access layer is organized around the different kinds of personnel and supports their requirements for using the data platform. These users are mainly operation and maintenance or management and control personnel, business personnel, management personnel, decision-making personnel and data scientists. The overall architecture of the National Defective Automobile Product Recall Integrated Management Information Platform is shown in Fig. 2.
In the first stage, the platform sorts out the internal data of each business system, establishes the ER relationships and data dictionary between the databases of the different business systems, connects the existing databases and unstructured data, fully associates the data, applies big data technology to build the data warehouse, supports correlated query and statistical display of the major business data, and establishes a unified platform. In the second stage, the platform consolidates its data capability: it establishes a visual data operating platform, centrally manages structured and unstructured data, establishes universal and personalized data interfaces that meet the needs of social services, and brings the latency of data interaction down to the millisecond level, providing technical support for the comprehensive development of social data services. It also deepens business support. In line with the characteristics of the automobile product recall business, it replaces the existing mode in which business systems cross-call each other's data and each provide their own social services directly, so that the data platform processes data and connects to the service windows in a unified way, ensuring the security and interaction efficiency of the association and integration of multi-source heterogeneous data. In the third stage, the platform manages the data uniformly, establishes life-cycle management of the data and connects the data processes of the automobile product recall business. It also develops big data modeling and analysis functions and products, such as a defect prediction model for automobile products, a defect prediction model for key consumables, and quality portraits of vehicles and key consumables. These provide data support for the recall of defective automobile products, build a reserve for future social services, deliver data query, statistics and visualization services, improve business efficiency and release the value of the data.
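To make the layering concrete, the sketch below passes a few records through the five layers. It assumes the common data warehouse reading of the abbreviations (source, operational data store, data warehouse, data mart, application), which the text does not spell out; the two source schemas, the field names and the helper functions are all hypothetical.

```python
from collections import defaultdict

# SRC: hypothetical raw records from two business systems with different schemas.
src = {
    "defect_reports": [
        {"Model": " a1 ", "Part": "Brake"},
        {"Model": "A1", "Part": "brake"},
    ],
    "recall_reports": [{"vehicle_model": "A1", "component": "Airbag"}],
}


def to_ods(src):
    """ODS: land records unchanged, tagged with their source system."""
    return [{"source": name, "raw": rec} for name, recs in src.items() for rec in recs]


def to_dw(ods):
    """DW: map each source schema onto one cleaned, unified schema."""
    field_map = {
        "defect_reports": ("Model", "Part"),
        "recall_reports": ("vehicle_model", "component"),
    }
    rows = []
    for rec in ods:
        model_key, part_key = field_map[rec["source"]]
        rows.append({
            "source": rec["source"],
            "model": rec["raw"][model_key].strip().upper(),
            "component": rec["raw"][part_key].strip().lower(),
        })
    return rows


def to_dm(dw):
    """DM: build a theme table, here record counts per (model, component)."""
    counts = defaultdict(int)
    for row in dw:
        counts[(row["model"], row["component"])] += 1
    return dict(counts)


def app_query(dm, model):
    """APP: answer a business query from the theme table."""
    return {comp: n for (m, comp), n in dm.items() if m == model}


dm = to_dm(to_dw(to_ods(src)))
result = app_query(dm, "A1")
```

Keeping the raw copy in the ODS step and cleaning only in the DW step means that a change to the cleaning rules can be replayed from the landed data without re-reading the business systems.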

V. INNOVATIVENESS OF DATA PLATFORM
In line with the principle that "food and fodder should go ahead of troops and horses", the platform provides data support and a data reserve for the government to build a big data regulation model, strengthen in-process and post-process regulation, and improve scientific decision-making and risk anticipation in recalls and emergencies. Built with data mining technology, it also supports in-depth mining, defect investigation, information disclosure and social data services. The platform is innovative within China's automobile industry in three respects. Firstly, it integrates the multi-source quality safety information held in automobile product databases and unstructured data, adopts Hadoop and other big data technologies to establish a quality safety information data warehouse for automobile products, fully associates and integrates the data, and constructs a knowledge graph of typical failures of automobile products. Secondly, it processes multi-source quality safety information intelligently: it automatically segments into terms the failures described in consumer complaints, automobile technical service announcements, and domestic and overseas recall information, as well as the titles of online public opinion on vehicles; it then automatically matches them against the failure assemblies and typical failure labels in the failure expert knowledge base, and recommends the best-matching failure labels and severity level so that data analysis engineers can standardize the defect information. This improves the processing efficiency of defect information by more than 40%. Thirdly, it establishes a quantitative evaluation system for product safety risk levels. At present, automobile safety information analysis relies mainly on owner complaints, so the data source for assessing automobile safety risks is narrow and the evaluation indicator system is incomplete. By clustering and associating multi-source quality safety data on automobile products, the data platform forms typical failure cases, extracts the key quality safety factors of defective automobile products and systematically constructs the defect information indicator system, providing data support for the technical investigation and research of defects in automobile products.
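The label recommendation step can be illustrated with a simple keyword-overlap score. This is a minimal sketch under stated assumptions, not the platform's matching algorithm, which is not specified here: the label dictionary, its keywords, and the `tokenize` and `recommend_labels` helpers are hypothetical, and the whitespace-style tokenizer works only for English text, whereas Chinese descriptions would first need a word segmentation step.

```python
import re

# Hypothetical excerpt of a failure label dictionary: label -> keywords.
LABELS = {
    "brake failure": {"brake", "braking", "pedal"},
    "engine stall": {"engine", "stall", "stalls", "shutdown"},
    "airbag non-deployment": {"airbag", "deploy", "deployment"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def recommend_labels(description, labels=LABELS, top_n=2):
    """Rank labels by the share of their keywords found in the description."""
    tokens = tokenize(description)
    scored = []
    for label, keywords in labels.items():
        score = len(tokens & keywords) / len(keywords)
        if score > 0:
            scored.append((label, round(score, 2)))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_n]


suggestions = recommend_labels(
    "Brake pedal goes soft and the engine stalls at low speed"
)
```

The function only recommends candidates with a score; as in the workflow described above, the final choice of label would remain with the data analysis engineer.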
At the technical level, the National Defective Automobile Product Recall Integrated Management Information Platform collects and integrates multi-source, multi-format data using HDFS, YARN, MapReduce, Hive, ElasticSearch and other components of the Hadoop ecosystem, avoiding data silos between business systems. The platform provides one-stop, full-link tools for big data life-cycle management and a complete data governance mechanism for the unified storage, scrubbing and processing of massive multi-source heterogeneous data. The platform also supports metadata management, so that the total volume, increment and health of the data, and the relationships between different data sources, can be monitored visually. On the basis of the data it manages, the platform develops multi-dimensional data analysis and modeling and mines the value of the data, providing multi-angle, multi-layer data support for the recall of automobile products. The platform uses key information such as defect labels to link consumers, vehicles and producers, so that users can locate defect complaints, information filings, recall reports, public opinion information and other associated information from multiple data sources through a unified query function.
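The unified query described above can be pictured as an index keyed on defect labels. The sketch below is an in-memory stand-in written for illustration only; the platform itself is stated to use components such as ElasticSearch, and the record layout, identifiers and helper names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical labeled records drawn from several data sources.
records = [
    {"source": "complaint", "id": "C-1", "labels": ["brake failure"]},
    {"source": "filing", "id": "F-7", "labels": ["brake failure"]},
    {"source": "recall", "id": "R-3", "labels": ["brake failure", "engine stall"]},
    {"source": "public_opinion", "id": "O-9", "labels": ["engine stall"]},
]


def build_label_index(records):
    """Map each defect label to the records that carry it."""
    index = defaultdict(list)
    for rec in records:
        for label in rec["labels"]:
            index[label].append(rec)
    return index


def unified_query(index, label):
    """Return the ids of all records sharing a label, grouped by source."""
    grouped = defaultdict(list)
    for rec in index.get(label, []):
        grouped[rec["source"]].append(rec["id"])
    return dict(grouped)


index = build_label_index(records)
hits = unified_query(index, "brake failure")
```

Because every source is labeled with the same dictionary, one lookup returns complaints, filings and recalls side by side, which is the cross-source association the paragraph describes.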

VI. CONCLUSION
In recent years, China's vehicle sales have shown a steady upward trend. In 2017, automobile sales in China exceeded 28 million units and more than 20 million defective vehicles were recalled, effectively protecting the personal and property safety of consumers. The acquisition, analysis and processing of vehicle safety information provide a data foundation for determining defects in automobile products. In 2017, nearly all of AQSIQ's defect investigation clues for automobile products came from the data platform's analysis of multi-source quality safety information, greatly improving AQSIQ's management and decision-making capabilities in automobile recalls. Relying on the technology and data of the platform, quality safety information query services for automobile products are now provided to the public on the website of the Central People's Government of the PRC, the related AQSIQ website, the website of the defective product management center, and WeChat. These services contribute to meeting consumers' need for recall queries, facilitating the effective implementation of recalls, strengthening the prevention of quality safety risks, enhancing product quality and promoting the sound and orderly development of the market economy. Statistics show that China's automobile enterprises have invested over 20 billion yuan in recall activities, and the preliminary application of the data platform has achieved remarkable social and economic benefits.