Research on vegetable bundling decisions based on K-means cluster analysis

With rising living standards, buying vegetables in fresh-produce supermarkets has become a trend. In practice, supermarkets often bundle fresh produce to increase revenue, so the degree of association between vegetable categories and the correlation between individual vegetable products are important inputs to bundling decisions. This paper uses Spearman correlation analysis and K-means cluster analysis to study the sales volume and sales patterns of each vegetable category and individual product, and uses time series analysis to examine their seasonal sales patterns. It finds that sales volume often reaches its maximum in winter. Finally, the optimal bundling decision and a seasonal replenishment strategy are derived from the correlations among vegetable categories and individual items and from the winter sales maximum.


Introduction
The proportion of fresh agricultural products in the sales of major supermarket chains has gradually increased, and these products have become the core commodity with which supermarket chains attract customers. In practice, fresh agricultural products are often bundled with each other to increase supermarket revenue. It is therefore important to study the degree of association between agricultural products so that supermarkets can make bundling decisions. In this paper, Spearman's correlation coefficient is used to analyse the degree of association between vegetable categories in order to determine the general direction of bundled sales. For individual vegetable items, the number of items is too large and the data too sparse for Spearman correlation analysis to estimate the pairwise correlation coefficients accurately. This paper therefore adopts K-means cluster analysis to classify the individual items into three major categories, and concludes that items classified into the same category are more strongly correlated with each other, which gives supermarkets a basis for bundling decisions.

Total sales by category
Firstly, each individual product is assigned to its category, and the distribution of sales volume across categories is analysed with basic statistics. Using the sales data from 1 July 2020 to 30 June 2023, the individual products are grouped and summed by category to obtain the total sales volume of each vegetable category. The share of each category in total sales is shown in Fig. 1. The flower and leaf category has the largest sales, accounting for 42 per cent of total sales, and the eggplant category has the smallest, accounting for 5 per cent.
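The grouping and share calculation described above can be sketched as follows. This is a minimal illustration, not the authors' workflow: the records, item names and volumes are hypothetical, and only the category names "flower and leaf", "eggplant" and "cauliflower" come from the text.

```python
from collections import defaultdict

# Hypothetical records: (item name, category, sales volume in kg).
records = [
    ("item A", "flower and leaf", 420.0),
    ("item B", "flower and leaf", 180.0),
    ("item C", "eggplant", 50.0),
    ("item D", "cauliflower", 350.0),
]

def category_shares(rows):
    """Return {category: (total_kg, share_of_grand_total)}."""
    totals = defaultdict(float)
    for _item, category, kg in rows:
        totals[category] += kg
    grand_total = sum(totals.values())
    return {c: (t, t / grand_total) for c, t in totals.items()}

shares = category_shares(records)

# Print categories from largest to smallest total sales.
for category, (total, share) in sorted(shares.items(), key=lambda kv: -kv[1][0]):
    print(f"{category}: {total:.1f} kg ({share:.0%})")
```

The shares always sum to 1, so they can be passed directly to a pie chart such as the one in Fig. 1.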

Time distribution by category
The sales volume of vegetable items is often correlated with time [1]. A correlation analysis between sales quantity and time is therefore made for each category. Based on how the sales quantity of each category changes over time, a graph of sales quantity against time is plotted.

Figure 2: Sales volume by category over time
The horizontal axis of Fig. 2 is the number of days of sales, where 0 represents 1 July 2020, and the vertical axis is the sales volume (kg) on that day. The graph shows that, starting from July of each year, the sales volume of each category first increases and then decreases, reaching its maximum in the winter of each year, which indicates strong seasonality [2]. Comparing the graphs of the six categories vertically, this paper finds that sales tend to reach their maximum in winter.
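The day index on the horizontal axis and a simple seasonal comparison can be sketched as below. The daily figures are hypothetical, and the month-to-season mapping (northern-hemisphere meteorological seasons) is an assumption, since the paper does not state how winter is delimited.

```python
from collections import defaultdict
from datetime import date

START = date(2020, 7, 1)  # day 0 on the horizontal axis of Fig. 2

# Assumption: northern-hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def day_index(d):
    """Number of days elapsed since 1 July 2020."""
    return (d - START).days

def mean_daily_sales_by_season(daily):
    """daily: iterable of (date, kg). Returns {season: mean kg per recorded day}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, kg in daily:
        season = SEASON_BY_MONTH[d.month]
        sums[season] += kg
        counts[season] += 1
    return {s: sums[s] / counts[s] for s in sums}

# Hypothetical daily sales of one category.
daily_sales = [
    (date(2020, 7, 1), 100.0),
    (date(2020, 10, 15), 180.0),
    (date(2021, 1, 15), 300.0),
    (date(2021, 1, 16), 340.0),
    (date(2021, 4, 10), 150.0),
]

season_means = mean_daily_sales_by_season(daily_sales)
peak_season = max(season_means, key=season_means.get)
print(peak_season, season_means[peak_season])
```

Averaging per recorded day rather than summing keeps seasons with different numbers of observations comparable.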

Total sales of each individual product
The items were ranked by total sales volume; the top 10 items are shown in Table 1.

Average daily sales volume of individual products
The daily sales data were filtered in an Excel spreadsheet to identify the vegetables with the largest average daily sales; the results are shown in Table 2.
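The same ranking can be reproduced outside a spreadsheet. The sketch below assumes that "average daily sales" means total volume divided by the number of distinct days on which the item was sold; the paper does not state the denominator, and the rows are hypothetical.

```python
from collections import defaultdict

def average_daily_sales(rows):
    """rows: iterable of (item, day_index, kg).

    Multiple transactions on the same day are summed first; the average is
    then taken over the days on which the item has sales (an assumption).
    """
    per_day = defaultdict(lambda: defaultdict(float))
    for item, day, kg in rows:
        per_day[item][day] += kg
    return {item: sum(days.values()) / len(days) for item, days in per_day.items()}

def top_n(averages, n):
    """Return the n items with the largest average daily sales."""
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical transactions.
rows = [
    ("item A", 0, 10.0), ("item A", 0, 2.0), ("item A", 1, 12.0),
    ("item B", 0, 30.0),
    ("item C", 0, 5.0), ("item C", 1, 5.0), ("item C", 2, 8.0),
]

averages = average_daily_sales(rows)
ranking = top_n(averages, 2)
print(ranking)
```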

Relationship analysis by category
To explore potential correlations between the vegetable categories, a correlation analysis is required. Correlation analysis measures the degree of correlation between variables and yields a correlation coefficient. According to the literature, there are two ways of calculating correlation coefficients, chosen according to the type of data: the Pearson correlation coefficient [3] is used when the data are quantitative and normally distributed [4], and the Spearman correlation coefficient [5] is used when the data are quantitative but not normally distributed. The normal Q-Q plot in Fig. 3 shows that the total sales volume data are not normally distributed. Spearman's correlation coefficient was therefore used to describe the degree of correlation between the sales volumes of different vegetable categories. The heat map of the correlation coefficients of the sales volume data of each vegetable category is shown in Fig. 4. Among them, the correlation coefficient between eggplant and cauliflower is relatively large, at 0.889. In practice, merchants can bundle the vegetable categories with larger correlation coefficients to increase sales.
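As an illustration of the rank-based calculation, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, which also handles ties. The two daily series are hypothetical and are not the paper's data; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical daily sales (kg) of two categories on the same five days.
eggplant = [10, 20, 30, 40, 50]
cauliflower = [12, 25, 24, 48, 60]

rho = spearman(eggplant, cauliflower)
print(round(rho, 3))
```

Because only ranks enter the calculation, the coefficient does not require the normality assumption that motivates the choice of Spearman over Pearson here.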

Relationship between individual products
Because there are many individual products, the heat map derived from Spearman correlation analysis is hard to read and not intuitive. K-means clustering analysis [6] is therefore performed on the individual vegetable products; products classified into the same category are taken to be more highly correlated [7].
Firstly, $k$ initial clustering centres $C_i$ $(1 \le i \le k)$ are randomly selected from the dataset, and the Euclidean distance between each remaining data object and the clustering centre $C_i$ is calculated [8]:
$$d(x, C_i) = \sqrt{\sum_{j=1}^{m} (x_j - C_{ij})^2}$$
where $x$ is the data object and the sum runs over its $m$ attributes.
The silhouette coefficient is an important indicator for evaluating clustering results: the closer it is to 1, the better the clustering [9]. Although the silhouette coefficient is closer to 1 when the number of clusters is 2, such a small number of clusters lacks practical significance. This paper therefore selects the number of clusters at which the silhouette coefficient is 0.902, as shown in Fig. 5; that is, the individual products are divided into three categories. The calculated clustering centre of category 1 is 588, that of category 2 is 25510, and that of category 3 is 8141 [10]. The clustering results are shown in Fig. 6.
The share of each category is then analysed. There are 212 items in category 1, 4 items in category 2 and 30 items in category 3. The analysis found that the largest cluster consists of common dishes, the second largest of side dishes, and the smallest of other dishes. The percentages are shown in Fig. 7.
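The clustering procedure described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each item is represented by a single value (so the Euclidean distance reduces to an absolute difference, the case $m = 1$), the item totals are hypothetical, and the helper names `assign`, `kmeans_1d`, `sse` and `silhouette` are invented for this example.

```python
import random

def assign(values, centres):
    """Label each value with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            for v in values]

def kmeans_1d(values, k, init=None, seed=0, max_iter=100):
    """Lloyd's algorithm in one dimension.

    Centres are drawn at random from the data unless `init` is given.
    """
    if init is None:
        init = random.Random(seed).sample(sorted(set(values)), k)
    centres = [float(c) for c in init]
    for _ in range(max_iter):
        labels = assign(values, centres)
        new_centres = []
        for i in range(k):
            members = [v for v, lab in zip(values, labels) if lab == i]
            # Keep the old centre if a cluster happens to be empty.
            new_centres.append(sum(members) / len(members) if members else centres[i])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, assign(values, centres)

def sse(values, centres, labels):
    """Sum of squared distances from each value to its cluster centre."""
    return sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))

def silhouette(values, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    def mean_dist(i, cluster):
        d = [abs(values[i] - values[j]) for j in range(len(values))
             if labels[j] == cluster and j != i]
        return sum(d) / len(d) if d else None

    scores = []
    for i, own in enumerate(labels):
        a = mean_dist(i, own)
        others = [mean_dist(i, c) for c in set(labels) if c != own]
        if a is None or not others:
            scores.append(0.0)
            continue
        b = min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical total sales (kg) of nine items.
totals = [500.0, 600.0, 650.0, 700.0, 8000.0, 8200.0, 8300.0, 25000.0, 26000.0]

# Compare candidate cluster counts, as in the choice of k discussed above.
for k in (2, 3, 4):
    centres, labels = kmeans_1d(totals, k, seed=1)
    print(k, round(sse(totals, centres, labels), 1),
          round(silhouette(totals, labels), 3))
```

Random initialisation can converge to a poor local optimum, so in practice the algorithm is usually restarted several times and the run with the smallest SSE is kept.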

Conclusions
A large amount of vegetable price data provides a basis for studying the distribution patterns of individual products and categories of vegetables. This paper analysed the time distribution of vegetable sales using time series plots and found that sales tend to reach their maximum in winter, which is related to the inability of vegetables to grow in winter. The Spearman correlation coefficient provides a tool for studying the correlation between vegetable categories. By studying the correlation between the vegetable categories, the bundling sales of each vegetable category is conducive to

Figure 1: Three-dimensional pie chart of sales share of various types of vegetables

Figure 3: Normal Q-Q plot of total sales volume

Figure 4: Heat map of Spearman's correlation coefficient for vegetable category

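In the Euclidean distance used for K-means clustering, $C_i$ is the $i$-th clustering centre, $m$ is the dimension of the data object, and $x_j$ and $C_{ij}$ are the $j$-th attribute values of $x$ and $C_i$, respectively. The sum of squared errors (SSE) for the entire dataset is calculated as:
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in \text{cluster } i} \lVert x - C_i \rVert^2$$
The magnitude of SSE indicates the goodness of the clustering result, and $k$ is the number of clusters.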

Figure 6: Graph of clustering results

As Fig. 7 shows, category 1 has the largest share, about 86%, and category 2 has the smallest, about 2%.
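These percentages follow directly from the cluster sizes reported in the text (212, 4 and 30 items). A short check, using only those three counts:

```python
# Cluster sizes as reported in the clustering results.
counts = {"category 1": 212, "category 2": 4, "category 3": 30}

total = sum(counts.values())  # 246 items in all
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# category 1: 86.2%
# category 2: 1.6%
# category 3: 12.2%
```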

Figure 7: Percentage of each category

Table 1: Total sales ranking

Table 2: Average daily sales