A Functional Data Classification Model Utilizing Functional Mahalanobis Distance and Regenerative Kernel Methods

Abstract: The classification of functional data is an important research direction in modern data mining. In this paper, we propose a similarity measure for functional data based on the functional Mahalanobis distance and reproducing kernel theory, for the setting in which the predictor is a random function and the response is a categorical scalar. The measure is then applied to functional kernel principal component analysis. In the classification phase, classic algorithms such as support vector machines and random forests can be combined with it to classify functional data. In the empirical analysis, the proposed method achieves better classification results than the reproducing kernel based on Euclidean distance and the reproducing kernel based on Euclidean distance with B-spline basis functions. Furthermore, this similarity measure can also be used in other machine learning algorithms based on reproducing kernel theory, yielding corresponding analysis methods for functional data.


Introduction
In 1982, the Canadian statistician Ramsay introduced the concept of functional data [1~3], which refers to data whose observations exhibit linear or nonlinear, possibly multidimensional, relationships that manifest as smooth curves or continuous functions. Compared with traditional multivariate data, functional data are infinite-dimensional.
Methods for classifying functional data are essentially extensions of multivariate statistical methods and can generally be divided into filtering and regularization. Filtering selects a suitable set of basis functions and then applies traditional classification methods to the expansion coefficients. For example, the functional data can be expanded in B-spline basis functions, and the basis coefficients can then be classified with the support vector machine (SVM) algorithm. The final effectiveness of this method depends entirely on the choice of basis functions. For instance, Pourshoghi et al. [4] used B-spline basis functions to expand the data and then applied SVM for classification. Regularization performs feature selection on the functional data to obtain the most important data points or intervals, reducing the dimension from infinite to finite, and then classifies with machine learning algorithms. For example, Jin Haibo and Ma Haiqiang [5] proposed a segmented method for extracting feature points from functional data, and Pini and Vantini [6] presented a hypothesis-testing-based method for interval-wise classification of functional data.
In addition to these two main strategies, many researchers have explored the classification of functional data from other perspectives. For example, Fan et al. [7] proposed a functional data classification algorithm based on random forests; Rossi and Villa [8] combined the kernel method with the support vector machine algorithm; Thind et al. [9] classified functional data with feedforward neural networks; Fuchs et al. [10] used the nearest neighbor classification method; Rossi et al. [11] used the multilayer perceptron; Ke Chien-Kun [12] proposed a classification model for functional data on Riemannian manifolds; and Vommi Amukta Malyada et al. [13] proposed a fuzzy KNN classifier based on the Bonferroni mean with hybrid filter and wrapper feature selection.
The aforementioned methods for functional data classification are all tied to specific classification techniques, which limits their applicability. Therefore, this study proposes a similarity measure for functional data based on the functional Mahalanobis distance and reproducing kernel theory, and applies this measure to kernel principal component analysis, thereby projecting infinite-dimensional functions into finite-dimensional spaces. The Mahalanobis distance constructed on the reproducing kernel effectively captures the information in functional data and describes the differences between functions. This similarity measure is suitable for any functional data analysis method based on reproducing kernel theory. Finally, this study verifies the effectiveness of the proposed method by applying it to a real data task.

FPCA-based Functional Mahalanobis Distance
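In 1936, the Indian statistician P. C. Mahalanobis proposed the Mahalanobis distance. It accounts for the correlations among features, is scale-invariant, and can be regarded as a modification of the Euclidean distance that incorporates the covariance structure of the data. For a multivariate vector $x = (x_1, x_2, \ldots, x_p)^T$ with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^T$ and covariance matrix $\Sigma$, its Mahalanobis distance is:
$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)} \tag{1}$$
For random variables $X$ and $Y$ that are identically distributed with covariance matrix $\Sigma$, the Mahalanobis distance between data points $x$ and $y$ is:
$$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)} \tag{2}$$
Next, we introduce functional principal component analysis (FPCA). It follows from Mercer's theorem that there exist a sequence of continuous functions $\{\phi_k(t), k \ge 1\}$ and a monotonically decreasing sequence of positive numbers $\{\lambda_k, k \ge 1\}$ such that the covariance function $G(s, t) = \mathrm{Cov}(X(s), X(t))$ satisfies:
$$G(s, t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s) \phi_k(t) \tag{3}$$
For a random function $X(t)$ there is:
$$X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k \phi_k(t) \tag{4}$$
where $\mu(t)$ is the mean function, the eigenfunctions $\phi_k(t)$, $k = 1, 2, \ldots$, are a set of pairwise orthogonal basis functions in the $L^2$ space, and $\xi_k$ is the principal component score, i.e. the projection of the centered function $X(t) - \mu(t)$ onto the eigenfunction $\phi_k(t)$:
$$\xi_k = \int (X(t) - \mu(t)) \phi_k(t) \, dt \tag{5}$$
which satisfies $E(\xi_k) = 0$ and $\mathrm{Var}(\xi_k) = \lambda_k$. In practical problems, the first $K$ principal component functions, which represent most of the information in the data, are usually chosen. Following the method of Ramsay and Silverman, the principal components are obtained by solving:
$$\int \hat{G}(s, t) \phi(t) \, dt = \lambda \phi(s) \tag{6}$$
where $\hat{G}(s, t)$ is the sample covariance function, $\lambda$ is the eigenvalue, and $\phi$ is the eigenfunction.
Finally, we introduce functional principal component analysis based on the functional Mahalanobis distance. First, the data are centered so that $\mu(t) = 0$, and the sample principal component functions are calculated according to (6). The principal component scores are then obtained by projecting the data:
$$\hat{\xi}_{ik} = \int X_i(t) \hat{\phi}_k(t) \, dt, \quad i = 1, \ldots, N, \; k = 1, \ldots, K \tag{7}$$
The scores $\hat{\xi}$ form an $N \times K$ matrix, and the functional Mahalanobis distance is obtained by calculating the Mahalanobis distance between the scores:
$$d_{FM}(X_i, X_j) = \sqrt{\sum_{k=1}^{K} \frac{(\hat{\xi}_{ik} - \hat{\xi}_{jk})^2}{\hat{\lambda}_k}} \tag{8}$$
where $\hat{\lambda}_k$ and $\hat{\phi}_k(t)$ are the estimates of the variance of the principal component scores and of the eigenfunctions.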
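The construction described above can be sketched numerically. The following is a minimal illustration, not the implementation used in this paper: it assumes that all curves are observed on a common, equally spaced grid, approximates integrals by Riemann sums, and uses NumPy. The function name `functional_mahalanobis`, its parameters, and the toy data are hypothetical.

```python
import numpy as np


def functional_mahalanobis(curves, dt, n_components):
    """Pairwise functional Mahalanobis distances for discretized curves.

    curves       : array (N, T), one curve per row on a common, equally spaced grid
    dt           : grid spacing, used to approximate integrals by Riemann sums
    n_components : number K of principal components to keep
    """
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=0)        # remove the sample mean function
    cov = centered.T @ centered / len(curves)      # sample covariance on the grid, (T, T)

    # Discretized eigenproblem: sum_t cov(s, t) * phi(t) * dt = lambda * phi(s)
    evals, evecs = np.linalg.eigh(cov * dt)
    top = np.argsort(evals)[::-1][:n_components]
    lam = evals[top]                               # estimated score variances
    phi = evecs[:, top] / np.sqrt(dt)              # rescaled so that sum(phi**2) * dt == 1

    scores = centered @ phi * dt                   # (N, K) principal component scores
    z = scores / np.sqrt(lam)                      # standardize each score by its std. dev.
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))       # (N, N) distance matrix


# Hypothetical toy data: 20 noisy curves built from two smooth modes of variation.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
toy_curves = (rng.normal(size=(20, 1)) * np.sin(2 * np.pi * grid)
              + 0.5 * rng.normal(size=(20, 1)) * np.cos(2 * np.pi * grid)
              + 0.01 * rng.normal(size=(20, 50)))
D = functional_mahalanobis(toy_curves, dt=grid[1] - grid[0], n_components=2)
```

Dividing each score by the square root of its estimated variance before taking Euclidean differences is what distinguishes this distance from a plain Euclidean distance between curves: directions with small variance are weighted up rather than ignored.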

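Suppose the original data is $X = [x_1, x_2, \ldots, x_m]$, where $m$ is the number of samples and $n$ is the number of variables. First, the covariance matrix in the feature space is obtained through the mapping $\Phi$ (with the mapped data assumed to be centered):
$$C = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T \tag{9}$$
$$\lambda \nu = C \nu \tag{10}$$
where $\lambda$ is the eigenvalue and $\nu$ is the eigenvector. Since the nonlinear mapping $\Phi$ is not known explicitly, the eigenvalues and eigenvectors cannot be derived directly from Eq. (10). The problem can be solved by noting that there exist $\alpha_i$ $(i = 1, 2, \ldots, m)$ such that the following equation holds:
$$\nu = \sum_{i=1}^{m} \alpha_i \Phi(x_i) = \Phi(X) \alpha \tag{11}$$
where $\Phi(X) = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_m)]$ and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$. Substituting Eq. (11) into Eq. (10) gives:
$$\lambda \Phi(X) \alpha = \frac{1}{m} \Phi(X) \Phi(X)^T \Phi(X) \alpha \tag{12}$$
Left-multiplying both sides by $\Phi(X)^T$ gives:
$$\lambda \Phi(X)^T \Phi(X) \alpha = \frac{1}{m} \Phi(X)^T \Phi(X) \Phi(X)^T \Phi(X) \alpha \tag{13}$$
Introducing the kernel function $\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ and the kernel matrix $K = \Phi(X)^T \Phi(X)$ with entries $K_{ij} = \kappa(x_i, x_j)$ yields $\lambda K \alpha = \frac{1}{m} K^2 \alpha$, which is satisfied by the solutions of:
$$K \alpha = m \lambda \alpha \tag{14}$$
The kernel function adopted in this paper is constructed from the functional Mahalanobis distance.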
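The kernel eigenproblem above can be illustrated with a short sketch. It is not the implementation used in this paper, and the exact kernel form is not recoverable from the text, so the Gaussian form `exp(-d^2 / (2 sigma^2))` applied to a distance matrix is an assumption made purely for illustration, as are the names `gaussian_kernel`, `kernel_pca`, and the bandwidth `sigma`. In the setting of this paper, `dist` would hold functional Mahalanobis distances rather than the Euclidean toy distances used here.

```python
import numpy as np


def gaussian_kernel(dist, sigma):
    """Assumed kernel form: kappa = exp(-d^2 / (2 * sigma^2)) on a distance matrix."""
    return np.exp(-np.asarray(dist, dtype=float) ** 2 / (2.0 * sigma ** 2))


def kernel_pca(K, n_components):
    """Project the training samples onto the leading kernel principal components."""
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one     # center the mapped data in feature space
    evals, evecs = np.linalg.eigh(Kc)              # solves Kc @ alpha = (m * lambda) * alpha
    top = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[top], evecs[:, top]
    alphas = evecs / np.sqrt(evals)                # scale so each nu = Phi(X) @ alpha has unit norm
    return Kc @ alphas                             # (m, n_components) projected features


# Hypothetical usage with Euclidean toy distances standing in for the functional distance.
rng = np.random.default_rng(1)
points = rng.normal(size=(10, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
features = kernel_pca(gaussian_kernel(dist, sigma=1.0), n_components=2)
```

Because only the kernel matrix enters the computation, the mapping itself never has to be evaluated; the resulting finite-dimensional features can then be passed to any standard classifier.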

Background of empirical data
An electrocardiogram (ECG) is a medical test used to diagnose heart diseases and abnormalities by recording the heart's electrical signals graphically. These signals are generated by the electrical activity of the heart muscle, which the ECG displays as a series of waveforms, usually including P waves, QRS complexes, and T waves. The shape, duration, and intervals of these waveforms provide important information about the health of the heart. ECG is widely used in clinical medicine. Its main uses include: detecting abnormalities in heart rhythm and helping doctors determine whether a patient's rhythm is normal; showing signs of myocardial infarction, such as ST-segment elevation or depression; long-term monitoring of cardiac function; assessing the effects of specific medications or therapeutic measures on cardiac function; and confirming before surgery that the patient's cardiac health is suitable for the operation.
ECG generates a large amount of data, including hundreds of heartbeat waveforms, which need to be classified in order to be understood and used effectively. With classified ECG data, doctors can identify heart diseases and abnormalities and make an accurate diagnosis, since different types of heart problems require different treatments; researchers can study heart health and new treatments for heart disease; and regular ECG monitoring can help doctors detect potential heart problems early and take preventive measures.

Data sources
The data come from http://timeseriesclassification.com/description.php?Dataset=ECG200, a dataset provided by R. Olszewski in 2001, in which each series traces the electrical activity recorded during one heartbeat. This is a binary classification problem whose two classes are normal heartbeat and myocardial infarction. The training set and the test set each contain 100 series, and each series has length 96.

Comparison of experimental setup and results
In this paper, a real two-class functional dataset is used to test the proposed algorithm. The experiments compare the reproducing kernel model based on the functional Mahalanobis distance with the reproducing kernel model based on B-spline Euclidean distance and the reproducing kernel model based on Euclidean distance, and the results are expressed in terms of accuracy. All experiments are implemented in the R language. The experimental data are divided into a training set and a test set in a 1:1 ratio, each experiment is repeated 100 times, and the mean and the variance of the 100 results are taken as the final experimental results.
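The evaluation protocol above can be sketched as follows. This is an illustrative outline rather than the R code used in the experiments: it assumes that each repetition draws a fresh random 1:1 split and that the population variance is reported, neither of which is stated explicitly. The names `repeated_holdout`, `fit_predict`, and `sign_rule`, and the toy data, are hypothetical; `fit_predict` stands in for any classifier.

```python
import random
from statistics import mean, pvariance


def repeated_holdout(samples, labels, fit_predict, n_repeats=100, seed=0):
    """Mean and variance of test accuracy over repeated random 1:1 splits.

    fit_predict(train_x, train_y, test_x) is a hypothetical callable standing in
    for any classifier; it must return one predicted label per test sample.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)                       # fresh random split each repetition
        half = len(indices) // 2
        train, test = indices[:half], indices[half:]
        predicted = fit_predict([samples[i] for i in train],
                                [labels[i] for i in train],
                                [samples[i] for i in test])
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies.append(correct / len(test))
    return mean(accuracies), pvariance(accuracies)


def sign_rule(train_x, train_y, test_x):
    """Toy stand-in classifier: predict 1 for positive inputs, -1 otherwise."""
    return [1 if x > 0 else -1 for x in test_x]


toy_x = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
toy_y = [-1, -1, -1, -1, 1, 1, 1, 1]
avg_acc, var_acc = repeated_holdout(toy_x, toy_y, sign_rule, n_repeats=100)
```

Reporting the variance alongside the mean matters here because a randomized classifier such as a random forest gives different results on each run, so a single accuracy figure would not show how stable a method is.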
Figure 1 plots the results of the 100 experiments for the reproducing kernel classification model based on Euclidean distance. As shown in Table 2, its highest accuracy is 0.68 and its lowest is 0.48, with a mean of 0.5895 and a variance of 0.0013. Figure 2 plots the results of the 100 experiments for the reproducing kernel classification model based on B-spline basis function Euclidean distance. As shown in Table 2, its highest accuracy is 0.68 and its lowest is 0.5, with a mean of 0.5944 and a variance of 0.00135. Figure 3 plots the results of the 100 experiments for the reproducing kernel classification model based on the functional Mahalanobis distance. As shown in Table 2, its highest accuracy is 0.88 and its lowest is 0.71, with a mean of 0.8016 and a variance of 0.00096.
In summary, the figures and the table show that the algorithm proposed in this paper outperforms the other two algorithms in both accuracy and dispersion: its mean accuracy is higher and its variance is lower. The reason is that the reproducing kernel constructed from the Mahalanobis distance captures the information in the functional data, and the KPCA dimensionality reduction extracts nonlinear information and reduces the amount of information lost from the functional data.

Conclusions and outlook
In this paper, we propose a reproducing kernel classification model for functional data based on the functional Mahalanobis distance. The main idea is to convert infinite-dimensional functional data into regular scalar data through principal component analysis with the functional Mahalanobis distance, and then to classify the data with a random forest based on KPCA, so that functional data can be classified with any machine learning algorithm. The most important step of this algorithm is to convert the infinite-dimensional functional data into regular scalar data without losing the information in the data. The reproducing-kernel-based functional Mahalanobis distance in this paper captures the information in the functional data well, and the KPCA dimensionality reduction extracts nonlinear information and reduces the loss of information from the functional data. The analysis of real data shows that the proposed reproducing kernel classification model based on the functional Mahalanobis distance is competitive in functional data classification. As a direction for future research, the similarity measure proposed in this paper can be applied to other machine learning algorithms based on reproducing kernels for functional data.

Figure 1: Results of 100 experiments based on the Euclidean distance kernel

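Table 1: Confusion Matrix
This paper adopts accuracy as the criterion for evaluating the algorithms. For the binary classification problem, the confusion matrix is shown in Table 1, and accuracy is given by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives.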
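As a small worked illustration of the accuracy criterion, the sketch below counts the four confusion-matrix cells from predicted and true labels and computes accuracy from them. The function names and the choice of `1` as the positive label are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem; `positive` marks the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correctly classified samples among all samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical labels: five samples, one false negative and one false positive.
counts = confusion_counts([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
acc = accuracy(*counts)
```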

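Table 2: Mean and Variance of the Three Methods
Method                                   Highest   Lowest   Mean     Variance
Euclidean distance kernel                0.68      0.48     0.5895   0.0013
B-spline Euclidean distance kernel       0.68      0.50     0.5944   0.00135
Functional Mahalanobis distance kernel   0.88      0.71     0.8016   0.00096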