Identification of a key gene associated with periodontitis and prediction of therapeutic drugs using machine learning in combination with the LIME model explainer

Abstract: Periodontitis is an immune-inflammatory disease characterized by irreversible periodontal attachment loss and bone destruction. In this study, we downloaded two microarray datasets, GSE10334 and GSE16134, from the Gene Expression Omnibus (GEO) database to identify molecular biomarkers and potential mechanisms associated with periodontitis. We performed differential gene expression analysis using the limma package, together with co-expression network analysis. We then used machine learning with L1 regularization and the LIME model explainer to identify the most relevant gene, ISL1. Finally, we performed molecular docking with AutoDockTools and PyMOL to evaluate candidate compounds. GO and KEGG enrichment analyses showed that periodontitis may affect various biological processes, including transcription, gene expression, apoptosis, and regulation of proliferation, and may influence the cytokine-cytokine receptor interaction, lipid and atherosclerosis, and IL-17 signaling pathways. Our molecular docking results indicated that the selected target could be stably bound by all of the active components we chose. In summary, this study identifies ISL1 as the hub gene, together with 9 active components that may play a role in regulating ISL1 in periodontitis.


Introduction
Periodontitis is a prevalent chronic immune-inflammatory disease triggered by microbial plaque and characterized by gradual loss of soft tissue support and bone resorption [1]. Its pathophysiology is marked by an excess of pro-inflammatory factors and an insufficiency of the factors required for inflammation resolution [2]. The Fourth National Oral Health Epidemiological Survey in China indicated that 87%-97% of Chinese adults exhibit varying degrees of periodontal disease. If left untreated, periodontitis can result in tooth mobility and eventual tooth loss [3,4]. Furthermore, periodontitis is associated with systemic diseases such as cardiovascular disease [5], Alzheimer's disease [6], diabetes, and insulin resistance. Thus, early diagnosis of periodontitis is critical for protecting the alveolar bone, maintaining tooth stability, and potentially preventing related diseases [7].
A prerequisite for using machine learning to screen for periodontitis-related genes is the design of strategies that not only perform well on training data but also generalize well to new inputs. Regularization techniques are explicitly designed to reduce test error and are defined as "modifications to the learning algorithm aimed at reducing the generalization error instead of the training error" [8]. In other words, the objective of regularization is to prevent overfitting and thereby improve generalization. Developing more effective regularization strategies has become one of the primary research topics in machine learning. A variety of regularization strategies exist; the most basic adds a penalty term to the original objective function to penalize models with high capacity [9]. The mathematical expression is as follows:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta)$$

where X and y are the training samples and their corresponding labels, θ is the parameter vector, J is the objective function, Ω is the penalty term, and α controls the strength of regularization. Different Ω functions have different preferences for the optimal solution of the parameter θ, resulting in varying regularization effects. In deep learning, it is common practice to regularize only the weights and not the biases. The two most commonly used Ω functions are the L1 norm and the L2 norm, both of which are special cases of the Lp norm:

$$\|\theta\|_p = \left(\sum_i |\theta_i|^p\right)^{1/p}$$

When p = 1, this is the L1 norm, which is the sum of the absolute values of the elements of the vector:

$$\|\theta\|_1 = \sum_i |\theta_i|$$

The L1 norm is usually used to identify optimal and sparse feature items [10].
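The penalized objective described above can be illustrated with a minimal sketch. It assumes a linear model without a bias term and a mean-squared-error loss for J; neither choice is stated in the text, and the toy data and function names are hypothetical.

```python
def l1_norm(theta):
    """Omega(theta) for L1 regularization: sum of absolute parameter values."""
    return sum(abs(t) for t in theta)


def squared_l2_norm(theta):
    """A common L2-type penalty: sum of squared parameter values."""
    return sum(t * t for t in theta)


def mean_squared_error(theta, X, y):
    """J(theta; X, y) for a linear model without a bias term (an assumption)."""
    residuals = [
        sum(t * x for t, x in zip(theta, row)) - target
        for row, target in zip(X, y)
    ]
    return sum(r * r for r in residuals) / len(y)


def penalized_objective(theta, X, y, alpha, omega):
    """J(theta; X, y) + alpha * Omega(theta)."""
    return mean_squared_error(theta, X, y) + alpha * omega(theta)


# Hypothetical toy data: only the first feature determines the label.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 0.0, 2.0]
sparse_theta = [2.0, 0.0]  # fits the data using one feature
dense_theta = [1.0, 1.0]   # spreads weight over both features

sparse_cost = penalized_objective(sparse_theta, X, y, alpha=0.1, omega=l1_norm)
dense_cost = penalized_objective(dense_theta, X, y, alpha=0.1, omega=l1_norm)
```

Both parameter vectors have the same L1 norm here, so the difference in cost comes entirely from the data-fit term; swapping `l1_norm` for `squared_l2_norm` changes only the penalty.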
In this study, we first examined differential gene expression between periodontitis and healthy samples based on the GSE10334 and GSE16134 expression profiles. Functional analysis revealed that the differentially expressed genes were mainly involved in immune-related biological processes. Additionally, the CIBERSORT algorithm demonstrated significant differences in the abundance of most immune cells between periodontitis and healthy samples. The central genes were identified through L1 regularization and LIME model interpretation; they may facilitate an understanding of the pathogenesis of periodontitis and may serve as therapeutic targets.

Microarray data acquisition
Two datasets, GSE10334 and GSE16134, were obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). Table 1 gives more information about the gene expression profiles used in this study. Both GSE10334 [11] and GSE16134 [12] are based on the GPL570 platform and contain array-based gene expression profiles of periodontitis.

Data merging and Differentially Expressed Genes (DEGs) selection
The probe identifiers in the series matrix files were converted to gene symbols using ActivePerl 5.30.0 (https://www.activestate.com/products/perl/). After all microarray data had been merged, the 'ComBat' function of the 'sva' package in R was used to adjust for batch effects using empirical Bayes models. Finally, we used the 'normalize' function of the 'limma' package in R to normalize the expression values of the datasets [13]. A gene was defined as a DEG between the periodontitis and normal samples when the adjusted P value was < 0.05 and |log2FC| > 1; the DEGs were visualized as volcano plots and heat maps.
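The DEG criterion above (adjusted P < 0.05 and |log2FC| > 1) amounts to a simple filter over the differential expression results. The following is a minimal sketch; the gene names, values, and the function `classify_genes` are hypothetical and do not come from the study's data.

```python
def classify_genes(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Split genes into up- and downregulated DEGs.

    `results` maps a gene symbol to a (log2 fold change, adjusted P value) pair.
    """
    upregulated, downregulated = [], []
    for gene, (log2fc, padj) in results.items():
        if padj < padj_cutoff and abs(log2fc) > lfc_cutoff:
            if log2fc > 0:
                upregulated.append(gene)
            else:
                downregulated.append(gene)
    return sorted(upregulated), sorted(downregulated)


# Hypothetical example values, for illustration only.
toy_results = {
    "GENE_A": (1.8, 0.001),    # passes both thresholds, upregulated
    "GENE_B": (-1.3, 0.02),    # passes both thresholds, downregulated
    "GENE_C": (2.5, 0.20),     # large change but not significant
    "GENE_D": (0.4, 0.0001),   # significant but small change
}
up, down = classify_genes(toy_results)
```

Requiring both conditions discards genes that are statistically significant but barely change, as well as genes with large but unreliable fold changes.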

GO and KEGG Analysis
The Gene Ontology (GO) database comprises three categories: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). GO and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed in R using the org.Hs.eg.db package and the clusterProfiler package (https://github.com/YuLab-SMU/clusterProfiler), with the ggplot2 package (version 3.3.6) for visualization. Homo sapiens was designated as the species of interest, and a screening threshold of p.adjust < 0.05 was used to obtain the primary enriched functions and pathways.
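The statistics behind this kind of enrichment screening can be sketched as a hypergeometric over-representation test followed by Benjamini-Hochberg adjustment, which is the approach clusterProfiler uses by default. The sketch below is a from-scratch illustration in Python, not the package's code, and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_degs, n_overlap):
    """P(X >= n_overlap) when n_degs genes are drawn from n_background genes,
    of which n_in_term are annotated to the GO term or KEGG pathway."""
    upper = min(n_in_term, n_degs)
    favourable = sum(
        comb(n_in_term, i) * comb(n_background - n_in_term, n_degs - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_degs)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts: 20,000 background genes, 150 annotated to one term,
# 146 DEGs, 8 of which carry the annotation (about 1 expected by chance).
p_term = hypergeom_enrichment_p(20000, 150, 146, 8)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A term is then retained when its adjusted P value falls below the 0.05 threshold stated above.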

Immune infiltration analysis
The CIBERSORT algorithm was used to compare the immune landscape of the microenvironment between the normal and periodontitis groups. The combined dataset served as the gene expression input, and the LM22 gene signature file, which defines 22 immune cell types, served as the reference. The analysis was conducted with 1,000 permutations, and the resulting CIBERSORT values represent the fraction of each infiltrating immune cell type per sample.

Screening of the hub gene
L1 regularization is a common method in linear regression that reduces model complexity and overfitting by adding an L1 norm penalty term to the loss function. This leads to sparse solutions, which makes it useful in machine learning applications such as LASSO regression and sparse coding.
Applied to gene expression profiling data, L1 regularization drives the coefficients of uninformative genes to exactly zero. The genes that retain nonzero coefficients are candidate periodontitis-related genes, and the size and sign of each coefficient indicate the strength and direction of its association with the disease.
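How an L1 penalty zeroes out a redundant feature can be shown with a minimal sketch. It assumes a logistic-regression classifier fitted by proximal gradient descent (ISTA); the text does not specify the model or optimizer, and the two-feature toy data stand in for real expression profiles.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, alpha=0.1, learning_rate=1.0, n_iter=2000):
    """Logistic regression with an L1 penalty on the weights (not the bias)."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        errors = [
            sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - label
            for row, label in zip(X, y)
        ]
        grad_bias = sum(errors) / n_samples
        grads = [
            sum(e * row[j] for e, row in zip(errors, X)) / n_samples
            for j in range(n_features)
        ]
        bias -= learning_rate * grad_bias
        # Gradient step on the loss, then soft-thresholding for the penalty.
        weights = [
            soft_threshold(w - learning_rate * g, learning_rate * alpha)
            for w, g in zip(weights, grads)
        ]
    return weights, bias


# Hypothetical data: feature 0 separates the classes; feature 1 is only
# weakly correlated with the label and adds nothing beyond feature 0.
X = [[-1, -1], [-1, -1], [-1, -1], [-1, 1],
     [1, 1], [1, 1], [1, 1], [1, -1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = fit_l1_logistic(X, y)
```

On this toy problem the weight of the redundant feature ends at exactly zero while the informative feature keeps a positive weight, which is the behaviour that makes L1-penalized models usable for gene screening.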
LIME (Local Interpretable Model-Agnostic Explanations) is another valuable tool, capable of explaining predictions from any black-box model. It constructs a locally interpretable model for each instance, which makes it useful for explaining periodontitis-related features and genes. In this workflow, a black-box model is trained on gene expression profiling data to predict periodontitis, and LIME then explains the prediction for a particular sample. The explanation involves generating a dataset of perturbed samples around that instance, calculating feature contributions, selecting important features, and interpreting the model's prediction. This enhances understanding of periodontitis pathogenesis and can provide new diagnostic and treatment insights.
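The LIME procedure described above can be approximated by the following from-scratch sketch: perturb the instance, query the black-box model, weight the perturbed samples by proximity, and fit a weighted linear surrogate. It does not use the `lime` package; the black-box function, kernel form, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def black_box(sample):
    """Hypothetical stand-in for a trained classifier: P(periodontitis)."""
    gene_a, gene_b = sample
    return 1.0 / (1.0 + math.exp(-(2.0 * gene_a - 1.0 * gene_b)))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, scale=0.5,
                    kernel_width=0.75, seed=0):
    """Return one local linear coefficient per feature of `instance`."""
    rng = random.Random(seed)
    size = len(instance) + 1  # intercept plus one column per feature
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        perturbed = [v + rng.gauss(0.0, scale) for v in instance]
        sq_dist = sum((p - v) ** 2 for p, v in zip(perturbed, instance))
        rows.append([1.0] + perturbed)
        targets.append(predict(perturbed))
        weights.append(math.exp(-sq_dist / kernel_width ** 2))
    # Weighted least squares via the normal equations (Z^T W Z) beta = Z^T W y.
    A = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
          for j in range(size)] for i in range(size)]
    b = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
         for i in range(size)]
    return solve_linear_system(A, b)[1:]  # drop the intercept


contributions = explain_locally(black_box, [0.0, 0.0])
```

The sign of each coefficient indicates whether raising that feature pushes the local prediction toward or away from the periodontitis class, and the magnitude ranks the features for that one sample.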

Molecular Docking Verification
The 3D structures of 9 potentially active ingredients were downloaded from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/). The 3D structure of the protein encoded by the hub gene was downloaded from the PDB protein database (http://www.rcsb.org/pdb/home/home). Water molecules and ligands were then removed from the protein using PyMOL. AutoDock was used to perform molecular docking between the 9 potential active ingredients and the hub gene protein, and the binding strength between them was evaluated according to the docking binding energy.

Identification of DEGs
The gene expression values of the merged GEO series were standardized after adjustment for batch effects. The DEGs were analyzed using the 'limma' package. After consolidation and normalization, 146 DEGs (|log2FC| > 1, adjusted P < 0.05) between untreated and periodontitis subjects were identified. Among them, 107 genes were upregulated and 39 genes were downregulated. Twenty upregulated and 20 downregulated genes were selected for display in the heat map (Figure 1A). A volcano plot was used to show the upregulated and downregulated genes (Figure 1B).

GO term analysis
We performed GO analysis of the DEGs to learn more about the biological functions involved in periodontitis samples. As shown in Figure 2A, changes in GO biological processes (BP) mainly included humoral immune response, phagocytosis, and activation of immune response. In the CC category, genes were primarily enriched in the cornified envelope and the external side of the plasma membrane. In the molecular function (MF) category, changes were significant in chemokine activity, chemokine receptor binding, and G protein-coupled receptor binding.
Figure 2: GO term analysis. (A) Results of GO enrichment. (B) Bubble plot of GO terms. (C) Heat map of GO terms. (D) Circle plot of GO terms.

KEGG pathway enrichment analysis
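KEGG pathway analyses were performed using the clusterProfiler package in R. As shown in Figure 3, the DEGs were significantly associated with cytokine-cytokine receptor interaction, lipid and atherosclerosis, the IL-17 signaling pathway, viral protein interaction with cytokine and cytokine receptor, rheumatoid arthritis, and the chemokine signaling pathway.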

Immune landscape of periodontitis
Moreover, the CIBERSORT algorithm was used to quantify the proportions of immune cells in order to evaluate the associations between the dataset and the immune microenvironment (Figure 4A). The difference in immune infiltration between the periodontitis and untreated groups was then investigated across 22 immune cell types. The periodontitis group had a significantly higher proportion of plasma cells (Figure 4B). Next, we assessed the correlation between ISL1 and immune cells (Figure 4C).

Identification of hub gene
In this study, we applied L1 regularization and LIME model interpretability techniques to identify key genes associated with periodontitis. L1 regularization is a widely used method for feature selection in machine learning; it shrinks toward zero the coefficients of features that are not relevant to the prediction task (Figure 5). LIME is a model-agnostic technique that explains the predictions of any machine learning model by approximating its behavior in the local neighborhood of a given instance (Figure 6).
Using these techniques, we identified ISL1 as a key gene associated with periodontitis. ISL1 is a transcription factor that plays a crucial role in the development of various tissues, including the heart and nervous system. Our analysis suggests that ISL1 may be a potential therapeutic target for periodontitis.
Overall, our results suggest that combining L1 regularization with LIME model interpretability techniques is an effective way to identify key genes and potential therapeutic targets in complex diseases. Further studies are needed to validate the role of ISL1 in periodontitis and to explore its potential as a therapeutic target.

Molecular Docking Verification
We evaluated the binding energy of the 9 candidate compounds with ISL1 using molecular docking, and the results are presented in Table 2. It is generally believed that the lower the binding energy of the ligand to the receptor, the more likely the ligand is to interact with the receptor. Our results showed that the binding energy of all 9 predicted active components with ISL1 was less than -6 kcal/mol, which indicates a potential interaction between them. The molecular docking modes are shown in Figure 7.
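To give a sense of scale for the -6 kcal/mol threshold, a binding free energy can be converted to an approximate dissociation constant through the standard relation ΔG = RT ln Kd. This conversion is not part of the original analysis; the sketch assumes a temperature of 298.15 K, and docking scores are only rough estimates of true binding free energies.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal/(mol*K)


def dissociation_constant(delta_g, temperature=298.15):
    """Approximate Kd in mol/L from a binding free energy in kcal/mol."""
    return math.exp(delta_g / (GAS_CONSTANT * temperature))


# Kd corresponding to the -6 kcal/mol threshold used above (assumed 298.15 K).
kd_threshold = dissociation_constant(-6.0)
kd_micromolar = kd_threshold * 1e6
```

Under these assumptions, -6 kcal/mol corresponds to a Kd in the tens of micromolar, and each further decrease in binding energy tightens the predicted affinity exponentially.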

Discussion
Periodontitis, an immunoinflammatory disease [14,15], can lead to irreversible bone destruction [16]. Traditional treatments, which focus on disrupting dental plaque biofilms, yield an unsatisfactory prognosis in some populations. Hence, understanding its etiological mechanism is crucial for developing comprehensive treatment strategies.
Using L1 regularization and LIME, we identified ISL1 as a key gene in periodontitis. ISL1, a transcription factor, regulates Bmp4 transcription [17]. Research suggests that dental epithelial stem cells could potentially generate new teeth; however, their regulation is not yet understood well enough for successful implementation. Animal studies show that Fgf10 is a major regulator of the dental epithelial stem cell niche [18], with Shh signaling activity maintaining the niche. The FAK-YAP-mTOR pathway regulates the balance between stem cell proliferation and differentiation into enamel-forming cells [19,20], and ISL1 expression and Shh signaling pathway activity are crucial for proper enamel patterning.
Molecular docking supported the active ingredients that may have regulatory effects on the hub gene ISL1. Among them, benzo[a]pyrene/aryl hydrocarbon receptor signaling inhibits osteoblastic differentiation and collagen synthesis of human periodontal ligament cells [21]. Dorsomorphin attenuates Jagged1-induced mineralization in human dental pulp cells [22]. Fenretinide has been shown to have an anti-inflammatory effect: it inhibited chemokine [23] and chemokine receptor expression [24] in vitro. In animal studies, fenretinide suppressed chronic arthritis induced by administration of streptococcal cell wall [25] and decreased the mRNA levels of proinflammatory mediators in the spinal cord after spinal cord injury [26]. The application of topical tretinoin acid gel resulted in a 50 percent reduction in the incidence of oral leukoplakia [27].
However, it is imperative to note that our study is subject to certain limitations. Firstly, the samples utilized in this investigation lacked essential clinicopathological information. As such, our identification of diagnostic markers was limited solely to the transcriptomic level. Secondly, the current transcriptomic datasets available for periodontitis in GEO were restricted, rendering validation of our findings challenging due to inadequate data. Thirdly, the outcomes of our bioinformatics analysis alone may not suffice to establish conclusive evidence, and as such, experimental validation is necessary to confirm our findings.

Conclusion
In this study, we analyzed the immunoregulatory effects, affected biological processes, and signaling pathways of periodontitis. We identified the most relevant gene in periodontitis, ISL1, by analyzing the dataset using L1 regularization and the LIME interpretability model. This finding provides new insights into the prevention and treatment of periodontitis. Additionally, we identified nine active ingredients that may play a role in regulating the ISL1 gene, which could contribute to further research on the pathogenesis of periodontitis.

Figure 1: (A) Heat map of the 40 selected DEGs. (B) Volcano plot of all DEGs.
Figure 3: KEGG pathway enrichment analysis. DEGs were significantly associated with cytokine-cytokine receptor interaction, lipid and atherosclerosis, the IL-17 signaling pathway, viral protein interaction with cytokine and cytokine receptor, rheumatoid arthritis, and the chemokine signaling pathway. (A) Results of KEGG enrichment. (B) Bubble plot of KEGG terms. (C) Heat map of KEGG terms. (D) Circle plot of KEGG terms.
Figure 4: (A) Immune cell distribution. (B) Immune landscape. (C) Fraction of immune cells in the normal and periodontitis groups.

Figure 6: XML visualization generated from LIME's API package

Figure 7: Schematic diagram of molecular docking of 9 potential drugs

Table 1: Characteristics of the datasets in this study.

Table 2: Binding energy of the active components to ISL1 by molecular docking.