JunJie Wee


  • weejunji[at]msu.edu

    • Department of Mathematics
    • Michigan State University
    • C114 Wells Hall
    • 619 Red Cedar Road
    • East Lansing, MI, 48824





    Publications



    [20] JunJie Wee, Guo-Wei Wei. Rapid response to fast viral evolution using AlphaFold 3-assisted topological deep learning, (2024). arXiv preprint arXiv:2411.12370.

    The fast evolution of SARS-CoV-2 and other infectious viruses poses a grand challenge to the rapid response in terms of viral tracking, diagnostics, and design and manufacture of monoclonal antibodies (mAbs) and vaccines, which are both time-consuming and costly. This underscores the need for efficient computational approaches. Recent advancements, like topological deep learning (TDL), have introduced powerful tools for forecasting emerging dominant variants, yet they require deep mutational scanning (DMS) of viral surface proteins and associated three-dimensional (3D) protein-protein interaction (PPI) complex structures. We propose an AlphaFold 3 (AF3)-assisted multi-task topological Laplacian (MT-TopLap) strategy to address this need. MT-TopLap combines deep learning with topological data analysis (TDA) models, such as persistent Laplacians (PL) to extract detailed topological and geometric characteristics of PPIs, thereby enhancing the prediction of DMS and binding free energy (BFE) changes upon virus mutations. Validation with four experimental DMS datasets of SARS-CoV-2 spike receptor-binding domain (RBD) and the human angiotensin-converting enzyme-2 (ACE2) complexes indicates that our AF3 assisted MT-TopLap strategy maintains robust performance, with only an average 1.1% decrease in Pearson correlation coefficients (PCC) and an average 9.3% increase in root mean square errors (RMSE), compared with the use of experimental structures. Additionally, AF3-assisted MT-TopLap achieved a PCC of 0.81 when tested with a SARS-CoV-2 HK.3 variant DMS dataset, confirming its capability to accurately predict BFE changes and adapt to new experimental data, thereby showcasing its potential for rapid and effective response to fast viral evolution.
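    The two metrics quoted in the abstract above, PCC and RMSE, and the relative changes reported between two sets of predictions, can be computed as in the following minimal sketch. All numbers and helper names (`pearson_cc`, `rmse`, `relative_change_percent`) are hypothetical illustrations and are not taken from the paper.

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_change_percent(baseline, value):
    """Signed percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline

# Hypothetical data (assumption): measured values and two sets of predictions,
# e.g. one model fed experimental structures and one fed predicted structures.
measured = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 1.9, 3.2, 3.8]
pred_b = [1.2, 1.8, 3.3, 3.6]

pcc_a, pcc_b = pearson_cc(measured, pred_a), pearson_cc(measured, pred_b)
rmse_a, rmse_b = rmse(measured, pred_a), rmse(measured, pred_b)

print("PCC change (%): ", relative_change_percent(pcc_a, pcc_b))
print("RMSE change (%):", relative_change_percent(rmse_a, rmse_b))
```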

    [16] Dong Chen, Gengzhuo Liu, Hongyan Du, JunJie Wee, Rui Wang, Jiahui Chen, Jana Shen, Guo-Wei Wei. Drug resistance revealed by in silico deep mutational scanning and mutation tracker, (2024). arXiv preprint arXiv:2403.02603.

    As COVID-19 enters its fifth year, it continues to pose a significant global health threat, with the constantly mutating SARS-CoV-2 virus challenging drug effectiveness. A comprehensive understanding of virus-drug interactions is essential for predicting and improving drug effectiveness, especially in combating drug resistance during the pandemic. In response, the Path Laplacian Transformer-based Prospective Analysis Framework (PLFormer-PAF) has been proposed, integrating historical data analysis and predictive modeling strategies. This dual-strategy approach utilizes path topology to transform protein-ligand complexes into topological sequences, enabling the use of advanced large language models for analyzing protein-ligand interactions and enhancing its reliability with factual insights garnered from historical data. It has shown unparalleled performance in binding affinity prediction tasks across various benchmarks, including specific evaluations related to SARS-CoV-2, and assesses the impact of virus mutations on drug efficacy, offering crucial insights into potential drug resistance. The predictions align with observed mutation patterns in SARS-CoV-2, indicating that the widespread use of the Pfizer drug has led to viral evolution and reduced drug efficacy. PLFormer-PAF's capabilities extend beyond identifying drug-resistant strains, positioning it as a key tool in drug discovery research and the development of new therapeutic strategies against fast-mutating viruses like SARS-CoV-2.

    [13] Jialin Bi, JunJie Wee, Xiang Liu, Cunquan Qu, Guanghui Wang, and Kelin Xia. Multiscale Topological Indices for the Quantitative Prediction of SARS CoV-2 Binding Affinity Change upon Mutations. Journal of Chemical Information and Modeling (2023).

    The Coronavirus disease 2019 (COVID-19) has affected people’s lives and the development of the global economy. Biologically, protein–protein interactions between SARS-CoV-2 surface spike (S) protein and human ACE2 protein are the key mechanism behind the COVID-19 disease. In this study, we provide insights into interactions between the SARS-CoV-2 S-protein and ACE2, and propose topological indices to quantitatively characterize the impact of mutations on binding affinity changes (ΔΔG). In our model, a series of nested simplicial complexes and their related adjacency matrices at various scales are generated from a specially designed filtration process, based on the 3D structures of spike-ACE2 protein complexes. For the first time, we develop a set of multiscale simplicial complex-based topological indices. Unlike previous graph network models, which give only a qualitative analysis, our topological indices can provide a quantitative prediction of the binding affinity change caused by mutations and achieve high accuracy. In particular, for mutations at specific amino acids, such as polar amino acids or arginine, the correlation between our topological gravity model index and binding affinity change, in terms of Pearson correlation coefficient, can be higher than 0.8. As far as we know, this is the first time multiscale topological indices have been used in the quantitative analysis of protein–protein interactions.

    [8] D. Vijay Anand, Qiang Xu, JunJie Wee, Kelin Xia, Tze Chien Sum. Topological Feature Engineering for Machine Learning Based Halide Perovskite Materials Design. npj Computational Materials. (2022). 8(1), 203.

    Accelerated materials development with machine learning (ML) assisted screening and high throughput experimentation for new photovoltaic materials holds the key to addressing our grand energy challenges. Data-driven ML is envisaged as a decisive enabler for new perovskite materials discovery. However, its full potential can be severely curtailed by poorly represented molecular descriptors (or fingerprints). Optimal descriptors are essential for establishing effective mathematical representations of quantitative structure-property relationships. Here we reveal that our persistent functions (PFs) based learning models offer significant accuracy advantages over traditional descriptor based models in organic-inorganic halide perovskite (OIHP) materials design and perform similarly to deep learning models. Our multiscale simplicial complex approach not only provides a more precise representation for OIHP structures and underlying interactions, but also has better transferability to ML models. Our results demonstrate that advanced geometrical and topological invariants are highly efficient feature engineering approaches that can markedly improve the performance of learning models for molecular data analysis. Further, new structure-property relationships can be established between our invariants and bandgaps. We anticipate that our molecular representations and featurization models will transcend the limitations of conventional approaches and lead to breakthroughs in perovskite materials design and discovery.

    [6] Ronald Koh Joon Wei, JunJie Wee, Valerie Evangelin Laurent and Kelin Xia. Hodge theory-based biomolecular data analysis. (2022). Scientific Reports. 12(1), 1-16.

    Hodge theory reveals the deep intrinsic relations of differential forms and provides a bridge between differential geometry, algebraic topology, and functional analysis. Here we use Hodge Laplacian and Hodge decomposition models to analyze biomolecular structures. Different from traditional graph-based methods, biomolecular structures are represented as simplicial complexes, which can be viewed as a generalization of graph models to their higher-dimensional counterparts. Hodge Laplacian matrices at different dimensions can be generated from the simplicial complex. The spectral information of these matrices can be used to study intrinsic topological information of biomolecular structures. Essentially, the number (or multiplicity) of k-th dimensional zero eigenvalues is equivalent to the k-th Betti number, i.e., the rank of the k-th homology group. The associated eigenvectors indicate the homological generators, i.e., circles or holes within the molecular-based simplicial complex. Furthermore, the Hodge decomposition-based HodgeRank model is used to characterize the folding or compactness of the molecular structures, in particular, the topological associated domain (TAD) in high-throughput chromosome conformation capture (Hi-C) data. Mathematically, molecular structures are represented as simplicial complexes with certain edge flows. The HodgeRank-based average/total inconsistency (AI/TI) is used for the quantitative measurements of the folding or compactness of TADs. This is the first quantitative measurement for TAD regions, as far as we know.
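    The relation between zero eigenvalues of Hodge Laplacians and Betti numbers described above can be checked on a toy example. The following is a minimal sketch, assuming a hand-built simplicial complex (a square with one diagonal, where one of the two triangles is filled); the helper names `boundary_matrix` and `zero_eigenvalue_count` are illustrative and are not taken from the paper's code.

```python
import numpy as np

def boundary_matrix(k_simplices, faces):
    """Signed boundary matrix from oriented k-simplices to (k-1)-simplices."""
    row = {face: i for i, face in enumerate(faces)}
    B = np.zeros((len(faces), len(k_simplices)))
    for col, simplex in enumerate(k_simplices):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop the i-th vertex
            B[row[face], col] = (-1) ** i
    return B

def zero_eigenvalue_count(L, tol=1e-9):
    """Multiplicity of the zero eigenvalue of a symmetric matrix."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))

# Toy complex (assumption): square 0-1-2-3 with diagonal (0, 2);
# triangle (0, 1, 2) is filled, triangle (0, 2, 3) is left hollow.
vertices = [(0,), (1,), (2,), (3,)]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
triangles = [(0, 1, 2)]

B1 = boundary_matrix(edges, vertices)   # edges -> vertices
B2 = boundary_matrix(triangles, edges)  # triangles -> edges

L0 = B1 @ B1.T             # 0-th Hodge Laplacian (graph Laplacian)
L1 = B1.T @ B1 + B2 @ B2.T  # 1-st Hodge Laplacian

betti_0 = zero_eigenvalue_count(L0)  # number of connected components
betti_1 = zero_eigenvalue_count(L1)  # number of independent unfilled loops
print(betti_0, betti_1)
```

    For this complex, one connected component and one unfilled loop are expected, so both counts should equal 1.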

    [5] Gong Weikang, Wee JunJie, Wu Min-Chun, Sun Xiaohan, Li Chunhua, Xia Kelin. Persistent spectral simplicial complex-based machine learning (PerSpectSC-ML) for chromosomal structural analysis in cellular differentiation. Briefings in Bioinformatics. (2022). Volume 23, Issue 4, bbac168.

    The three-dimensional (3D) chromosomal structure plays an essential role in all DNA-templated processes, including gene transcription, DNA replication, and other cellular processes. Although chromosome conformation capture (3C) methods, such as Hi-C, can generate chromosomal contact data that characterize genome-wide chromosomal structural properties, an understanding of the 3D nature of the genome based on Hi-C data remains lacking. Here, we propose a persistent spectral simplicial complex (PerSpectSC) model to describe Hi-C data for the first time. Specifically, a filtration process is introduced to generate a series of nested simplicial complexes at different scales. For each of these simplicial complexes, its spectral information can be calculated from the corresponding Hodge Laplacian matrix. The PerSpectSC model describes the persistence and variation of the spectral information of the nested simplicial complexes during the filtration process. Different from all previous models, our PerSpectSC-based features provide a quantitative global-scale characterization of chromosome structures and topology. Our descriptors can successfully classify cell types and also cellular differentiation stages for all the 24 types of chromosomes simultaneously. In particular, persistent minimum best characterizes cell types and Dim (1) persistent multiplicity best characterizes cellular differentiation. These results demonstrate the great potential of our PerSpectSC-based models in polymeric data analysis.
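    The idea of tracking spectral information along a filtration can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the PerSpectSC implementation: it uses a distance-based filtration on a hypothetical point cloud, only the dimension-0 Laplacian (the graph Laplacian), and it assumes that "multiplicity" means the number of zero eigenvalues and "minimum" means the smallest positive eigenvalue.

```python
import numpy as np

def graph_laplacian(points, radius):
    """Graph Laplacian with an edge between points at distance <= radius."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacency = (dists <= radius).astype(float)
    np.fill_diagonal(adjacency, 0.0)
    return np.diag(adjacency.sum(axis=1)) - adjacency

def spectral_summary(points, radii, tol=1e-9):
    """Zero-eigenvalue multiplicity and smallest positive eigenvalue per radius."""
    multiplicities, minima = [], []
    for radius in radii:
        eig = np.linalg.eigvalsh(graph_laplacian(points, radius))
        positive = eig[eig > tol]
        multiplicities.append(int(np.sum(eig <= tol)))
        # Convention (assumption): report 0.0 when there is no positive eigenvalue.
        minima.append(float(positive.min()) if positive.size else 0.0)
    return multiplicities, minima

# Hypothetical point cloud: three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
radii = [0.5, 1.0, 1.5, 8.0]

multiplicities, minima = spectral_summary(points, radii)
print(multiplicities)  # number of connected components at each scale
print(minima)
```

    The resulting sequences, one value per filtration scale, are the kind of fixed-length vectors that can be passed to a classifier as features.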

    [3] Wee JunJie, Xia Kelin. Forman persistent Ricci curvature (FPRC) based machine learning for protein-ligand binding affinity prediction. Briefings in Bioinformatics (2021). Volume 22, Issue 6, bbab136.

    Artificial intelligence (AI) techniques have been gradually applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose, for the first time, Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering. Molecular structures and interactions are modeled as simplicial complexes, which are a generalization of graphs to their higher-dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales are generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process are defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors, and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on the three most commonly used datasets, namely PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.
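    The combination of a filtration with Forman Ricci curvature can be illustrated as follows. This is a minimal sketch under stated assumptions, not the FPRC-GBT pipeline: it uses the simplest combinatorial form of Forman's curvature for an unweighted graph edge, F(u, v) = 4 - deg(u) - deg(v), whereas the paper works on simplicial complexes and may use a different form. The point cloud, radii, and summary statistics (minimum, maximum, mean) are hypothetical choices.

```python
import itertools
import numpy as np

def rips_edges(points, radius):
    """Edges between all pairs of points at distance <= radius."""
    points = np.asarray(points, dtype=float)
    return [(i, j)
            for i, j in itertools.combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) <= radius]

def forman_curvatures(edges):
    """Edge curvature 4 - deg(u) - deg(v) for an unweighted graph (assumed form)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

def persistent_forman_features(points, radii):
    """(min, max, mean) of edge curvatures at each filtration radius."""
    features = []
    for radius in radii:
        values = list(forman_curvatures(rips_edges(points, radius)).values())
        if values:
            features.append((min(values), max(values), float(np.mean(values))))
        else:
            features.append((0, 0, 0.0))  # convention (assumption) for no edges
    return features

# Hypothetical point cloud standing in for atom coordinates.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
radii = [0.5, 1.0, 1.5, 8.0]

features = persistent_forman_features(points, radii)
print(features)
```

    Flattening `features` gives a fixed-length descriptor vector that could be fed to a gradient boosting model, for example scikit-learn's `GradientBoostingRegressor`.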


    Theses



    [2] Wee JunJie. (2023). Geometric and Topological AI for Molecular Sciences. Doctoral Thesis. Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165903

    Data-driven sciences are widely regarded as the fourth paradigm of science that will fundamentally change society and our daily lives. Indeed, artificial intelligence (AI) models have already revolutionized and transformed various data-intensive industries. Machine learning (ML) and deep learning models have achieved unprecedented performance in image, text, audio, video, and network data analysis. This is largely due to three major advancements, i.e., accumulation of big data, rise in computational power, and design of highly efficient algorithms. In particular, AlphaFold2 made a remarkable achievement for protein-folding problems, which heralds a new era for AI-based molecular data analysis for materials, chemistry, and biology. Along with excitement and opportunities, AI for molecular sciences also comes with challenges. In this dissertation, we will tackle one of the main challenges in AI for molecular sciences, which is constructing or designing effective molecular descriptors and fingerprints. Ideally, effective molecular descriptors should preserve the most important features while still possessing the ability to capture the intrinsic molecular properties and information that directly dictate molecular functions. In this way, they can be better “understood” by ML models. This has inspired various researchers to apply topological data analysis (TDA), where persistent homology (PH) and its intrinsic topological invariants act as an excellent and robust molecular featurization method to capture and characterize the underlying topological information in biomolecular systems. By extending beyond the capabilities of TDA, we propose geometric and topological AI for molecular sciences.
    In this dissertation, two novel persistent functions, namely persistent Ricci curvature (PRC) and persistent Dirac operators, are developed as new advanced mathematics-based molecular featurization methods, which can be used to build ML models that perform unsupervised and supervised learning in molecular sciences. In biological data, we built Ollivier persistent Ricci curvature and Forman persistent Ricci curvature-based ML models to predict protein-ligand binding affinity values. Also, we constructed a persistent spectral-based ensemble learning model (PerSpect-EL) to capture and characterize protein-protein interactions upon mutational change. Our PerSpect-EL model has outperformed several existing traditional molecular descriptor-based models in protein-protein binding affinity change predictions. In materials data, we designed both PH-based and persistent Forman curvature (PFC)-based ML models to characterize organic-inorganic halide perovskites (OIHPs). Essentially, our PH-based and PFC-based molecular features produced strong discriminating power in classifying 9 types of OIHPs. Both models have also outperformed traditional perovskite descriptors in materials property predictions such as bandgap, dielectric constant, and refractive index.