Wee JunJie

[27] Hai Khoi Le, JunJie Wee. A Korenblum Maximum Principle for Weighted Hilbert spaces of entire Dirichlet series with real frequencies. (2025). Journal of Mathematics and Mathematical Sciences. (accepted)

In this paper, we study a Korenblum Maximum Principle for weighted Hilbert spaces of entire Dirichlet series with real frequencies. We investigate dominating sets for which the Korenblum Maximum Principle must hold. The results obtained imply that a dominating set, if it exists, must be a left half-plane. This provides a new perspective for studying the Korenblum Maximum Principle on function spaces containing entire Dirichlet series.

[26] JunJie Wee, Jian Jiang. A review of topological data analysis and topological deep learning in molecular sciences. (2025). Journal of Chemical Information and Modeling. (accepted)

Topological Data Analysis (TDA) has emerged as a powerful framework for extracting robust, multiscale, and interpretable features from complex molecular data for artificial intelligence (AI) modeling and topological deep learning (TDL). This review provides a comprehensive overview of the development, methodologies, and applications of TDA in molecular sciences. We trace the evolution of TDA from early qualitative tools to advanced quantitative and predictive models, highlighting innovations such as persistent homology, persistent Laplacians, and topological machine learning. The paper explores TDA’s transformative impact across diverse domains, including biomolecular stability, protein–ligand interactions, drug discovery, materials science, and viral evolution. Special attention is given to recent advances in integrating TDA with machine learning and AI, enabling breakthroughs in protein engineering, solubility and toxicity prediction, and the discovery of novel materials and therapeutics. We also discuss the limitations of current TDA approaches and outline future directions, including the integration of TDA with advanced AI models and the development of new topological invariants. This review aims to serve as a foundational reference for researchers seeking to harness the power of topology in molecular science.

[25] Xiang Liu, JunJie Wee, Guo-Wei Wei. Topological Machine Learning for Protein-Nucleic Acid Binding Affinity Changes Upon Mutation, (2025). Machine Learning Science and Technology. (accepted)

Understanding how protein mutations affect protein-nucleic acid binding is critical for unraveling disease mechanisms and advancing therapies. Current experimental approaches are laborious, and computational methods remain limited in accuracy. To address this challenge, we propose a novel topological machine learning model (TopoML) combining persistent Laplacian (from topological data analysis) with multi-perspective features: physicochemical properties, topological structures, and protein Transformer-derived sequence embeddings. This integrative framework captures robust representations of protein-nucleic acid binding interactions. To validate the proposed method, we employ two datasets, a protein-DNA dataset with 596 single-point amino acid mutations, and a protein-RNA dataset with 710 single-point amino acid mutations. We show that the proposed TopoML model outperforms state-of-the-art methods in predicting mutation-induced binding affinity changes for protein-DNA and protein-RNA complexes.

[24] Hongsong Feng, Faisal Suwayyid, Mushal Zia, JunJie Wee, Yuta Hozumi, Chunlong Chen, Guo-Wei Wei. CAML: Commutative algebra machine learning--a case study on protein-ligand binding affinity prediction, (2025). Journal of Chemical Information and Modeling, 65, 13, 6732–6743

Recently, Suwayyid and Wei have introduced commutative algebra as an emerging paradigm for machine learning and data science. In this work, we integrate commutative algebra machine learning (CAML) for the prediction of protein-ligand binding affinities. Specifically, we apply persistent Stanley-Reisner theory, a key concept in combinatorial commutative algebra, to the affinity predictions of protein-ligand binding and metalloprotein-ligand binding. We introduce three new algorithms, i.e., element-specific commutative algebra, category-specific commutative algebra, and commutative algebra on bipartite complexes, to address the complexity of data involved in (metallo) protein-ligand complexes. We show that the proposed CAML outperforms other state-of-the-art methods in (metallo) protein-ligand binding affinity predictions.

[23] JunJie Wee, Guo-Wei Wei. Rapid response to fast viral evolution using AlphaFold 3-assisted topological deep learning, (2025). Virus Evolution, 11(1), veaf026

The fast evolution of SARS-CoV-2 and other infectious viruses poses a grand challenge to the rapid response in terms of viral tracking, diagnostics, and design and manufacture of monoclonal antibodies (mAbs) and vaccines, which are both time-consuming and costly. This underscores the need for efficient computational approaches. Recent advancements, like topological deep learning (TDL), have introduced powerful tools for forecasting emerging dominant variants, yet they require deep mutational scanning (DMS) of viral surface proteins and associated three-dimensional (3D) protein-protein interaction (PPI) complex structures. We propose an AlphaFold 3 (AF3)-assisted multi-task topological Laplacian (MT-TopLap) strategy to address this need. MT-TopLap combines deep learning with topological data analysis (TDA) models, such as persistent Laplacians (PL), to extract detailed topological and geometric characteristics of PPIs, thereby enhancing the prediction of DMS and binding free energy (BFE) changes upon virus mutations. Validation with four experimental DMS datasets of SARS-CoV-2 spike receptor-binding domain (RBD) and the human angiotensin-converting enzyme-2 (ACE2) complexes indicates that our AF3-assisted MT-TopLap strategy maintains robust performance, with only an average 1.1% decrease in Pearson correlation coefficients (PCC) and an average 9.3% increase in root mean square errors (RMSE), compared with the use of experimental structures. Additionally, AF3-assisted MT-TopLap achieved a PCC of 0.81 when tested with a SARS-CoV-2 HK.3 variant DMS dataset, confirming its capability to accurately predict BFE changes and adapt to new experimental data, thereby showcasing its potential for rapid and effective response to fast viral evolution.

[22] Joshua Zhi En Tan, JunJie Wee, Xue Gong, Kelin Xia. Topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction, (2025). Journal of Chemical Information and Modeling, 65, 8, 4232–4242

Recently, therapeutic peptides have demonstrated great promise for cancer treatment. To explore powerful anticancer peptides, artificial intelligence (AI)-based approaches have been developed to systematically screen potential candidates. However, the lack of efficient featurization of peptides has become a bottleneck for these machine-learning models. In this paper, we propose a topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction. Our Top-ML employs peptide topological features derived from its sequence "connection" information characterized by vector and spectral descriptors. Our Top-ML model has been validated on two widely used AntiCP 2.0 benchmark datasets and has achieved state-of-the-art performance. Our results highlight the potential of leveraging novel topology-based featurization to accelerate the identification of anticancer peptides.

[21] JunJie Wee, Xue Gong, Wilderich Tuschmann, Kelin Xia. A cohomology-based Gromov-Hausdorff metric approach for quantifying molecular similarity, (2025). Scientific Reports. 15(1), 10458.

We introduce, for the first time, a cohomology-based Gromov-Hausdorff ultrametric method to analyze 1-dimensional and higher-dimensional (co)homology groups, focusing on loops, voids, and higher-dimensional cavity structures in simplicial complexes, to address typical clustering questions arising in molecular data analysis. The Gromov-Hausdorff distance quantifies the dissimilarity between two metric spaces. In this framework, molecules are represented as simplicial complexes, and their cohomology vector spaces are computed to capture intrinsic topological invariants encoding loop and cavity structures. These vector spaces are equipped with a suitable distance measure, enabling the computation of the Gromov-Hausdorff ultrametric to evaluate structural dissimilarities. We demonstrate the methodology using organic-inorganic halide perovskite (OIHP) structures. The results highlight the effectiveness of this approach in clustering various molecular structures. By incorporating geometric information, our method provides deeper insights compared to traditional persistent homology techniques.

[20] Dong Chen, Gengzhuo Liu, Hongyan Du, JunJie Wee, Rui Wang, Jiahui Chen, Jana Shen, Guo-Wei Wei. Drug Resistance Predictions Based on a Directed Flag Transformer, (2024). Advanced Science, e02756.

The continuous evolution of the SARS-CoV-2 virus poses a significant challenge to global public health. Of particular concern is the potential resistance to the widely prescribed drug PAXLOVID, of which the main ingredient nirmatrelvir inhibits the viral main protease (Mpro). Here, we developed CAPTURE (direCted flAg laPlacian Transformer for drUg Resistance prEdictions) to analyze the effects of Mpro mutations on nirmatrelvir-Mpro binding affinities and identify potential drug-resistant mutations. CAPTURE combines a comprehensive mutation analysis with a resistance prediction module based on DFFormer-seq, which is a novel ensemble model that leverages a new Directed Flag Transformer and sequence embeddings from the protein and small-molecule-large-language models. Our analysis of the evolution of Mpro mutations revealed a progressive increase in mutation frequencies for residues near the binding site between May and December 2022, suggesting that the widespread use of PAXLOVID created a selective pressure that accelerated the evolution of drug-resistant variants. Applied to mutations at the nirmatrelvir-Mpro binding site, CAPTURE identified several potential resistance mutations, including H172Y and F140L, which have been experimentally confirmed, as well as five other mutations that await experimental verification. CAPTURE evaluation in a limited experimental data set on Mpro mutants gives a recall of 57% and a precision of 71% for predicting potential drug-resistant mutations. Our work establishes a powerful new framework for predicting drug-resistant mutations and real-time viral surveillance. The insights also guide the rational design of more resilient next-generation therapeutics.

[19] JunJie Wee, Jiahui Chen, Guo-Wei Wei. Preventing future zoonosis: SARS-CoV-2 mutations enhance human–animal cross-transmission. (2024). Computers in Biology and Medicine, 182, 109101.

The COVID-19 pandemic has driven substantial evolution of the SARS-CoV-2 virus, yielding subvariants that exhibit enhanced infectiousness in humans. However, this adaptive advantage may not universally extend to zoonotic transmission. In this work, we hypothesize that viral adaptations favoring animal hosts do not necessarily correlate with increased human infectivity. In addition, we consider the potential for gain-of-function mutations that could facilitate the virus’s rapid evolution in humans following adaptation in animal hosts. Specifically, we identify the SARS-CoV-2 receptor-binding domain (RBD) mutations that enhance human–animal cross-transmission. To this end, we construct a multitask deep learning model, MT-TopLap trained on multiple deep mutational scanning datasets, to accurately predict the binding free energy changes upon mutation for the RBD to ACE2 of various species, including humans, cats, bats, deer, and hamsters. By analyzing these changes, we identified key RBD mutations such as Q498H in SARS-CoV-2 and R493K in the BA.2 variant that are likely to increase the potential for human–animal cross-transmission.

[18] JunJie Wee, Guo-Wei Wei. Evaluation of AlphaFold 3’s Protein–Protein Complexes for Predicting Binding Free Energy Changes upon Mutation. (2024). Journal of Chemical Information and Modeling, 64, 16, 6676–6683.

AlphaFold 3 (AF3), the latest version of protein structure prediction software, goes beyond its predecessors by predicting protein-protein complexes. It could revolutionize drug discovery and protein engineering, marking a major step towards comprehensive, automated protein structure prediction. However, independent validation of AF3's predictions is necessary. Evaluated using the SKEMPI 2.0 database which involves 317 protein-protein complexes and 8338 mutations, AF3 complex structures give rise to a very good Pearson correlation coefficient of 0.86 for predicting protein-protein binding free energy changes upon mutation, slightly less than the 0.88 achieved earlier with the Protein Data Bank (PDB) structures. Nonetheless, AF3 complex structures led to an 8.6% increase in the prediction RMSE compared to original PDB complex structures. Additionally, some of AF3's complex structures have large errors, which were not captured in its ipTM performance metric. Finally, it is found that AF3's complex structures are not reliable for intrinsically flexible regions or domains.

[17] JunJie Wee, Jiahui Chen, Kelin Xia, Guo-Wei Wei. Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation. Computers in Biology and Medicine 169 (2024). 107918.

Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task, as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there are no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and a pre-trained Transformer for the task. The Transformer, pre-trained with hundreds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves the state-of-the-art up to 15%.

[16] Cong Shen, Pingjian Ding, JunJie Wee, Jialin Bi, Jiawei Luo, Kelin Xia. Curvature-enhanced Graph Convolutional Network for Biomolecular Interaction Prediction. Computational and Structural Biotechnology Journal (2024).

Geometric deep learning has demonstrated great potential in non-Euclidean data analysis. The incorporation of geometric insights into learning architecture is vital to its success. Here we propose a curvature-enhanced graph convolutional network (CGCN) for biomolecular interaction prediction, for the first time. Our CGCN employs Ollivier-Ricci curvature (ORC) to characterize network local structures and to enhance the learning capability of GCNs. More specifically, ORCs are evaluated based on the local topology from node neighborhoods, and further used as weights for the feature aggregation in the message-passing procedure. Our CGCN model is extensively validated on fourteen real-world biomolecular interaction networks and a series of simulated data. It has been found that our CGCN can achieve state-of-the-art results. It outperforms all existing models, as far as we know, in thirteen out of the fourteen real-world datasets and ranks second in the remaining one. The results from the simulated data show that our CGCN model is superior to the traditional GCN models regardless of the positive-to-negative curvature ratios, network densities, and network sizes (when larger than 500).

[15] Wee JunJie. (2023). Geometric and Topological AI for Molecular Sciences. Doctoral Thesis. Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165903

Data-driven sciences are widely regarded as the fourth paradigm of sciences that will fundamentally change society and our daily lives. Indeed, artificial intelligence (AI) models have already revolutionized and transformed various data-intensive industries. Machine learning (ML) and deep learning models have achieved unprecedented performance in image, text, audio, video, and network data analysis. This is largely due to three major advancements, i.e., accumulation of big data, rise in computational power, and design of highly efficient algorithms. In particular, AlphaFold2 made a remarkable achievement for protein-folding problems which heralds a new era for AI-based molecular data analysis for materials, chemistry, and biology. With excitement and opportunities, AI for molecular sciences also comes with challenges. In this dissertation, we will tackle one of the main challenges in AI for molecular sciences, which is constructing or designing effective molecular descriptors and fingerprints. Ideally, effective molecular descriptors should preserve the most important features while still possessing the ability to capture the intrinsic molecular properties and information that directly dictate molecular functions. In this way, they can be better “understood” by ML models. This has inspired various researchers to apply topological data analysis (TDA) where persistent homology (PH) and its intrinsic topological invariants act as an excellent and robust molecular featurization method to capture and characterize the underlying topological information in biomolecular systems. By extending beyond the capabilities of TDA, we propose geometric and topological AI for molecular sciences.
In this dissertation, two novel persistent functions, namely persistent Ricci curvature (PRC) and persistent Dirac operators are developed as new advanced mathematics-based molecular featurization which can build advanced mathematics-based ML models to perform unsupervised and supervised learning in molecular sciences. In biological data, we built Ollivier persistent Ricci curvature and Forman persistent Ricci curvature-based ML models to predict protein-ligand binding affinity values. Also, we constructed a persistent spectral-based ensemble learning model (PerSpect-EL) to capture and characterize protein-protein interactions upon mutational change. Our PerSpect-EL model has outperformed several existing traditional molecular descriptor-based models in protein-protein binding affinity change predictions. In materials data, we designed both PH-based and persistent Forman curvature (PFC)-based ML models to characterize organic-inorganic halide perovskites (OIHPs). Essentially, our PH-based and PFC-based molecular features produced strong discriminating power in classifying 9 types of OIHPs. Both models have also outperformed traditional perovskite descriptors in materials property predictions such as bandgap, dielectric constant, and refractive index.

[14] Jialin Bi, JunJie Wee, Xiang Liu, Cunquan Qu, Guanghui Wang, and Kelin Xia. Multiscale Topological Indices for the Quantitative Prediction of SARS CoV-2 Binding Affinity Change upon Mutations. Journal of Chemical Information and Modeling (2023).

The Coronavirus disease 2019 (COVID-19) has affected people’s lives and the development of the global economy. Biologically, protein–protein interactions between SARS-CoV-2 surface spike (S) protein and human ACE2 protein are the key mechanism behind the COVID-19 disease. In this study, we provide insights into interactions between the SARS-CoV-2 S-protein and ACE2, and propose topological indices to quantitatively characterize the impact of mutations on binding affinity changes (ΔΔG). In our model, a series of nested simplicial complexes and their related adjacency matrices at various scales are generated from a specially designed filtration process, based on the 3D structures of spike-ACE2 protein complexes. We develop a set of multiscale simplicial complexes-based topological indices, for the first time. Unlike previous graph network models, which give only a qualitative analysis, our topological indices can provide a quantitative prediction of the binding affinity change caused by mutations and achieve great accuracy. In particular, for mutations that happened at specific amino acids, such as polar amino acids or arginine, the correlation between our topological gravity model index and binding affinity change, in terms of Pearson correlation coefficient, can be higher than 0.8. As far as we know, this is the first time multiscale topological indices have been used in the quantitative analysis of protein–protein interactions.

[13] Wee JunJie, Ginestra Bianconi, Xia Kelin. Persistent Dirac for molecular representation. (2023). Scientific Reports. 13(1), 11183.

Molecular representations are of fundamental importance for the modeling and analysis of molecular systems. Representation models and in general approaches based on topological data analysis (TDA) have demonstrated great success in various steps of drug design and materials discovery. Here we develop a mathematically rigorous computational framework for molecular representation based on the persistent Dirac operator. The properties of the spectrum of the discrete weighted and unweighted Dirac matrices are systematically discussed and used to demonstrate the geometric and topological properties of both non-homology and homology eigenvectors of real molecular structures. This allows us to assess the influence of weighting schemes on the information encoded in the Dirac eigenspectrum. A series of physical persistent attributes, which characterize the spectrum of the Dirac matrices across a filtration, are proposed and used as efficient molecular fingerprints. Finally, our persistent Dirac-based model is used for clustering molecular configurations from nine types of organic-inorganic halide perovskites. We found that our model can cluster the structures very well, demonstrating the representation and featurization power of the current approach.

[12] Choo Hou Yee, Wee JunJie, Shen Cong, Xia Kelin. Fingerprint-Enhanced Graph Attention Network (FinGAT) Model for Antibiotic Discovery. Journal of Chemical Information and Modeling, 63(10), 2928–2935. (2023).

Artificial Intelligence (AI) techniques are of great potential to fundamentally change antibiotic discovery industries. Efficient and effective molecular featurization is key to all highly accurate learning models for antibiotic discovery. In this paper, we propose a fingerprint-enhanced graph attention network (FinGAT) model by the combination of sequence-based 2D fingerprints and structure-based graph representation. In our feature learning process, sequence information is transformed into a fingerprint vector, and structural information is encoded through a GAT module into another vector. These two vectors are concatenated and input into a multilayer perceptron (MLP) for antibiotic activity classification. Our model is extensively tested and compared with existing models. It has been found that our FinGAT can outperform various state-of-the-art GNN models in antibiotic discovery.

[11] Wee JunJie, Le Hai Khoi. On Korenblum Constants for some weighted function spaces. Journal of Mathematics and Mathematical Sciences, 2(8). (2023).

In this paper, we survey the results on the Korenblum Maximum Principle for some weighted function spaces. Progress and results discussed include the upper bounds and lower bounds of Korenblum constants, as well as the failure of the principle for weighted Bergman spaces, weighted Hardy spaces, weighted Bloch spaces, weighted Fock spaces, and mixed norm spaces. Existing and new open questions are provided.

[10] Xia Kelin, Liu Xiang, Wee JunJie. Persistent Homology for RNA Data Analysis. In: Filipek, S. (eds) Homology Modeling. Methods in Molecular Biology, (2023). vol 2627. Humana, New York, NY. (Book Chapter)

Molecular representations are of great importance for machine learning models in RNA data analysis. Essentially, efficient molecular descriptors or fingerprints that characterize the intrinsic structural and interactional information of RNAs can significantly boost the performance of all learning models. In this paper, we introduce two persistent models, persistent homology and persistent spectral models, for RNA structure and interaction representations and their applications in RNA data analysis. Different from traditional geometric and graph representations, persistent homology is built on simplicial complexes, which are a generalization of graph models to higher-dimensional situations. A hypergraph is a further generalization of a simplicial complex, and hypergraph-based embedded persistent homology has been proposed recently. Moreover, persistent spectral models, which combine a filtration process with spectral models, including spectral graph, spectral simplicial complex, and spectral hypergraph, are proposed for molecular representation. The persistent attributes for RNAs can be obtained from these two persistent models and further combined with machine learning models for RNA structure, flexibility, dynamics, and function analysis.

[9] D. Vijay Anand, Qiang Xu, JunJie Wee, Kelin Xia, Tze Chien Sum. Topological Feature Engineering for Machine Learning Based Halide Perovskite Materials Design. npj Computational Materials. (2022). 8(1), 203.

Accelerated materials development with machine learning (ML) assisted screening and high throughput experimentation for new photovoltaic materials holds the key to addressing our grand energy challenges. Data-driven ML is envisaged as a decisive enabler for new perovskite materials discovery. However, its full potential can be severely curtailed by poorly represented molecular descriptors (or fingerprints). Optimal descriptors are essential for establishing effective mathematical representations of quantitative structure-property relationships. Here we reveal that our persistent functions (PFs) based learning models offer significant accuracy advantages over traditional descriptor based models in organic-inorganic halide perovskite (OIHP) materials design and have similar performance as deep learning models. Our multiscale simplicial complex approach not only provides a more precise representation for OIHP structures and underlying interactions, but also has better transferability to ML models. Our results demonstrate that advanced geometrical and topological invariants are highly efficient feature engineering approaches that can markedly improve the performance of learning models for molecular data analysis. Further, new structure-property relationships can be established between our invariants and bandgaps. We anticipate that our molecular representations and featurization models will transcend the limitations of conventional approaches and lead to breakthroughs in perovskite materials design and discovery.

[8] Wee JunJie, Le Hai Khoi. Korenblum constants for various weighted Fock spaces. Complex Variables and Elliptic Equations. (2022), pp. 1-22.

We study the Korenblum Maximum Principle for various weighted Fock spaces. The main tool relies on applying special cases of Ramanujan's Master Theorem involving the Gamma function and the Mellin transform of Dirichlet series. It is interesting that with elementary probe functions, we still can obtain closed-form upper bound expressions, in terms of several well-known special functions, for the Korenblum constants of various weighted Fock spaces. For the first time, we obtain upper bounds of the Korenblum constant for the finite and infinite intersections of weighted Fock spaces and proved that the Korenblum Maximum Principle fails for those infinite intersections of weighted Fock spaces. Some open questions are provided.

[7] Ronald Koh Joon Wei, JunJie Wee, Valerie Evangelin Laurent and Kelin Xia. Hodge theory-based biomolecular data analysis. (2022). Scientific Reports. 12(1), 1-16.

Hodge theory reveals the deep intrinsic relations of differential forms and provides a bridge between differential geometry, algebraic topology, and functional analysis. Here we use Hodge Laplacian and Hodge decomposition models to analyze biomolecular structures. Different from traditional graph-based methods, biomolecular structures are represented as simplicial complexes, which can be viewed as a generalization of graph models to their higher-dimensional counterparts. Hodge Laplacian matrices at different dimensions can be generated from the simplicial complex. The spectral information of these matrices can be used to study intrinsic topological information of biomolecular structures. Essentially, the number (or multiplicity) of k-th dimensional zero eigenvalues is equivalent to the k-th Betti number, i.e., the rank of the k-th homology group. The associated eigenvectors indicate the homological generators, i.e., circles or holes within the molecular-based simplicial complex. Furthermore, a Hodge decomposition-based HodgeRank model is used to characterize the folding or compactness of the molecular structures, in particular, the topologically associating domain (TAD) in high-throughput chromosome conformation capture (Hi-C) data. Mathematically, molecular structures are represented in simplicial complexes with certain edge flows. The HodgeRank-based average/total inconsistency (AI/TI) is used for the quantitative measurements of the folding or compactness of TADs. This is the first quantitative measurement for TAD regions, as far as we know.

[6] Gong Weikang, Wee JunJie, Wu Min-Chun, Sun Xiaohan, Li Chunhua, Xia Kelin. Persistent spectral simplicial complex-based machine learning (PerSpectSC-ML) for chromosomal structural analysis in cellular differentiation. Briefings in Bioinformatics. (2022). Volume 23, Issue 4, bbac168.

The three-dimensional (3D) chromosomal structure plays an essential role in all DNA-templated processes, including gene transcription, DNA replication, and other cellular processes. Although chromosome conformation capture (3C) methods, such as Hi-C, can generate chromosomal contact data that characterize genome-wide chromosomal structural properties, an understanding of the 3D genomic nature based on Hi-C data remains lacking. Here, we propose a persistent spectral simplicial complex (PerSpectSC) model to describe Hi-C data for the first time. Specifically, a filtration process is introduced to generate a series of nested simplicial complexes at different scales. For each of these simplicial complexes, its spectral information can be calculated from the corresponding Hodge Laplacian matrix. The PerSpectSC model describes the persistence and variation of the spectral information of the nested simplicial complexes during the filtration process. Different from all previous models, our PerSpectSC-based features provide a quantitative global-scale characterization of chromosome structures and topology. Our descriptors can successfully classify cell types and also cellular differentiation stages for all the 24 types of chromosomes simultaneously. In particular, persistent minimum best characterizes cell types and Dim (1) persistent multiplicity best characterizes cellular differentiation. These results demonstrate the great potential of our PerSpectSC-based models in polymeric data analysis.

[5] Wee JunJie, Xia Kelin. Persistent spectral based ensemble learning (PerSpect-EL) for protein-protein binding affinity prediction. Briefings in Bioinformatics. (2022). Volume 23, Issue 2, bbac024.

Protein-protein interactions (PPIs) play a significant role in nearly all cellular and biological activities. Data-driven machine learning models have demonstrated great power in PPIs. However, the design of efficient molecular featurization poses a great challenge for all learning models for PPIs. Here we propose persistent spectral (PerSpect) based PPI representation and featurization, and PerSpect based ensemble learning (PerSpect-EL) models for PPI binding affinity prediction, for the first time. In our model, a sequence of Hodge (or combinatorial) Laplacian (HL) matrices at various different scales are generated from a specially-designed filtration process. PerSpect attributes, which are statistical and combinatorial properties of spectrum information from these HL matrices, are used as features for PPI characterization. Each PerSpect attribute is input into a 1D convolutional neural network (CNN), and these CNN networks are stacked together in our PerSpect based ensemble learning models. We systematically test our model on the two most commonly-used datasets, i.e., SKEMPI and AB-Bind. It has been found that our model can achieve state-of-the-art results and outperform all existing models to the best of our knowledge.

[4] Wee JunJie, Xia Kelin. Forman persistent Ricci curvature (FPRC) based machine learning for protein-ligand binding affinity prediction. Briefings in Bioinformatics (2021). Volume 22, Issue 6, bbab136.

Artificial intelligence (AI) techniques have gradually been applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering, for the first time. Molecular structures and interactions are modeled as simplicial complexes, which are generalizations of graphs to their higher dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales is generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process are defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors, and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on the three most commonly used datasets, namely PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.
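
As a rough illustration of the filtration idea in this abstract, the sketch below tracks edge curvatures on a toy point set as a distance cutoff grows. It is not the paper's implementation: it assumes the simplified Forman formula for unweighted graphs, F(e) = 4 - deg(u) - deg(v), which ignores higher-dimensional simplices and weights, and the names `forman_edge_curvatures` and `persistent_attributes` are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean


def forman_edge_curvatures(points, cutoff):
    """Curvature of each edge in the graph joining points within `cutoff`.

    Assumption: simplified unweighted graph formula
    F(e) = 4 - deg(u) - deg(v); triangles and weights are ignored.
    """
    edges = [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= cutoff
    ]
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return {(i, j): 4 - degree[i] - degree[j] for i, j in edges}


def persistent_attributes(points, cutoffs):
    """Summarize curvatures (min, max, mean) at each filtration value."""
    rows = []
    for r in cutoffs:
        curvatures = list(forman_edge_curvatures(points, r).values())
        if curvatures:
            rows.append((r, min(curvatures), max(curvatures), mean(curvatures)))
        else:
            # No edges yet at this scale; use zeros as a placeholder.
            rows.append((r, 0, 0, 0))
    return rows


# Toy input (not molecular data): the four corners of a unit square.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for row in persistent_attributes(points, [0.5, 1.0, 1.5]):
    print(row)
```

On this toy input the graph grows from no edges, to a 4-cycle, to a complete graph, and the per-scale summaries play the role of the persistent attributes that would be passed to a learning model.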

[3] Wee JunJie, Xia Kelin. Ollivier Persistent Ricci Curvature-Based Machine Learning for the Protein-Ligand Binding Affinity Prediction. Journal of Chemical Information and Modeling (2021). 61(4), 1617-1626.

Efficient molecular featurization is one of the major issues for machine learning models in drug design. Here, we propose a persistent Ricci curvature (PRC), in particular, Ollivier PRC (OPRC), for molecular featurization and feature engineering, for the first time. The filtration process proposed in persistent homology is employed to generate a series of nested molecular graphs. Persistence and variation of Ollivier Ricci curvatures on these nested graphs are defined as OPRC. Moreover, persistent attributes, which are statistical and combinatorial properties of OPRCs during the filtration process, are used as molecular descriptors and further combined with machine learning models, in particular, gradient boosting tree (GBT). Our OPRC-GBT model is used in the prediction of the protein–ligand binding affinity, which is one of the key steps in drug design. Based on three of the most commonly used data sets from the well-established protein–ligand binding databank, that is, PDBbind, we intensively test our model and compare it with existing models. It has been found that our model can achieve state-of-the-art results and has advantages over traditional molecular descriptors.

[2] Wee JunJie, Le Hai Khoi. Korenblum constants for some function spaces. Proc. Amer. Math. Soc. 148 (2020), pp. 1175-1185.

We study the Korenblum Maximum Principle on the weighted Fock spaces and the weighted Bergman spaces with exponential weights. First, we give explicit expressions for the upper bounds of Korenblum constants for the weighted Fock spaces. Then, we obtain upper bounds of such constants for the weighted Bergman spaces. Finally, we show a failure of the Korenblum Maximum Principle for weighted Bergman spaces $A^p_\alpha(\mathbb{D})$ where $p\in (0,1)$ and $\alpha>0$.

[1] Wee JunJie. (2019). The Korenblum Maximum Principle for some function spaces. Bachelor Thesis. Nanyang Technological University, Singapore. https://hdl.handle.net/10356/77142

We study the Korenblum Maximum Principle on the weighted Fock space $\mathcal{F}^p_\alpha(\mathbb{C})$ and the weighted Bergman space $A^p_\alpha(\mathbb{D})$ under exponential weights $e^{-\frac{p\alpha}{2}|z|^2}$. We obtain explicit expressions for the upper bounds of Korenblum constants for the weighted Fock space $\mathcal{F}^p_\alpha(\mathbb{C})$, $p\geq 1$ and $\alpha > 0$. Then, we obtain upper bounds of such constants for the weighted Bergman space $A^p_\alpha(\mathbb{D})$, $p \geq 1$ and $\alpha \geq 0$. We also show a failure of the Korenblum Maximum Principle for the weighted Bergman space $A^p_\alpha(\mathbb{D})$, where $p\in (0,1)$, $\alpha > 0$, thus bringing closure to the problem for the weighted Bergman space $A^p_\alpha(\mathbb{D})$ where $\alpha > 0$.
