[2] Wee JunJie. (2023). Geometric and Topological AI for Molecular Sciences. Doctoral Thesis. Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165903
Data-driven science is widely regarded as the fourth paradigm of science, one that will fundamentally change society and our daily lives. Indeed, artificial intelligence (AI) models have already transformed various data-intensive industries. Machine learning (ML) and deep learning models have achieved unprecedented performance in image, text, audio, video, and network data analysis. This is largely due to three major advancements: the accumulation of big data, the rise in computational power, and the design of highly efficient algorithms. In particular, the remarkable achievement of AlphaFold2 on the protein-folding problem heralds a new era of AI-based molecular data analysis for materials, chemistry, and biology. Along with excitement and opportunities, AI for molecular sciences also comes with challenges. In this dissertation, we tackle one of the main challenges in AI for molecular sciences: designing effective molecular descriptors and fingerprints. Ideally, effective molecular descriptors should preserve the most important features while still capturing the intrinsic molecular properties and information that directly dictate molecular functions. In this way, they can be better “understood” by ML models. This has inspired various researchers to apply topological data analysis (TDA), in which persistent homology (PH) and its intrinsic topological invariants serve as a robust molecular featurization method that captures and characterizes the underlying topological information in biomolecular systems. Extending beyond the capabilities of TDA, we propose geometric and topological AI for molecular sciences.
In this dissertation, two novel persistent functions, namely persistent Ricci curvature (PRC) and persistent Dirac operators, are developed as new mathematics-based molecular featurization methods, which can be used to build ML models for unsupervised and supervised learning in molecular sciences. For biological data, we built Ollivier persistent Ricci curvature-based and Forman persistent Ricci curvature-based ML models to predict protein-ligand binding affinities. We also constructed a persistent spectral-based ensemble learning model (PerSpect-EL) to capture and characterize changes in protein-protein interactions upon mutation. Our PerSpect-EL model outperformed several existing models based on traditional molecular descriptors in predicting protein-protein binding affinity changes. For materials data, we designed both PH-based and persistent Forman curvature (PFC)-based ML models to characterize organic-inorganic halide perovskites (OIHPs). Our PH-based and PFC-based molecular features showed strong discriminating power in classifying 9 types of OIHPs. Both models also outperformed traditional perovskite descriptors in predicting materials properties such as bandgap, dielectric constant, and refractive index.