mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Heng Ji, University of Illinois Urbana-Champaign
[Image: CELS seminar graphic featuring the title and date]

Abstract: Unlike machines, human scientists are inherently “multilingual”, seamlessly navigating diverse modalities, from natural language in the literature to complex scientific data such as molecular structures in knowledge bases. Human scientists also “think before they talk”: they ground their reasoning in deliberate reflection and subject new ideas to critical evaluation and verification. Against this backdrop, I argue that there is a fundamental but correctable mismatch between the way LLMs work and the way scientists traditionally discover and verify new research hypotheses. I propose to design new LLM paradigms by drawing inspiration from the scientific discovery process itself: (1) “Observe”: acquire, represent, and integrate knowledge from multiple data modalities; (2) “Think”: think critically to generate hypotheses; and (3) “Propose and Verify”: verify hypotheses through the physical world.

As a prototype example, I will present mCLM, a modular Chemical-Language Model that speaks two complementary languages: one that represents molecular building blocks indicative of specific functions and compatible with automated modular assembly, and another that describes these functions in natural language. Experiments on 430 FDA-approved drugs showed that mCLM is capable of significantly improving chemical functions critical to determining drug potential. mCLM, with only 3B parameters, also achieves improvements in function scores and synthetic accessibility relative to 7 other leading generative AI methods, including GPT-5. mCLM can also reason over multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials (“fallen angels”). Preliminary animal testing results further underscore the promise of this approach. I will then talk about the challenges we face in adapting this model to discover organic photovoltaic (OPV) materials, especially in LLM architecture design, and propose some initial solutions.

In the long term, I envision a comprehensive, multi-agent, human-in-the-loop autonomous laboratory, structured around iterative cycles of reasoning, proposal, synthesis, physical testing, feedback, and renewed reasoning, to enable never-ending self-improvement and co-evolution with human scientists.

Speaker: Heng Ji is a Professor of Computer Science at the Siebel School of Computing and Data Science, and a faculty member affiliated with the Electrical and Computer Engineering Department, the Coordinated Science Laboratory, and the Carl R. Woese Institute for Genomic Biology at the University of Illinois Urbana-Champaign. She is an Amazon Scholar. She is the Founding Director of the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) and the Founding Director of the CapitalOne-Illinois Center on AI Safety and Knowledge Systems (ASKS). She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Multimedia Multilingual Information Extraction, Knowledge-enhanced Large Language Models and Vision-Language Models, AI for Science, and Science-inspired AI. Her awards include an Outstanding Paper Award at ACL2024, two Outstanding Paper Awards at NAACL2024, "Young Scientist" by the World Laureates Association in 2023 and 2024, "Young Scientist" and membership of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017, "Women Leaders of Conversational AI" (Class of 2023) by Project Voice, the "AI's 10 to Watch" Award by IEEE Intelligent Systems in 2013, the NSF CAREER award in 2009, PACLIC2012 Best Paper runner-up, the "Best of ICDM2013" paper award, the "Best of SDM2013" paper award, an ACL2018 Best Demo Paper nomination, the ACL2020 Best Demo Paper Award, the NAACL2021 Best Demo Paper Award, the Google Research Award in 2009 and 2014, the IBM Watson Faculty Award in 2012 and 2014, and the Bosch Research Award in 2014-2018. She coordinated the NIST TAC Knowledge Base Population task from 2010 to 2020. She has served as an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and as Program Committee Co-Chair of many conferences, including NAACL-HLT2018 and AACL-IJCNLP2022.
She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2023.

Scientific discovery, especially for new drugs and materials, urgently needs our help. The traditional manual approach is highly artisanal, and thus slow and expensive. Most importantly, many commercial drugs and materials have well-documented limitations that remain unaddressed. AI for Science has become a rapidly growing field, especially through approaches powered by large language models (LLMs). However, much of the existing generative AI work either merely classifies properties of known molecules, and thus discovers nothing, or generates molecules that are chemically impossible to make.