AI Tools for Data-Driven Molecular Engineering

Diagram showing how ChemDataExtractor is used to build materials databases

A schematic showing how scientific literature is mined using ChemDataExtractor to build materials databases. These databases are then employed to generate Q&A pairs, which are used to fine-tune efficient, materials-domain-specific language models. (Image: Mayank Shreshtha, University of Cambridge.)

Case Study
Diagram showing how ChemDataExtractor is used to build materials databases

A schematic showing how scientific literature is mined using ChemDataExtractor to build materials databases. These databases are then employed to generate Q&A pairs, which are used to fine-tune efficient, materials-domain-specific language models. (Image: Mayank Shreshtha, University of Cambridge.)

 

As the volume of scientific literature grows, materials scientists face the challenge of extracting the most relevant information quickly enough to guide experiments and discover new materials for a wide range of applications. With support from ALCF computing resources, researchers from the University of Cambridge are developing AI tools that can mine millions of research papers, build structured materials databases, and train specialized language models to accelerate discovery.

Challenge

Large language models (LLMs) show promise for answering scientific questions, but training them from scratch on domain-specific data requires enormous computing resources. Even fine-tuning general-purpose models can be less effective if the training data lacks the precision needed for specialized scientific tasks. For many areas of materials science, there are also gaps in the availability of structured, machine-readable property data to enable such training.

Approach

To overcome these barriers, researchers are leveraging ALCF resources to combine automated text mining with innovative AI training approaches. Using their ChemDataExtractor natural language processing toolkit, the team auto-generates high-quality datasets from scientific articles and builds structured materials databases, which are then fed into small, efficient language models, bypassing the need for costly pretraining of archetypical LLMs. ALCF supercomputers have been central to developing these workflows, enabling large-scale data extraction, algorithm development, and model training.

Results

In one study, the team transformed a photovoltaic materials database into hundreds of thousands of structured question-and-answer (QA) pairs, achieving up to 20% higher accuracy on solar cell–related tasks than models trained on general-purpose data. They applied the same strategy to other domains, including building a stress–strain property database with more than 720,000 records to train MechBERT, a model that outperforms standard tools in predicting material behavior under stress. Another project targeted optoelectronics, showing that domain-adaptive pretraining can be done with over 80% less computational cost than traditional methods while maintaining or improving accuracy. Most recently, the group generated nearly 100,000 QA pairs from a thermoelectric materials database, demonstrating that a small BERT model fine-tuned on this dataset, and further improved by mixing it with a general-purpose dataset, can outperform models trained on either source alone.

Impact

By turning unstructured literature into targeted, machine-readable knowledge, the team is making AI tools more accessible to the materials research community. Together, their efforts illustrate a scalable approach to building smaller, faster language models for specialized materials science domains and help scientists tailor models to their own areas, design better experiments, interpret results, and accelerate the path from discovery to application.

Publications


Kumar, P., S. Kabra, and J. M. Cole. “A Database of Stress-Strain Properties Auto-generated from the Scientific Literature using ChemDataExtractor,” Scientific Data (November 2024), Springer Nature.
https://doi.org/10.1038/s41597-024-03979-6

Kumar, P., S. Kabra, and J. M. Cole. “MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain,” Journal of Chemical Information and Modeling (January 2025), American Chemical Society.
https://doi.org/10.1021/acs.jcim.4c00857

Li, Z., and J. M. Cole. “Auto-generating Question-Answering Datasets with Domain-Specific Knowledge for Language Models in Scientific Tasks,” Digital Discovery (February 2025), Royal Society of Chemistry.
https://doi.org/10.1039/D4DD00307A

Sierepeklis, O., and J. M. Cole. “Autogenerating a Domain-Specific Question-Answering Data Set from a Thermoelectric Materials Database to Enable High-Performing BERT Models,” Journal of Chemical Information and Modeling (August 2025), American Chemical Society.
https://doi.org/10.1021/acs.jcim.5c00840

 

 

Allocations
Systems