About this Abstract |
Meeting |
2024 TMS Annual Meeting & Exhibition |
Symposium |
AI/Data Informatics: Computational Model Development, Verification, Validation, and Uncertainty Quantification |
Presentation Title |
Natural Language Processing and Large Language Models for Automated Extraction of Materials Chemistry Data from Literature |
Author(s) |
Taylor D. Sparks, Sterling Baird, Hasan Sayeed, Ramsey Issa |
On-Site Speaker (Planned) |
Taylor D. Sparks |
Abstract Scope |
The lack of machine-readable materials data in the academic scientific literature remains one of the biggest challenges limiting materials informatics approaches. Organizing materials data repositories is currently exceedingly difficult because of the prevalent use of PDFs. Previous attempts to encourage materials chemists to adopt machine-readable formats have proven unsuccessful, and in response several information extraction tools have emerged, including rules-based parsing, small language models, and fine-tuned large language models (LLMs). These tools suffer from poor accuracy and require tedious hand labeling. Our proposed solution takes advantage of recent advances in natural language processing (NLP) and combines them with the Pauling File, a highly curated and hand-labeled source of materials data, to bypass the need for additional hand labeling. We seek to empower chemists to continue writing papers as they always have while harnessing NLP and LLMs to automatically extract and organize crucial materials data. |
Proceedings Inclusion? |
Planned: |
Keywords |
Machine Learning, ICME, Modeling and Simulation |