About this Abstract |
Meeting |
2024 TMS Annual Meeting & Exhibition
|
Symposium
|
AI/Data Informatics: Computational Model Development, Verification, Validation, and Uncertainty Quantification
|
Presentation Title |
Not as Simple as We Thought: A Rigorous Examination of Data Aggregation in Materials Informatics |
Author(s) |
Taylor D. Sparks, Federico Ottomano, Giovanni De Felice, Vladimir Gusev |
On-Site Speaker (Planned) |
Taylor D. Sparks |
Abstract Scope |
Recent developments in machine learning (ML) have opened new possibilities for materials research. However, because ML estimators are statistical in nature, their performance depends heavily on the quality of the training data, which in materials informatics is severely limited and fragmented. Here, we investigate whether state-of-the-art ML models for property prediction can benefit from aggregating different datasets. We probe three aggregation strategies that prioritize training size, element diversity, and composition diversity, using novelty scores from the DiSCoVeR algorithm. Surprisingly, our results consistently show that both simple and refined aggregation strategies reduce performance, which suggests caution when merging different experimental data sources. To guide how the training set is enlarged, we also compare selection with DiSCoVeR, which prioritizes chemical diversity, against random selection. Our results show that targeting novel chemistries is not beneficial when building a training dataset. |
Proceedings Inclusion? |
Planned: |
Keywords |
Machine Learning, Computational Materials Science & Engineering |
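The comparison described in the abstract scope (enlarging a training set with novelty-ranked candidates versus randomly chosen ones) can be illustrated with a minimal sketch. It is not the authors' pipeline and does not call DiSCoVeR. The function names `novelty_scores`, `select_indices`, and `aggregate` are hypothetical, and the novelty measure (distance to the nearest point already in the base set, in some numeric feature space) is an assumed stand-in for DiSCoVeR's novelty scores.

```python
import numpy as np


def novelty_scores(base, pool):
    """Stand-in novelty score (an assumption, not DiSCoVeR):
    distance from each candidate in `pool` to its nearest neighbour in `base`."""
    # Pairwise Euclidean distances, shape (n_pool, n_base).
    dists = np.linalg.norm(pool[:, None, :] - base[None, :, :], axis=2)
    return dists.min(axis=1)


def select_indices(base, pool, n_add, strategy, seed=0):
    """Pick `n_add` rows of `pool` to merge into `base`.

    strategy="novelty": take the candidates farthest from the base set.
    strategy="random":  take a uniform random sample without replacement.
    """
    if n_add > len(pool):
        raise ValueError("n_add exceeds the number of candidates in pool")
    if strategy == "novelty":
        scores = novelty_scores(base, pool)
        # Sort by descending score; the stable sort keeps ties in pool order.
        return np.argsort(-scores, kind="stable")[:n_add]
    if strategy == "random":
        rng = np.random.default_rng(seed)
        return rng.choice(len(pool), size=n_add, replace=False)
    raise ValueError(f"unknown strategy: {strategy!r}")


def aggregate(base, pool, n_add, strategy, seed=0):
    """Return the enlarged feature matrix and the indices that were added."""
    idx = select_indices(base, pool, n_add, strategy, seed=seed)
    return np.vstack([base, pool[idx]]), idx


if __name__ == "__main__":
    # Toy 2-D features, purely illustrative (not real materials data).
    base = np.array([[0.0, 0.0], [1.0, 0.0]])
    pool = np.array([[0.0, 0.1], [5.0, 5.0], [1.0, 0.2], [-3.0, 0.0]])
    for strategy in ("novelty", "random"):
        merged, idx = aggregate(base, pool, n_add=2, strategy=strategy)
        print(strategy, sorted(idx.tolist()), merged.shape)
```

In a study of this kind, the same model would then be trained on each enlarged set and evaluated on a fixed held-out test set, so that any change in error can be attributed to the selection strategy rather than to the split.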