About this Abstract |
Meeting |
TMS Specialty Congress 2024 |
Symposium |
2nd World Congress on Artificial Intelligence in Materials & Manufacturing (AIM 2024) |
Presentation Title |
Not as Simple as We Thought: A Rigorous Examination of Data Aggregation in Materials Informatics |
Author(s) |
Taylor D. Sparks, Federico Ottomano, Giovanni De Felice, Vladimir Gusev |
On-Site Speaker (Planned) |
Taylor D. Sparks |
Abstract Scope |
Recent developments in machine learning have opened new possibilities for materials research. However, because machine learning estimators are statistical in nature, their performance depends heavily on the quality of the training datasets, which are severely limited and fragmented in materials informatics. Here, we investigate whether state-of-the-art machine learning models for property prediction can benefit from the aggregation of different datasets. We probe three aggregation strategies that prioritize training size, element diversity, and composition diversity, using novelty scores from the DiSCoVeR algorithm. Surprisingly, our results consistently show that both simple and refined data aggregation strategies reduce predictive performance. This suggests caution when merging different experimental data sources. To guide how the training set is enlarged, we compare DiSCoVeR, which prioritizes chemical diversity, with random selection. Our results show that targeting novel chemistries is not beneficial when building a training dataset. |
Proceedings Inclusion? |
Definite: Other |
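
The comparison described in the abstract scope, enlarging a training set either by random selection or by prioritizing chemically novel compositions, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the toy compositions are not from the study, and the novelty score is a simple stand-in (L1 distance in element-fraction space to the nearest composition already in the base set), not the DiSCoVeR algorithm itself.

```python
import random


def element_fractions(counts):
    """Convert element counts, e.g. {"Fe": 2, "O": 3}, to atomic fractions."""
    total = sum(counts.values())
    return {el: n / total for el, n in counts.items()}


def distance(a, b):
    """L1 distance between two element-fraction dictionaries."""
    elements = set(a) | set(b)
    return sum(abs(a.get(el, 0.0) - b.get(el, 0.0)) for el in elements)


def novelty(candidate, base):
    """Stand-in novelty score (NOT DiSCoVeR): distance to the nearest base entry."""
    return min(distance(candidate, b) for b in base)


def select_random(candidates, k, seed=0):
    """Pick k candidates uniformly at random (the baseline strategy)."""
    return random.Random(seed).sample(candidates, k)


def select_novel(candidates, base, k):
    """Pick the k candidates that are most novel with respect to the base set."""
    ranked = sorted(candidates, key=lambda c: novelty(c, base), reverse=True)
    return ranked[:k]


# Toy data, purely illustrative.
base = [element_fractions(c) for c in ({"Fe": 2, "O": 3}, {"Fe": 3, "O": 4})]
candidates = [
    element_fractions(c)
    for c in ({"Fe": 1, "O": 1}, {"Ti": 1, "O": 2}, {"Ga": 1, "N": 1})
]

random_pick = select_random(candidates, k=2)
novel_pick = select_novel(candidates, base, k=2)
```

In a study of this kind, each selected subset would be appended to the base training set and the same model retrained and evaluated on a fixed test set, so that the two selection strategies are compared at equal training size.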