Presented By: Frontiers in Scientific Machine Learning (FSML)
FSML Lecture Series: Tokenization for Chemistry
Alex Wadell (University of Michigan)
Abstract:
Molecular foundation models are emerging as a powerful tool for molecular design, materials science, and cheminformatics. By leveraging the transformer architecture, these models attempt to learn the language of chemistry and discover robust molecular embeddings. However, current models are constrained by tokenizers that fail to capture the full breadth of chemical space, or even the full periodic table of elements. In his talk, Alex will introduce smirk, a new tokenizer for molecular foundation models that can represent the entirety of the OpenSMILES specification. He will also discuss performance metrics for tokenizers and present the results of his systematic evaluation of thirteen chemistry-specific tokenizers, using n-gram language models as a low-cost proxy for transformer models.