Quantum mechanical molecular ‘fingerprints’ solve machine learning mystery

There is more than one way to describe a water molecule, especially when communicating with a machine learning (ML) model, says chemist Robert DiStasio. You can feed the algorithm the molecule’s structural information: two hydrogen atoms flanking an oxygen atom, with bonds of a certain length meeting at a certain angle.

Or you could use the molecule’s quantum mechanical information. That is, if you can package this complex information in a compact manner that is understandable to the ML algorithm. Cornell chemists have just discovered how.  

Their new method, Semi-Local Density Fingerprints (SLDFs), can predict molecular properties with up to 100 times more accuracy than the current most popular method for modeling molecules and materials.

In their quest to use ML to predict molecular properties, DiStasio, associate professor of chemistry and chemical biology in the College of Arts and Sciences, and members of his lab have found a way to encapsulate a molecule’s quantum mechanical information. Feeding that information into ML algorithms, rather than simpler structural information such as the identities and positions of the atoms, provides orders of magnitude greater accuracy than using Density Functional Theory (DFT), the current most popular method, on its own.

DiStasio is the corresponding author of “Learning Molecular Conformational Energies Using Semi-Local Density Fingerprints,” published in the Journal of Physical Chemistry Letters on Dec. 17.

“Most molecular descriptors used in machine learning have been based on some combination of the chemical composition and atomic positions,” DiStasio said. “I wondered, why not run an approximate quantum mechanical calculation on the molecule – of the sort that we do every day – and then use that wealth of information as a molecular descriptor?”

In the paper, DiStasio and colleagues build a unique molecular descriptor based on some straightforward quantum mechanical calculations using DFT. “It only takes a computer a few minutes to finish a DFT calculation,” said Zhuofan Shen, a doctoral student in chemistry and chemical biology and first author of the study. “We can then use the output from that calculation to build a molecular fingerprint that contains valuable quantum mechanical information.”

For a water molecule, the fingerprint that is fed into the ML model consists not of the atoms’ identities and positions, but rather of a compact representation of the molecule’s electron density – the likelihood of finding any one of its electrons at some particular location. There are ten electrons in a water molecule (one from each hydrogen atom and eight from the oxygen atom), and they are in constant motion, which gives the problem of calculating their interactions its quantum mechanical nature.
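The electron count quoted above is simple arithmetic over atomic numbers, and the electron density integrated over all space gives the same number. A minimal Python sketch of that count, assuming a neutral molecule by default; the `ATOMIC_NUMBER` table and the `electron_count` function are illustrative names, not taken from the paper:

```python
# Atomic numbers for a few light elements; a neutral atom has this many electrons.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "P": 15, "S": 16}

def electron_count(atoms, charge=0):
    """Total number of electrons for a list of element symbols and a net charge."""
    return sum(ATOMIC_NUMBER[symbol] for symbol in atoms) - charge

water = ["O", "H", "H"]
print(electron_count(water))  # 10
```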

In this study, the researchers used ML to predict conformational energies. A conformational energy is the difference between the energy of, say, a room temperature water molecule and that of a water molecule in a highly perturbed state. To do so, the researchers trained an ML model on a database containing thousands of conformational energies.
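As a rough illustration of that definition, a conformational energy can be computed by subtracting a reference conformer's energy from each conformer's energy. The sketch below is hypothetical: the function name, the conformer labels and the energy values are invented for illustration and do not come from the study's database.

```python
def conformational_energies(energies, reference):
    """Return each conformer's energy relative to the chosen reference conformer."""
    ref_energy = energies[reference]
    return {name: energy - ref_energy for name, energy in energies.items()}

# Made-up total energies (arbitrary units) for three shapes of the same molecule.
energies = {"relaxed": -100.0, "bent": -99.0, "stretched": -97.5}

relative = conformational_energies(energies, reference="relaxed")
print(relative)  # {'relaxed': 0.0, 'bent': 1.0, 'stretched': 2.5}
```

In a setup like this, the relative values would be the quantities an ML model is trained to predict.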

“Our goal in this work was to challenge the ML model. Instead of asking it to predict conformational energies for a molecule in the training set, we asked the model to predict conformational energies for a molecule it has never seen,” DiStasio said. “It can do that. And it can do that better than DFT.”

Because all molecules have electrons, SLDFs help the ML model extrapolate to molecules that were not encountered during training. For instance, this study trained an ML model on molecules containing atoms from the first two rows of the periodic table (atoms such as hydrogen, carbon, oxygen, and nitrogen), and it was able to predict conformational energies for molecules containing third-row atoms.

“How can ML make accurate predictions on molecules containing atoms from the third row of the periodic table, such as sulfur or phosphorus, if it has only ever seen molecules with atoms from the second row?” said Zachary Sparrow, postdoctoral research associate and co-author on the paper. “SLDFs solve that mystery by focusing on what really matters – the electrons.”
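The kind of extrapolation test described here can be pictured as a split of the data by periodic-table row: train only on molecules whose heaviest atom is in the first two rows, and hold out anything containing a third-row atom. The Python sketch below is an assumption about how such a split could be written, not the authors' code; the molecule list is illustrative and is not the dataset used in the paper.

```python
# Periodic-table row (period) for a few light elements.
PERIOD = {"H": 1, "C": 2, "N": 2, "O": 2, "F": 2, "P": 3, "S": 3, "Cl": 3}

def max_period(atoms):
    """Highest periodic-table row among the atoms of a molecule."""
    return max(PERIOD[symbol] for symbol in atoms)

def split_by_period(molecules, max_train_period=2):
    """Keep molecules up to max_train_period for training; hold out the rest."""
    train, test = {}, {}
    for name, atoms in molecules.items():
        target = train if max_period(atoms) <= max_train_period else test
        target[name] = atoms
    return train, test

# Illustrative molecules only.
molecules = {
    "water": ["O", "H", "H"],
    "methane": ["C", "H", "H", "H", "H"],
    "ammonia": ["N", "H", "H", "H"],
    "hydrogen sulfide": ["S", "H", "H"],
    "phosphine": ["P", "H", "H", "H"],
}

train, test = split_by_period(molecules)
print(sorted(train))  # ['ammonia', 'methane', 'water']
print(sorted(test))   # ['hydrogen sulfide', 'phosphine']
```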

Applicable to any system with electrons, SLDFs can be used in conjunction with ML models to address current questions in chemistry, physics and materials science. In chemistry, DiStasio’s group is now using this technique to predict reaction energies, barriers to reactions (related to reaction time), and molecular properties.

“We have demonstrated that molecular descriptors built from quantum mechanical information can extend the accuracy, reliability and transferability of ML,” DiStasio said. “I am excited to explore their potential in the design and discovery of new molecules and materials with targeted properties.”

Co-first authors of the study are Zhuofan Shen, doctoral student, and Yang Yang, Ph.D. ’22. Contributing authors include postdoctoral research associate Zachary Sparrow, Brian Ernst, Ph.D. ’21, research associate Trine Quady, Richard Kang ’19, Justin Lee ’21, Yan Yang, Ph.D. ’21, and Lijie Tu, Ph.D. ’19.

The study received financial support from the Camille and Henry Dreyfus Foundation, the Cornell Center for Materials Research, and an Alfred P. Sloan Research Fellowship, as well as computational resources from the U.S. Department of Energy. 

Image: A hand in a blue glove holding a beaker of clear liquid. Credit: RephiLe water/Unsplash. Caption: Cornell chemists have found a way to encapsulate a molecule’s quantum mechanical information so they can feed that – rather than simpler structural information – into ML algorithms, providing up to 100 times more accuracy than the current most popular method.