Research published recently in PNAS has demonstrated that machine learning approaches originally developed for analyzing languages, and used by major companies such as Netflix, Amazon, and Facebook to improve customer experiences, can be applied to uncover the molecular principles behind biomolecular condensate formation, a process that is implicated in a vast range of diseases from neurodegenerative disorders to cancer.
The research is first-authored by 2019 Fellow Kadi Liis Saar and supported by Schmidt Science Fellows through Kadi’s Fellowship Research Placement at the University of Cambridge.
Kadi used the support from Schmidt Science Fellows to pivot into molecular machine learning, working with Alpha Lee and Tuomas Knowles at Cambridge. Kadi said: “This groundbreaking study now could lead in the future to the ability to correct the grammatical mistakes inside cells that cause disease.”
Kadi used machine-learning technology often employed by online consumer and machine translation companies to train a large-scale language model to look at what happens when something goes wrong with proteins inside the body to cause disease.
She said: “The human body is home to thousands and thousands of proteins and scientists don’t yet know the function of many of them. We asked a neural network based language model to learn the language of proteins. We specifically asked the programme to learn the language of shapeshifting biomolecular condensates – droplets of proteins found in cells – that scientists really need to understand to crack the language of biological function and malfunction that cause cancer and neurodegenerative diseases like Alzheimer’s. We found it could learn, without being explicitly told, what scientists have already discovered about the language of proteins over decades of research.”
Proteins are large, complex molecules that play many critical roles in the body. They do most of the work in cells and are required for the structure, function and regulation of the body’s tissues and organs. Antibodies, for example, are proteins that protect the body.
Alzheimer’s, Parkinson’s and Huntington’s diseases are three of the most common neurodegenerative diseases, but scientists believe there are several hundred.
“We fed the algorithm all of the data held on the known proteins so it could learn and predict the language of proteins in the same way these models learn about human language and how WhatsApp knows how to suggest words for you to use.”
“Then we were able to ask it about the specific grammar that leads only some proteins to form condensates inside cells. It is a very challenging problem and unlocking it will help us learn the rules of the language of disease.”
Machine-learning technology is developing at a rapid pace owing to the growing availability of data, increased computing power, and technical advances that have produced more powerful algorithms.
Kadi is currently supported by Schmidt Science Fellows through an Additional Study Grant to continue her work. She commented on the support the Fellowship has provided to her work: “I am incredibly excited about the potential that molecular machine learning and natural language processing based methods in particular can introduce to the field of biomolecular condensates. The support of the Schmidt Science Fellowship programme and its network has been of key importance in my journey of bringing these areas together.”