Generative AI is revolutionizing the field of drug discovery, offering unprecedented opportunities to accelerate the development of new treatments. It is estimated that there are approximately $10^{60}$ possible compounds that could potentially be explored for drug development. This immense number presents a daunting task for scientists aiming to identify effective drug candidates. Traditionally, discovering and developing new drugs is a lengthy and costly process, often taking years of research and billions of dollars in investment.
Drug Discovery Pipeline
The drug discovery process typically follows a structured pipeline, which includes several key stages:
- Target Discovery: Identifying a biological target, such as a protein, that is associated with a disease. This process requires extensive research and data mining to find relevant targets that can be drugged.
- Hit Identification: In this phase, a large number of compounds are screened against the identified target to find initial candidates, referred to as “hits.” This can be done through high-throughput screening or virtual screening methods.
- Hit to Lead: The hits are further refined to identify potential lead compounds. This involves filtering down from thousands of candidates to a smaller, more manageable number based on activity and other properties.
- Lead Optimization: The selected lead compounds undergo optimization to enhance their properties, such as bioactivity and selectivity, while minimizing side effects.
Machine learning (ML) can support this pipeline in several ways:
- Accelerating the Design-Make-Test-Analyze Cycle: ML can speed up the design and analysis phases of the drug discovery pipeline, reducing the time and cost associated with traditional methods.
- Generative Models: ML can generate new ideas for drug candidates, providing medicinal chemists with a wider range of options to explore.
- Reducing Costly Resources: ML can help prioritize experiments and reduce the lab’s use of materials and time.
Molecular Representation with Graph Neural Networks (GNN)
GNNs are used in drug discovery because they naturally represent molecules as graphs, capture complex chemical and biological interactions, and predict crucial molecular properties. They also enable the optimization and discovery of new therapeutic molecules, accelerating drug research and development while reducing costs.
Molecular Graph Representation
A molecule can be represented as a graph $G = (V, E)$, where:
- V is the set of “nodes”, representing the atoms,
- E is the set of “edges”, representing the bonds between atoms.
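As a small illustration of this graph view, the sketch below writes a methyl group ($CH_3$) as a node list and an edge list, then derives an adjacency list from them. The node indices and the helper name `build_neighbors` are choices made for this example, not part of any particular library.

```python
# Methyl group CH3 as a graph with nodes V (atoms) and edges E (bonds).
# Node indices are arbitrary labels chosen for this illustration.
atoms = ["C", "H", "H", "H"]        # V: one entry per atom
bonds = [(0, 1), (0, 2), (0, 3)]    # E: undirected C-H bonds

def build_neighbors(num_nodes, edges):
    """Turn an undirected edge list into an adjacency list."""
    neighbors = {i: [] for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors

neighbors = build_neighbors(len(atoms), bonds)
print(neighbors)  # {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The adjacency list is the structure a GNN layer walks over when it gathers information from neighboring atoms.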
Feature Encoding: One-Hot Encoding
To represent atom and bond types numerically, “One-Hot Encoding” is used. Each category of atoms or bonds is represented as a binary vector in which only the position corresponding to the category is active (1), and all other positions are inactive (0). This allows the molecular structures to be transformed into numerical formats that GNNs can process.
Example of One-Hot Encoding: If we consider $CH_3$, it can be represented as:
$CH_3$: C: [1, 0, 0]; H: [0, 1, 0], [0, 1, 0], [0, 1, 0]
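The encoding can be sketched in a few lines of Python. The three-entry vocabulary `["C", "H", "O"]` is an assumption made for this example: the text only fixes the positions of C and H, so the third category is a placeholder.

```python
# Assumed vocabulary: the example above fixes C and H; "O" is a placeholder
# for the third position.
ATOM_TYPES = ["C", "H", "O"]

def one_hot(symbol, vocabulary=ATOM_TYPES):
    """Return a binary vector with a single 1 at the symbol's index."""
    if symbol not in vocabulary:
        raise ValueError(f"unknown atom type: {symbol}")
    return [1 if symbol == entry else 0 for entry in vocabulary]

features = [one_hot(atom) for atom in ["C", "H", "H", "H"]]
print(features)  # [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

Each row of `features` is the initial feature vector of one atom in the molecule.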
Graph Convolution Layers
The first step in the learning process with a GNN is the application of “Graph Convolution Layers”. In this step, the model learns to represent and update the features of the nodes (atoms) and edges (bonds) by aggregating information from neighboring nodes.
Each node (atom) starts with an initial feature vector $x_i$, which serves as its first embedding, $h_i^{(0)} = x_i$. At each convolution layer, the node updates its embedding from $h_i^{(k)}$ to $h_i^{(k+1)}$ using information from adjacent nodes. This process allows each node to learn not only its own features but also those of its neighbors.
The formula for updating the node in a convolution layer is as follows:
$$h_i^{(k+1)}=\sigma\left(\sum_{j\in N(i)}\frac{1}{c_{ij}}W^{(k)}h_j^{(k)}+b^{(k)}\right)$$
where:
- $h_i^{(k)}$ is the embedding of node $i$ at layer $k$,
- $N(i)$ is the set of neighbours of node $i$,
- $c_{ij}$ is a normalization factor that depends on the degrees of the nodes,
- $W^{(k)}$ is a weight matrix,
- $b^{(k)}$ is a bias term,
- $\sigma$ is an activation function.
This process is repeated over multiple layers, with each node aggregating information from increasingly distant nodes.
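A single layer of this update can be sketched in plain Python. This is a minimal illustration, not a reference implementation, and it rests on a few assumptions: each node sums the transformed embeddings of its neighbours, $c_{ij} = \sqrt{\deg(i)\deg(j)}$ is used as the normalization (one common choice), ReLU plays the role of $\sigma$, and the weights are hand-picked rather than learned.

```python
import math

def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def relu(v):
    """Element-wise ReLU, standing in for the activation sigma."""
    return [max(0.0, x) for x in v]

def graph_conv_layer(h, neighbors, W, b):
    """Apply one graph-convolution update to every node embedding in h."""
    new_h = []
    for i in range(len(h)):
        agg = [0.0] * len(b)
        for j in neighbors[i]:
            # Assumed normalization: sqrt(deg(i) * deg(j)).
            c_ij = math.sqrt(len(neighbors[i]) * len(neighbors[j]))
            message = matvec(W, h[j])
            agg = [a + m / c_ij for a, m in zip(agg, message)]
        # The bias is added once, after the neighbour sum.
        new_h.append(relu([a + bias for a, bias in zip(agg, b)]))
    return new_h

# Methyl group: carbon (node 0) bonded to three hydrogens (nodes 1-3).
neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
h0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for clarity
b = [0.0, 0.0, 0.0]

h1 = graph_conv_layer(h0, neighbors, W, b)
```

After one layer, the carbon's embedding carries the hydrogen signal and each hydrogen's embedding carries the carbon signal. Because the sum runs only over neighbours, a node's own features drop out of its new embedding; many practical variants add a self-loop to keep them.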
Graph Embeddings
After applying multiple convolution layers, each node has an embedding that contains information not only about its intrinsic atomic properties but also about how it is connected to the surrounding chemical structure. However, in drug discovery, the goal is to obtain a representation of the entire molecule, not just individual atoms.
To obtain the embedding of the entire molecule, the node embeddings are aggregated into a single vector representation via a process called “Read Out”. Common aggregation methods include:
– “Mean Pooling”: The graph representation is the mean of the node representations:
$$h_{graph}=\frac{1}{|V|}\sum_{i\in V}h_i$$
where $|V|$ is the number of nodes in the graph.
– Max Pooling: The maximum value is taken for each dimension of the embeddings across all nodes.
– Attention Pooling: An attention mechanism weighs the nodes differently, based on their importance, before aggregating them.
The output of this phase is an embedding vector representing the entire molecule.
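The three aggregation methods can be sketched as follows. The attention variant is one simple possibility among many: it scores each node by a dot product with a vector `q` and turns the scores into weights with a softmax. Here `q` is a hypothetical stand-in for a learned parameter.

```python
import math

def mean_pool(h):
    """Average the node embeddings dimension by dimension."""
    n = len(h)
    return [sum(column) / n for column in zip(*h)]

def max_pool(h):
    """Take the per-dimension maximum across all nodes."""
    return [max(column) for column in zip(*h)]

def attention_pool(h, q):
    """Softmax-weighted sum of node embeddings; q is a hypothetical learned vector."""
    scores = [sum(qk * x for qk, x in zip(q, node)) for node in h]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(h[0])
    return [sum(w * node[d] for w, node in zip(weights, h)) for d in range(dims)]

# Three nodes with two-dimensional embeddings (made-up values).
h = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
graph_mean = mean_pool(h)
graph_max = max_pool(h)                    # [3.0, 4.0]
graph_att = attention_pool(h, [0.0, 0.0])  # all-zero q gives equal weights
```

With an all-zero `q` every node receives the same weight, so attention pooling reduces to mean pooling; a trained `q` would instead emphasize the atoms most relevant to the prediction.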
Read Out (Aggregation of Representations)
The “Read Out” is the stage in which all the information learned from the nodes in the graph is combined into a single representation that can be used for final predictions. This step synthesizes the entire molecule into a vector, which can be used to predict properties such as toxicity, pharmacological activity, or interaction with a target.
Fully Connected Network (FCN)
Once the final graph embedding is obtained through the “Read Out”, the molecular representation is passed through a “Fully Connected Network (FCN)” to make the final prediction. This neural network consists of multiple fully connected layers, followed by non-linear activation functions, which learn how to map the graph embedding to a prediction.
The process of a Fully Connected Network involves:
– Input: The vector obtained from the Read Out.
– Fully connected layers: Each layer is a linear transformation followed by an activation function.
– Output: The final prediction, which can be:
  – Classification: For example, predicting toxicity or non-toxicity, or activity against a target.
  – Regression: For example, predicting a continuous property such as solubility or binding affinity.
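The forward pass of such a network can be sketched in plain Python. This is an illustration under stated assumptions: the layer sizes and weights are hand-picked rather than learned, ReLU is the hidden activation, and a sigmoid turns the output into a probability for the classification case. The names `dense` and `fcn_predict` are made up for this example.

```python
import math

def dense(x, W, b):
    """Linear transformation W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fcn_predict(h_graph, params, task="classification"):
    """Map a graph embedding to a prediction.

    params is a list of (W, b) pairs; the last pair is the output layer
    and is assumed to have a single output unit.
    """
    x = h_graph
    for W, b in params[:-1]:
        x = relu(dense(x, W, b))       # hidden layer: linear map + activation
    W_out, b_out = params[-1]
    z = dense(x, W_out, b_out)[0]      # raw output score
    return sigmoid(z) if task == "classification" else z

# Hand-picked weights, for illustration only.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 -> 2
    ([[1.0, 1.0]], [0.0]),                    # output layer: 2 -> 1
]
h_graph = [2.0, 1.0]  # stand-in for the vector produced by the Read Out

value = fcn_predict(h_graph, params, task="regression")            # 2.5
probability = fcn_predict(h_graph, params, task="classification")  # sigmoid(2.5)
```

The same layers serve both tasks; only the treatment of the final output differs, which is why classification and regression heads are often interchangeable on top of one graph embedding.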
Examples of the application of AI in drug discovery
The 3D structure of proteins: AlphaFold
To design effective drugs, it is essential to know the 3D structure of the target protein, since the drug must recognize and bind to it (for example, a receptor on the surface of the cell).
The challenge for researchers is to design drug molecules that bind only the targeted receptor without affecting other, similar proteins.
To address the structure problem, DeepMind developed AlphaFold, an AI program that has revolutionized structural biology by predicting protein structures with remarkable accuracy.
Proteins are made of long chains of amino acids, and each protein folds into a unique, complex 3D structure; determining a single structure experimentally can take several years and is expensive.
In 2020, AlphaFold achieved a breakthrough on this problem: using a deep learning model trained on the sequences and known structures of proteins in the Protein Data Bank, it can predict a protein's structure in minutes, to a remarkable degree of accuracy.
This helps researchers understand what individual proteins do and how they interact with other molecules.
AlphaFold relies on deep learning, specifically on an attention-based neural network architecture related to the “transformer”. At a high level, it works as follows:
- Input: The model takes as input the amino acid sequence of a protein.
- Prediction: It predicts the spatial arrangement of the atoms within the protein (that is, how the protein folds into its 3D structure).
- Learning Process: During training, AlphaFold studies known protein structures in the Protein Data Bank, learning the rules and patterns that govern protein folding.
AI & Data Analytics: Netabolics
Many startups are now combining big data, AI, and machine learning to automate data processing and solve complex problems much faster than traditional data analysis methods.
One example is the Italian startup Netabolics, which has taken up the challenge of digitizing human cells in order to predict the effects of newly developed drugs on cellular biology.
The AI-based platform created by Netabolics predicts, in real time, the biological changes occurring inside any cell type in response to pharmacological, genetic, and environmental factors.
Even though these predictions are not always perfect, they can inform decision-making in drug discovery, accelerating progress towards safe and effective medicines for patients.
The technology combines enzyme and receptor kinetics, principles of physics and biology, and deep reinforcement learning (a branch of ML) to automate the modeling of biological systems and create pharmacological simulations; in this way, Netabolics supports drug discovery by identifying the biological targets of diseases more effectively.
Netabolics uses deep learning models to analyze large datasets of biological interactions, chemical properties of compounds, and clinical trial data to:
- Predict which drug candidates are most likely to succeed in later-stage clinical trials.
- Identify potential biomarkers of drug response or patient subtypes that might benefit from specific treatments.
- Reduce the time and cost involved in testing drug candidates by predicting outcomes based on prior data.