Building a simple QSAR model using a feed forward neural network in PyTorch
In my previous blog posts I’ve exclusively used Keras for my neural networks. Keras as a stand-alone project is no longer actively developed, but is instead an integrated part of TensorFlow, which is developed and open-sourced by Google. PyTorch is another framework for building artificial neural network models; it came out of Facebook’s AI research and was likewise open-sourced. In this blog post I show how to build a simple feed-forward neural network in PyTorch for making a quantitative structure-activity relationship (QSAR) model.
Download and prepare the data
ExcapeDB (https://solr.ideaconsult.net/search/excape/) was developed as part of the BigChem project. It’s a treasure trove of bioactivity data, and a nice feature is the possibility to download bulk datasets from the interface after setting search criteria. I downloaded the active molecules with an associated pXC50 value for the SLC6A4 gene: I set the Gene Symbol to SLC6A4, selected “Active” results in the interface, and downloaded the dataset in CSV format. The downloaded file was renamed to SLC6A4_active_excape_export.csv. SLC6A4 is a serotonin transporter that transports serotonin from the synaptic cleft back into the neurons. It’s a target of many antidepressants, which is likely why so many activity measurements have been collected for different compounds.
I’ll import some libraries for later use, and use pandas to load the CSV file and manage the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = pd.read_csv("SLC6A4_active_excape_export.csv")
data.head(1)
|   | Ambit_InchiKey | Original_Entry_ID | Entrez_ID | Activity_Flag | pXC50 | DB | Original_Assay_ID | Tax_ID | Gene_Symbol | Ortholog_Group | SMILES |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DFPSUOZWGQDKPI-COPLHBTANA-N | 45380180 | 6532 | A | 8.40012 | pubchem | 462683 | 9606 | SLC6A4 | 4061 | ClC=1C=C([C@]23[C@H]([C@@H]2CN(C)C)CNC3)C=CC1Cl |
Loading the CSV file gives us a dataframe with the information. The interesting columns for building the QSAR model in PyTorch are the structures, which are encoded in the SMILES column, and the pXC50, which is the negative log10 of the IC50 or EC50 value. Ki values would have been better, but this is what we have to work with.
I’ll use RDKit to convert the SMILES into molecular objects and to fingerprint the molecules, so we’ll start by converting the SMILES to RDKit molecular objects. The PandasTools module contains convenient functions for creating a molecule column. The Molecule column is rendered when shown in a Jupyter notebook, which is convenient when exploring the data.
from rdkit import Chem, DataStructs
from rdkit.Chem import PandasTools, AllChem

PandasTools.AddMoleculeColumnToFrame(data, 'SMILES', 'Molecule')
data[["SMILES", "Molecule"]].head(1)
|   | SMILES | Molecule |
|---|---|---|
| 0 | ClC=1C=C([C@]23[C@H]([C@@H]2CN(C)C)CNC3)C=CC1Cl | (rendered 2D depiction) |
As far as I know the ExcapeDB project used the Ambit toolkit, and some of its SMILES may not be sanitizable by RDKit. I’ll check whether any SMILES could not be converted by looking for None values in the Molecule column.
data.Molecule.isna().sum()
0
All the SMILES appear to be converted for this dataset. Good.
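Had any molecules failed to parse, a minimal way to drop the offending rows would be the following (a hypothetical snippet of my own, since none failed here):

# Hypothetical: only needed if some SMILES failed to convert
data = data[data.Molecule.notna()]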
Vectorizing the molecules
The Morgan fingerprint is a standard way of fingerprinting molecules, but others can be tried as well. Calculating the fingerprints with RDKit allows us to convert the molecules into vectors of numbers that can be used for modelling. It’s fast to compute with a little helper function, and as an example I’ll render the fingerprint of molecule number 1 as a small image.
def mol2fp(mol):
    # Hashed (counted) Morgan fingerprint with radius 2 and 4096 bits
    fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=4096)
    ar = np.zeros((1,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, ar)  # Resizes ar to the fingerprint length
    return ar

fp = mol2fp(Chem.MolFromSmiles(data.loc[1, "SMILES"]))
plt.matshow(fp.reshape((64, -1)) > 0)
Using the apply function of the pandas dataframe makes it easy to add a new column with the fingerprint vectors.
data["FPs"] = data.Molecule.apply(mol2fp)
Splitting and converting data to PyTorch tensors
PyTorch trains on PyTorch tensors, which are similar to NumPy arrays but with extra features such as the ability to be transferred to GPU memory. But first we’ll divide the dataset into train, validation and test sets using scikit-learn.
X = np.stack(data.FPs.values)
print(X.shape)
(7228, 4096)
y = data.pXC50.values.reshape((-1,1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.05, random_state=42)

# Normalizing output using standard scaling
scaler = StandardScaler()
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
y_validation = scaler.transform(y_validation)
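As a quick sanity check of my own (not in the original post), the scaled training targets should now have roughly zero mean and unit standard deviation:

# The training targets should be approximately standard normal after scaling
print(y_train.mean(), y_train.std())  # expect ~0.0 and ~1.0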
I’ll remove some of the bits with the lowest variance; they are either almost always on or very seldom set. The VarianceThreshold class from scikit-learn is handy for just that. Just as for the normalization, we’ll compute the variances using the train set only, and use them to remove the same features from all the data sets.
# We'll remove low variance features
from sklearn.feature_selection import VarianceThreshold

feature_select = VarianceThreshold(threshold=0.05)
X_train = feature_select.fit_transform(X_train)
X_validation = feature_select.transform(X_validation)
X_test = feature_select.transform(X_test)
X_train.shape
(6179, 219)
Seems like most of the bits were seldom used. The threshold is a hyper-parameter of the entire model pipeline and could be tuned, for example by probing a few candidate values as sketched below.
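A rough probe of my own (hypothetical; for a proper tuning run you would fit on the training split only and judge each threshold by the downstream model performance):

# Hypothetical probe: how many bits survive different variance thresholds?
for t in (0.01, 0.05, 0.10, 0.20):
    n_kept = VarianceThreshold(threshold=t).fit(X).get_support().sum()
    print(t, n_kept)

With the feature selection in place, we’ll move the datasets to the GPU and convert them to PyTorch tensors.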
# Let's get those arrays transferred to the GPU memory as tensors
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
# If you don't have a GPU, buy a graphics card. I have for a long time used a 1060 GTX, which is not that expensive anymore.

X_train = torch.tensor(X_train, device=device).float()
X_test = torch.tensor(X_test, device=device).float()
X_validation = torch.tensor(X_validation, device=device).float()
y_train = torch.tensor(y_train, device=device).float()
y_test = torch.tensor(y_test, device=device).float()
y_validation = torch.tensor(y_validation, device=device).float()
X_train
cuda:0
tensor([[0., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [2., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 1., 1.]], device='cuda:0')
X_train.shape
torch.Size([6179, 219])
PyTorch works with datasets and data loaders to feed the mini-batches to the model during training, so we convert the data. It’s easy to create a dataset from the already created tensors. The batch size is set to 256, and the train loader will reshuffle the data at each epoch.
from torch.utils.data import TensorDataset

train_dataset = TensorDataset(X_train, y_train)
validation_dataset = TensorDataset(X_validation, y_validation)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=256,
                                           shuffle=True)
validation_loader = torch.utils.data.DataLoader(dataset=validation_dataset,
                                                batch_size=256,
                                                shuffle=False)
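To see what the loader actually yields (a small check of my own, not in the original post), we can pull a single mini-batch and inspect the shapes:

# Peek at one mini-batch: up to 256 fingerprints with matching labels
fps, labels = next(iter(train_loader))
print(fps.shape, labels.shape)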
Good, with the training data ready it’s time for the fun part: defining the network. In PyTorch this is done by subclassing nn.Module. The __init__ method sets up all the layers needed in the network, and the forward method defines how they are chained together, in a syntax very similar to the Keras functional API: tensor = layer(tensor). There’s no activation function on the final output; as this is a regression model, a linear output is just fine.
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_rate, out_size):
        super(Net, self).__init__()
        # Three hidden layers and an output layer
        self.fc1 = nn.Linear(input_size, hidden_size)  # 1st fully connected layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc_out = nn.Linear(hidden_size, out_size)  # Output layer
        # Layer normalization for faster training
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ln3 = nn.LayerNorm(hidden_size)
        # LeakyReLU will be used as the activation function
        self.activation = nn.LeakyReLU()
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        # Forward pass: Fully connected => LayerNorm => LeakyReLU => Dropout, times 3
        out = self.fc1(x)
        out = self.ln1(out)
        out = self.activation(out)
        out = self.dropout(out)

        out = self.fc2(out)
        out = self.ln2(out)
        out = self.activation(out)
        out = self.dropout(out)

        out = self.fc3(out)
        out = self.ln3(out)
        out = self.activation(out)
        out = self.dropout(out)

        # Final output layer (linear, since this is regression)
        out = self.fc_out(out)
        return out
With the network class defined we can set the hyperparameters and create the model.
# Defining the hyperparameters
input_size = X_train.size()[-1]  # The input size should fit our fingerprint size
hidden_size = 1024     # The size of the hidden layers
dropout_rate = 0.80    # The dropout rate
output_size = 1        # This is just a single task, so this will be one
learning_rate = 0.001  # The learning rate for the optimizer

model = Net(input_size, hidden_size, dropout_rate, output_size)
The model can be moved to the GPU with the .cuda() method inherited from the nn.Module class. Under the hood, nn.Module overrides __setattr__, so every layer assigned as an attribute in __init__ is automatically registered as a submodule; that is how the module discovers the layers that were set as properties. A small illustration follows after the printout below.
model.cuda()
Net(
  (fc1): Linear(in_features=219, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=1024, bias=True)
  (fc3): Linear(in_features=1024, out_features=1024, bias=True)
  (fc_out): Linear(in_features=1024, out_features=1, bias=True)
  (ln1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (ln2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (ln3): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (activation): LeakyReLU(negative_slope=0.01)
  (dropout): Dropout(p=0.8, inplace=False)
)
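To illustrate that registration (a quick snippet of my own, not from the original notebook), we can ask the model for the submodules nn.Module has discovered:

# The layers assigned in __init__ were automatically registered as submodules
for name, module in model.named_children():
    print(name, "->", type(module).__name__)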
For the loss function I’ll use the mean squared error, and Adam is chosen as the optimizer.
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Training is a bit more hands-on than in Keras. This gives the flexibility to train the network in different ways, but here we simply train on all the data. For each epoch, the running_loss is zeroed and the mini-batches are fetched from the training DataLoader. We start each mini-batch iteration by zeroing the gradients, which would otherwise accumulate across batches. The forward computations are recorded by autograd and used in the backward pass to compute the gradients the optimizer needs for updating the weights of the network. Every tenth epoch, some statistics are printed so that it’s possible to follow the training.
model.train()  # Ensure the network is in "train" mode with dropouts active
epochs = 200
for e in range(epochs):
    running_loss = 0
    for fps, labels in train_loader:
        # Training pass
        optimizer.zero_grad()  # Zero the gradients, which would otherwise accumulate
        output = model(fps)  # Forward pass of the mini-batch
        loss = criterion(output, labels)  # Computing the loss
        loss.backward()  # Calculate the backward pass
        optimizer.step()  # Optimize the weights
        running_loss += loss.item()
    else:
        if e % 10 == 0:
            validation_loss = torch.mean((y_validation - model(X_validation))**2).item()
            print("Epoch: %3i Training loss: %0.2F Validation loss: %0.2F" %
                  (e, running_loss/len(train_loader), validation_loss))
Epoch:   0 Training loss: 1.22 Validation loss: 0.94
Epoch:  10 Training loss: 0.54 Validation loss: 0.58
Epoch:  20 Training loss: 0.44 Validation loss: 0.50
Epoch:  30 Training loss: 0.40 Validation loss: 0.49
Epoch:  40 Training loss: 0.36 Validation loss: 0.47
Epoch:  50 Training loss: 0.34 Validation loss: 0.48
Epoch:  60 Training loss: 0.32 Validation loss: 0.47
Epoch:  70 Training loss: 0.31 Validation loss: 0.48
Epoch:  80 Training loss: 0.30 Validation loss: 0.46
Epoch:  90 Training loss: 0.28 Validation loss: 0.44
Epoch: 100 Training loss: 0.27 Validation loss: 0.45
Epoch: 110 Training loss: 0.27 Validation loss: 0.42
Epoch: 120 Training loss: 0.25 Validation loss: 0.41
Epoch: 130 Training loss: 0.25 Validation loss: 0.44
Epoch: 140 Training loss: 0.25 Validation loss: 0.47
Epoch: 150 Training loss: 0.23 Validation loss: 0.42
Epoch: 160 Training loss: 0.22 Validation loss: 0.45
Epoch: 170 Training loss: 0.23 Validation loss: 0.40
Epoch: 180 Training loss: 0.22 Validation loss: 0.41
Epoch: 190 Training loss: 0.21 Validation loss: 0.41
Before using the model for prediction, it must be set to evaluation mode, where the dropout layers are no longer active. We can then calculate the predictions for the training set, validation set and external test set.
model.eval()  # Switch to evaluation mode, where dropout is switched off
y_pred_train = model(X_train)
y_pred_validation = model(X_validation)
y_pred_test = model(X_test)
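As a side note (my addition, not in the original post), inference is typically also wrapped in torch.no_grad() so that autograd doesn’t track the computations, which saves memory:

# Optional: skip gradient tracking during inference
with torch.no_grad():
    y_pred_test = model(X_test)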
The mean squared error can be calculated directly with PyTorch operators. The .item() method converts single-element tensors to Python scalars for printing.
torch.mean(( y_train - y_pred_train )**2).item()
0.1252715289592743
torch.mean(( y_validation - y_pred_validation )**2).item()
0.345264732837677
torch.mean(( y_test - y_pred_test )**2).item()
0.3195636570453644
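These are mean squared errors in the standard-scaled units. If the root mean squared error is wanted instead, it’s just a square root away; a one-line addition of my own:

# RMSE on the test set, still in standard-scaled units
torch.sqrt(torch.mean((y_test - y_pred_test)**2)).item()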
The error on the held-out test set is actually lower than on the validation set, although both are much worse than on the training set. This indicates overfitting, even though the dropout rate was set as high as 0.8.
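Since the validation loss flattens out around epoch 110-130 while the training loss keeps dropping, one common countermeasure is early-stopping-style checkpointing: save the weights whenever the validation loss improves and restore the best set afterwards. A minimal sketch of my own (not part of the original post; the file name best_model.pt is arbitrary):

# Sketch: keep the weights from the epoch with the lowest validation loss
best_val = float("inf")
for e in range(epochs):
    model.train()
    for fps, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(fps), labels)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        validation_loss = criterion(model(X_validation), y_validation).item()
    if validation_loss < best_val:  # Checkpoint on improvement
        best_val = validation_loss
        torch.save(model.state_dict(), "best_model.pt")
model.load_state_dict(torch.load("best_model.pt"))  # Restore the best weights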
Finally, let’s evaluate the model visually. Before that is possible, the tensors need to be moved to the CPU and detached from the computational graph before we can convert them to NumPy arrays and flatten them.
def flatten(tensor):
    return tensor.cpu().detach().numpy().flatten()
plt.scatter(flatten(y_pred_test), flatten(y_test), alpha=0.5, label="Test")
plt.scatter(flatten(y_pred_train), flatten(y_train), alpha=0.1, label="Train")
plt.legend()
plt.plot([-1.5, 1.5], [-1.5, 1.5], c="b")
If we want to predict an arbitrary molecule, we have to remember to apply the VarianceThreshold object to the fingerprint and also to reverse the scaling of the prediction from the feed-forward neural network. The scikit-learn objects are already fit and can be reused here.
def predict_smiles(smiles):
    fp = mol2fp(Chem.MolFromSmiles(smiles)).reshape(1, -1)
    fp_filtered = feature_select.transform(fp)
    fp_tensor = torch.tensor(fp_filtered, device=device).float()
    prediction = model(fp_tensor)
    # Reverse the standard scaling to get back to pXC50 units
    pXC50 = scaler.inverse_transform(prediction.cpu().detach().numpy())
    return pXC50[0][0]

predict_smiles('Cc1ccc2c(N3CCNCC3)cc(F)cc2n1')
8.663998
If you read this far, I hope you found it useful. There’s plenty to play around with to improve the model. The hyperparameters could be tuned (learning rate, batch size, size of the hidden layers, activation functions, dropout or L2 regularization, etc.). Experiments with other fingerprints could also be tried, and the feature-selection step could be varied.
Comments

I have worked through your excellent example of using PyTorch, got to the end and tried the prediction for the given SMILES string, but got the following error:
ValueError: X has a different shape than during fitting.
The shape of the fingerprint after reshaping is (4096, 1), and feature_select.transform(fp) is where the error happens. Any thoughts as to why?
Thanks for your interest and comment. Scikit-learn usually has samples as dimension 0 and features as dimension 1, so the array needs the shape (1, 4096). Check that the variable dimension comes second in your reshape call: .reshape(1, -1).
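A tiny illustration of the difference (my own snippet, not from the original thread):

import numpy as np
fp = np.zeros(4096)
print(fp.reshape(-1, 1).shape)  # (4096, 1): 4096 "samples" with 1 feature - wrong
print(fp.reshape(1, -1).shape)  # (1, 4096): 1 sample with 4096 features - right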
Pingback: A simple LSTM based QSAR model in PyTorch | Cheminformania
How come RDKit is working with Python 3?!
Not sure I understand your question correctly; RDKit has supported Python 3 for some time now (since around 2014, I think)?
Thank you for your blog, I have found it very interesting and informative. I would like to know, does RDKit calculate logD as a feature? I have been searching for it but can’t seem to find it. If so, which module/function in RDKit calculates it? If not, what could I use to calculate it?
Thanks for your question. Although this has nothing to do with the blog-post, I believe you may be looking for https://www.rdkit.org/docs/source/rdkit.Chem.Crippen.html. You could probably get a much more detailed answer by asking your question on the RDKit discuss mailing list: https://sourceforge.net/p/rdkit/mailman/
Thank you for such a nice blog. There is one part I don’t quite understand.
Regarding the validation_loader, I don’t see how and where you use its output for calculating the prediction metrics.
Ah, yes, I see why you may be confused. I don’t use the validation_loader in this example, as the dataset is already available as X_validation and y_validation and fits in memory, so instead of looping through mini-batches from the validation loader, I simply make the predictions in one go. If the dataset did not fit in memory during prediction, it would have to be done in mini-batches; and if you switched to a framework like PyTorch Lightning, you would need to provide the validation_loader from the data module.
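For completeness, a minimal sketch of how the validation_loader could be used for a mini-batched validation loss (my own sketch, not from the original notebook):

# Accumulate the validation loss over mini-batches, weighted by batch size
model.eval()
val_loss = 0.0
with torch.no_grad():
    for fps, labels in validation_loader:
        val_loss += criterion(model(fps), labels).item() * len(labels)
val_loss /= len(validation_dataset)
print(val_loss)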
Any suggestion on how I can tune it? The examples I find focus on image classification, and I am having some trouble transferring the procedure to this model.
Hi, I’ve been busy, so maybe you figured it out already. But there are a couple of packages out there that can help you with a bit more than simple random tuning or grid search. As examples there are Optuna and the derivative Qptuna, which is more targeted at molecule models. There are also hyperopt and gpyopt; I used the latter in this blog post https://www.cheminformania.com/tune-your-deep-tox-neural-network-with-free-tools/
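For what it’s worth, a minimal Optuna-style sketch for this model could look roughly like this (a hypothetical example of mine reusing the objects from the post; check the Optuna docs for the exact API of your version):

import optuna

def objective(trial):
    # Sample a few pipeline hyperparameters (ranges are illustrative)
    hidden_size = trial.suggest_categorical("hidden_size", [256, 512, 1024])
    dropout_rate = trial.suggest_float("dropout_rate", 0.2, 0.8)
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    model = Net(input_size, hidden_size, dropout_rate, 1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for e in range(50):  # Shorter training per trial
        for fps, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(fps), labels)
            loss.backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(X_validation), y_validation).item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)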
Thanks! I will check it out!
Hi, I really enjoyed this tutorial, thanks! I am wondering if you have the code somewhere available to download?
No, I don’t have ready-to-run code for download; this is meant as inspiration for your own scripts.