A deeper look into chemical space with neural autoencoders
In the last blog post the battle-tested principal component analysis (PCA) was used as a dimensionality reduction tool. This time we'll take a deeper look into chemical space using a deep learning neural autoencoder, one of the newer neural-network-based tools that have shown promising results. Fundamentally it's based on the autoencoder idea: a network of neurons compresses the information in the input layer down to a reduced size at a bottleneck and then expands the compressed information again. The network is trained to reconstruct the input as closely as possible. The hope is that the information at the bottleneck will capture some essential features that best reflect the variation in the dataset, which can then be used for plotting/visualization, further QSAR modelling etc.
Recently such an approach was applied to molecular data, with the input being SMILES strings, by Rafael Gomez-Bombarelli et al. The single-line text molecular representations are processed through convolutional neural layers down to a bottleneck, and then expanded using gated recurrent unit (GRU) layers, which are efficient at generating strings. I've previously shown how recurrent networks with LSTM cells are efficient generators of SMILES strings, but this time the SMILES generation is controlled by the information compressed into the bottleneck layer. The architecture from the paper has already been recreated and released using Keras and Theano in Python, so it's straightforward to take it for a test drive.
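To get a feel for the idea before running the scripts, here is a heavily simplified conceptual sketch in Keras of a convolution-to-bottleneck-to-GRU autoencoder. It is not the released implementation (which is variational and has different layer sizes); the layer dimensions below are made up for illustration.

# Simplified conceptual sketch: convolutional encoder -> small bottleneck -> GRU decoder
# for one-hot encoded SMILES. Layer sizes are illustrative only.
from keras.models import Model
from keras.layers import Input, Conv1D, Flatten, Dense, RepeatVector, GRU, TimeDistributed

max_len, charset_size, latent_dim = 120, 35, 3   # 35 is a made-up charset size

inp = Input(shape=(max_len, charset_size))
h = Conv1D(9, 9, activation='relu')(inp)         # convolutions along the character axis
h = Conv1D(9, 9, activation='relu')(h)
h = Flatten()(h)
z = Dense(latent_dim, activation='relu', name='bottleneck')(h)   # the 3-unit bottleneck

h = Dense(100, activation='relu')(z)
h = RepeatVector(max_len)(h)                     # feed the latent code to every timestep
h = GRU(100, return_sequences=True)(h)           # GRU layer generates the character sequence
out = TimeDistributed(Dense(charset_size, activation='softmax'))(h)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='categorical_crossentropy')
autoencoder.summary()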
The implementation is provided as a couple of Python scripts taking arguments from the command line. I had some issues with the combined “download and convert to hdf5 format” script, so I did this step manually
wget http://zinc.docking.org/db/bysubset/13/13_p0.smi.gz
import pandas

filename = '13_p0.smi.gz'
outfile = 'data/zinc13.h5'

# Read the first 50,000 entries (SMILES and ZINC ID) and store them as hdf5
df = pandas.read_csv(filename, delimiter=' ', compression='gzip',
                     names=['structure', 'ZincID'], nrows=50000)
df.to_hdf(outfile, 'table', format='table', data_columns=True)
The preprocessing script splits the dataset into a train and a test set, pads the SMILES strings to 120 characters, builds a charset and one-hot encodes the characters using the custom charset. This expands each string into a 120 x charset-length bit matrix, which is used as input to the convolutional neural layers.
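For illustration, here is a minimal sketch of what the padding and one-hot encoding amounts to, with a tiny made-up charset; the real script builds the charset from the characters present in the dataset.

import numpy as np

charset = [' ', '(', ')', '1', '2', '=', 'C', 'N', 'O', 'c', 'n']   # made-up mini charset
char_to_int = {c: i for i, c in enumerate(charset)}
max_len = 120

def one_hot_encode(smiles, max_len=max_len):
    padded = smiles.ljust(max_len)                        # pad with spaces to 120 characters
    matrix = np.zeros((max_len, len(charset)), dtype=np.int8)
    for i, char in enumerate(padded):
        matrix[i, char_to_int[char]] = 1                  # one bit set per character position
    return matrix

print(one_hot_encode('c1ccccc1O').shape)                  # (120, 11): 120 x charset length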
The network is then trained with the train script. I chose to reduce the bottleneck to only 3 units to force a high compression; the default is 292. This will limit the maximum accuracy and also the quality of the output, but may be better suited for plotting and introspection of the learned features.
python train.py data/preprocessed_50k.h5 model_50k.h5 --epochs 200 --latent-dim 3
With only 50k samples the number of epochs needs to be high, but the low number of units in the bottleneck protects against overfitting. After around 150 epochs the validation loss as well as the training loss plateaued at around 2.5 and the reconstruction accuracy at ~0.85. This is not as good as the values that can be obtained with a higher number of latent dimensions, which will cause some trouble as we will see later.
Using the sample.py script the autoencoder can be tested. It can either target the encoder, so that from a preprocessed data file it will output the values at the bottleneck for each molecule in the test set, or alternatively the decoder can be given some values and the output will be the constructed SMILES. The default is to target the whole autoencoder so the input SMILES can be compared with the output. By default it only takes the first molecule, but it was easy to change the Python code. The first line of each pair below is the input SMILES, and the second is the one generated after being squeezed through a bottleneck of just three numbers.
python sample.py data/preprocessed_50k.h5 model_50k_3dim.h5 --latent_dim 3
Cc1ccc(cc1)c2c(sc(n2)NC(=O)COC)C
Ccccccccc1)c2ccc(cc2)C)C(=O)[O-]

c1ccc(c(c1)C(=O)Nc2cc(cc(c2)Cl)Cl)O
C1ccccccc1)C(=O)Nc2ccc(cc2)C(F)(F)F

Cc1c(sc(n1)NC(=O)[C@@H]2C[C@H]3CC[C@@H]2C3)C(=O)C
C11cccccc1)NC(=O)[C@@H]CCCC@@]]CCCCCC))CCCCCO)CC)
The network seems to put a great deal of focus on the length of the SMILES strings. This is not surprising when one considers that the strings were padded with spaces to a length of 120, so by getting the padded spaces right the loss function can easily be lowered significantly. The autoencoder also seems to put a lot of emphasis on getting the position of the ring-closure numbers right, which is kind of cool as this reflects the ring topology of the molecules. However, it is the index in the string that is recreated, not the number of atoms between the ring closures. This can for example be seen in pair two, where the benzene ring at the beginning of the input is turned into a ring of 7 aromatic and 1 aliphatic carbon, while the positions of the ring-closure numbers are retained. The network still gets about 7 out of 10 characters correct, matching the accuracy of 0.85 observed during training. In some places the network correctly reconstructs local features, and it's kind of interesting how the chlorine atoms in pair two are reconstructed as a trifluoromethyl group, so is there some latent concept of halogens coded into the network after training? I'm just speculating here.
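To put a number on this kind of eyeballing, here is a little helper of my own (not part of the repository) that compares input and reconstruction character by character; note that the space padding up to 120 characters counts towards the accuracy reported during training.

def char_accuracy(original, reconstructed, pad_to=120):
    # Compare the two strings position by position, including the space padding
    a = original.ljust(pad_to)
    b = reconstructed.ljust(pad_to)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(pad_to)

print(char_accuracy('c1ccc(c(c1)C(=O)Nc2cc(cc(c2)Cl)Cl)O',
                    'C1ccccccc1)C(=O)Nc2ccc(cc2)C(F)(F)F'))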
On the downside, the branching never seems to come out right, and there is an excess of closing parentheses. The liberal use of aromatic carbon as what seems to be the default guess in rings also often prevents parsing of the SMILES strings into proper molecules in RDKit.
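A quick way to check this is to try to parse the reconstructions with RDKit, where MolFromSmiles returns None for strings that are not valid SMILES (a small sketch of my own):

from rdkit import Chem

reconstructed = ['Ccccccccc1)c2ccc(cc2)C)C(=O)[O-]',
                 'C1ccccccc1)C(=O)Nc2ccc(cc2)C(F)(F)F']
for smiles in reconstructed:
    mol = Chem.MolFromSmiles(smiles)                      # None if parsing fails
    print(smiles, '->', 'valid' if mol is not None else 'invalid')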
Overall, it's however a surprisingly efficient round trip from SMILES string to three numbers and back to a SMILES-like string!
Let's try out the encoder
python sample.py data/preprocessed_50k.h5 model_50k_3dim.h5 --latent_dim 3 --target encoder --save_h5 traindata_encoded.h5
After fiddling around with some custom code to ensure that the DHFR dataset I used last time for the PCA was one-hot encoded with the same charset as the training data, the encoder can be run on that dataset as well:
python sample.py data/DHFR_preprocessed.h5 model_50k_3dim.h5 --latent_dim 3 --target encoder --save_h5 DHFR_encoded.h5
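The plots can then be made by loading the two saved files and scatter-plotting the three latent coordinates, roughly as in this sketch (my own code; the dataset key inside the h5 files may differ, so inspect them with h5py first):

import h5py
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # registers the 3d projection

def load_latent(filename):
    # Grab the first dataset in the file; the actual key name may vary
    with h5py.File(filename, 'r') as f:
        return f[list(f.keys())[0]][:]

train_latent = load_latent('traindata_encoded.h5')
dhfr_latent = load_latent('DHFR_encoded.h5')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(train_latent[:, 0], train_latent[:, 1], train_latent[:, 2],
           s=2, alpha=0.2, label='ZINC training set')
ax.scatter(dhfr_latent[:, 0], dhfr_latent[:, 1], dhfr_latent[:, 2],
           s=10, c='red', label='DHFR set')
ax.legend()
plt.show()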
There are some interesting striations visible. However, I had hoped the DHFR dataset would be more focused in the plots, but it is more or less comparable to the plots obtained with the PCA algorithm in the previous blog post. Maybe it needs more tuning. On the other hand, the extra features such as steered sampling and molecular generation, as well as the creative potential that comes with the use of generative networks, make these neural networks very interesting to experiment with.
- Pros:
- Can handle very large datasets with batch training
- Can generate SMILES in the vicinity of an existing compound or interpolate between molecules in a test set (see the sketch after these lists)
- Opens up for interesting use of neural architecture to model and optimize molecular properties
- Non-linear compression may give more efficient and essential features
- Cons:
- Slow to train
- SMILES may not be well formed (a matter of tuning?)
- Focuses on reconstructing the structure of the SMILES string, not necessarily reflecting the chemistry
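As a final note on the interpolation point in the pros list above, this is all it takes in latent space; the intermediate vectors could then be handed to the decoder (presumably sample.py with --target decoder) to get SMILES "between" two molecules. The latent coordinates below are made up for illustration.

import numpy as np

z_start = np.array([0.2, -1.1, 0.7])     # hypothetical latent vector of molecule A
z_end = np.array([-0.5, 0.3, 1.4])       # hypothetical latent vector of molecule B

for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_start + t * z_end    # linear interpolation in the 3D latent space
    print(z)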