Update to Scikit-Mol: the power of community and open-source
We’ve just updated Scikit-Mol[1] to version 0.3.0. Scikit-Mol was covered in a previous blog post. The big news in this update is support for pandas output (and input), and the way it came about is a beautiful illustration of the power of the open-source approach.
It all started when I was contacted by Enrico Gandini from Italy. He made me aware that Scikit-Mol did not support the Pandas output feature that became available with Scikit-Learn 1.2. Not only did he make me aware of a cool feature I didn’t know about in Scikit-Learn, but he also helped fix the issue in Scikit-Mol.
This feature is super powerful. By changing a global setting in Scikit-Learn, or by configuring an individual transformer object, transformers begin to output Pandas DataFrames instead of NumPy arrays, which makes it possible to track features by name through the pipeline. Pandas input also allows you to select which columns each transformer processes. A pipeline can, for example, take a Pandas DataFrame with existing or precomputed features and use its molecule column to supplement the DataFrame with new descriptors before the final Scikit-Learn model.
Let me quickly demonstrate it with Scikit-Mol.
    !pip install --upgrade scikit-mol

    import scikit_mol
    print(f"Scikit-Mol version: {scikit_mol.__version__}")

    import sklearn
    print(f"Scikit-Learn version: {sklearn.__version__}")

    Scikit-Mol version: 0.3.0
    Scikit-Learn version: 1.3.1
    import pandas as pd

    data = pd.DataFrame({"Smiles": ["COC", "CCN", "Cc1ccccc1"]})
    data
| | Smiles |
|---|---|
| 0 | COC |
| 1 | CCN |
| 2 | Cc1ccccc1 |
    from scikit_mol.conversions import SmilesToMolTransformer
    from scikit_mol.descriptors import MolecularDescriptorTransformer
    from sklearn.pipeline import make_pipeline

    # Convert SMILES strings to RDKit molecules, then compute the listed descriptors.
    descriptors_pipeline = make_pipeline(
        SmilesToMolTransformer(),
        MolecularDescriptorTransformer(
            desc_list=[
                "MolWt",
                "qed",
                "MolLogP",
                "NumAromaticRings",
                "NumHAcceptors",
                "NumHDonors",
            ]
        ),
    )

    # Ask every step in the pipeline to return Pandas DataFrames.
    descriptors_pipeline.set_output(transform="pandas")
    descriptors_pipeline.transform(data)
| | MolWt | qed | MolLogP | NumAromaticRings | NumHAcceptors | NumHDonors |
|---|---|---|---|---|---|---|
| 0 | 46.069 | 0.380040 | 0.26260 | 0.0 | 1.0 | 0.0 |
| 1 | 45.085 | 0.406237 | -0.03500 | 0.0 | 1.0 | 1.0 |
| 2 | 92.141 | 0.458806 | 1.99502 | 1.0 | 0.0 | 0.0 |
This really makes it easier to see which column is which feature. For a full example, take a look at Enrico’s excellent notebook in the project’s repository: https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/10_pipeline_pandas_output.ipynb. It does a very nice job of illustrating how easy it is to expand upon existing features and to perform feature importance analysis when the features are labeled in the DataFrames.
I think this is a beautiful example of the open-source approach. We improved the project, I learned something, and I believe Enrico also learned something, or at least had fun, and got his own need met. All of us benefit 🙂 Big kudos to Enrico for working enthusiastically on the pull request and contributing.
PLEASE NOTE: We also changed the default output of the transformers to be more Scikit-Learn-like. Previously, the transformers mapped lists to lists (or arrays); now they always output 2D data (as NumPy arrays or Pandas DataFrames), which aligns more closely with the API of Scikit-Learn transformers. This change may break some existing Scikit-Mol scripts, but calling `.flatten()` on the array output is a quick fix (or you can pin Scikit-Mol to version 0.2.1 by running `pip install scikit-mol==0.2.1` until you have time to address it). The transformers now also accept 2D input (as Scikit-Learn requires), and we have retained support for 1D lists as well. Most of the transformers accept only a single column containing either SMILES strings or RDKit molecule objects, but the ColumnTransformer from Scikit-Learn can be used to select the relevant SMILES or Mol column in a pipeline.
Let me know in the comments below if you find a useful or cool application of Scikit-Mol or of this new feature.
Happy Machine Learning and Molecular Modeling,
/Esben
References
[1] Bjerrum, Esben Jannik, Rafał Adam Bachorz, Adrien Bitton, Oh-hyeon Choung, Ya Chen, Carmen Esposito, Son Viet Ha, and Andreas Poehlmann. “Scikit-Mol Brings Cheminformatics to Scikit-Learn.” ChemRxiv, December 6, 2023. https://doi.org/10.26434/chemrxiv-2023-fzqwd.