Cheminformatics in Excel part 2: RDKit4Excel
Previously I wrote about how to use the xlwings project to get RDKit functionality directly in Excel. Well, it wasn’t that directly as there were some intermediate VBA code that XLwings automatically added. It later turned out that there could be troubles with sharing workbooks and the continuous update of internal VBA for the workbooks. This gave some issues in deployment. Together with Jan Holst Jensen of Biochemfusion fame, I developed RDKit4Excel which skips the intermediate VBA and instead uses a python server as a programmable COM service.
Excel has the possibility to use programmable services from COM objects directly in formulas, which is the base for many add-ins and plugins. After proper registration of the COM service in the windows registry, the service can be added to Excel and the code executed directly from a formula in a cell. Sound convenient, but how is it done in Python?
This is possible in Python via the win32com module. If you search on Google for “win32com python excel” you will get a lot of hits from people trying to access Excel from within Python scripts. This can indeed be useful to copy out graphs, reading values and interact with the Excel workbooks. However the module can also be used the other way round: To program python class and register it as a COM service, which can then be used this from within Excel. The information about how to use the win32com module is mostly collected in the book “Python Programming On Win32”, whereas the online documentation seem more scattered and scarce.
Python as an Excel Add-In
In order to make the COM server, there is a need to define the interface in an interface definition library (.IDL) file. This file contains information about what functions are available in the COM service, what inputs are needed and what the function returns. There is some clash of programming paradigms here. Python has dynamically typed variables, so if I need a float I simply write a = 1.0, wheras an integer variable can be defined as a=1. This is convenient but can lead to some gotchas, as an example by doing integer division unintentionally in Python 2:
>>> 3/10 >>> 3/10.0 0.3
However, the variable types need to be defined in the IDL file otherwise Excel will not know what to do, so this can unfortunately not be fully automated. Instead we wrote a py2idl.py script, which looks in the servers python code for a special comment tag #RDKITXL: and uses the comment to parse the input and output variable types and names to include in the IDL file, which can then be automatically created. So to enable the server with a function that gets a smiles string and returns a molfile block (useful for depiction by other plugins), one would add the following to the com32 server class:
#RDKITXL: in:smiles:str, out:str def rdkit_smiles_to_molblock(self, smiles): self.rdkit_info_num_calls = self.rdkit_info_num_calls +1 # win32api.OutputDebugString(str(type(smiles))) smiles = dispatch_to_str(smiles) mol = Chem.MolFromSmiles(smiles) if mol != None: # Add coords for depiction. AllChem.Compute2DCoords(mol) return Chem.MolToMolBlock(mol) else: return 'ERROR: Cannot parse SMILES input.'
The comment must start with RDKITXL and contains information to be included in the IDL file. It defines that the function takes a string named smiles and returns a string. The rest is standard RDKit python code to convert a SMILES string to a mol and return the Molfile text as a string. We would have loved it to be fully automated by parsing the Python code, but as the variables are not declared as specific types, this information had to be included in the comment.
Autogeneration of IDL and registering
The RDKitXL_server.py contains both the server class “CRDKitXL” that will contain the public functions, but also the code necessary to call the autogeneration of the .IDL file as well as compile and register this as a typelib (.TLB). I especially like the line
genfile = generate_idl(__file__, generatefile="RDKitXL.idl")
where the python script is asked to parse itself.
So installation is pretty easy once the prerequisites are installed. When the wanted functions have been added to the server class and properly commented, simply register the server by running “python RDKitXL_server.py” from the command line as administrator. This is as simple as we could make it. Next step is much more complicated to describe, as the COM service must be made an Add-In in Excel via some clickety-click procedure through menu’s: Please refer to the install file for updated procedures.
RDKit in Excel!
Now the server functions can be accessed from Excel: As a showcase we added a dynamic function that takes a smiles string and a descriptor name from the RDKit descriptors module, calculates the descriptor and returns the value: = rdkit_descriptor(“CCCC”,”MolLogP”). This was used to build a small demo workbook illustrating how one could check if Lipinski’s Rule of Five is violated.