I often write code with python because it has lots of useful packages, documents and community. And first programming language which I learned is python.
So I've never wrote chemoinformatics code without python. But I have interested in coding with C++ / Rust because it works very fast.
Today, I tried to wrote code with C++ because I found great post about writing C++ chemoinformatics code from Greg! The URL is below.
https://greglandrum.github.io/rdkit-blog/tutorial/technical/2021/07/24/setting-up-a-cxx-dev-env.html
Computational cost of Finding Maximum common substructure(MCS) is high. RDKit has useful function FindMCS which is developed by Andrew Dalke. It's really cool function and I wonder if the code works faster when I run it from C++.
I wrote some code for getting MCS. OK let's see it....
At first, I wrote find_mcs.cpp. I'm new bee of c++ so most of code is came from Greg's post(I appreciate his great work).
#include <unistd.h> #include <fcntl.h> #include <sys/time.h> #include <sys/resource.h> #include <cstdio> #include <cstring> #include <ctime> #include <string> #include <iostream> #include <GraphMol/RDKitBase.h> #include <GraphMol/FileParsers/FileParsers.h> //MOL single molecule ! #include <GraphMol/FileParsers/MolSupplier.h> //SDF #include <GraphMol/FMCS/FMCS.h> #include <GraphMol/FileParsers/MolSupplier.h> using namespace RDKit; void get_mcs(std::string pathName, unsigned int maxToDo, std::vector<ROMOL_SPTR> &mols) { RDKit::SDMolSupplier suppl(pathName, true, true, true); unsigned int nDone = 0; while (!suppl.atEnd() && (maxToDo <= 0 || nDone < maxToDo)) { RDKit::ROMol *m = suppl.next(); if (!m) { continue; } m->updatePropertyCache(); // the tautomer hash code uses conjugation info MolOps::setConjugation(*m); nDone += 1; mols.push_back(ROMOL_SPTR((ROMol *)m)); } std::cerr << "read: " << nDone << " mols." << std::endl; MCSParameters p; MCSResult res = findMCS(mols, &p); std::cout << "\n" << "MCS: " << res.SmartsString << " " << res.NumAtoms << " atoms, " << res.NumBonds << " bonds\n"; } int main(int argc, char *argv[]) { std::vector<ROMOL_SPTR> mols; get_mcs(argv[1], 10000, mols); }
Next, I worte CMakeLists.txt.
cmake_minimum_required(VERSION 3.18) project(simple_cxx_example) set(CMAKE_CXX_STANDARD 14) set(CMAKE_CXX_STANDARD_REQUIRED True) find_package(RDKit REQUIRED) find_package(Boost COMPONENTS timer system REQUIRED) add_executable(find_mcs find_mcs.cpp) target_link_libraries(find_mcs RDKit::FileParsers RDKit::FMCS Boost::timer)
Now ready. Before compiling the code, I wrote python script to do same task as C++ code.
import sys from rdkit import Chem from rdkit.Chem import rdFMCS def get_mcs(mols): mcs = rdFMCS.FindMCS(mols) return mcs if __name__=='__main__': mols = Chem.SDMolSupplier(sys.argv[1]) mcs = get_mcs(mols) print(mcs.smartsString)
Almost done Next, I compiled cpp file and run it.
$ mkdir build $ cd build $ cmake .. $ make find_mcs
After running the code shown above, I could get find_mcs file.
OK let's compare C++ and Python code. I used cdk2.sdf as test file. As you can see find_mcs implemented c++ works faster than python implementation.
(chemoinfo) iwatobipen@penguin:~/dev/sandbox/cpp/build$ time ./find_mcs cdk2.sdf read: 47 mols. MCS: [#6](:,-[#7]:,-[#6]):,-[#6]:,-[#6] 5 atoms, 4 bonds real 0m0.025s user 0m0.017s sys 0m0.008s (chemoinfo) iwatobipen@penguin:~/dev/sandbox/cpp/build$ time python find_mcs.py cdk2.sdf [#6](:,-[#7]:,-[#6]):,-[#6]:,-[#6] real 0m0.177s user 0m0.480s sys 0m1.368s
It's interesting and exciting results for me. I'll learn c++ more and more.
Today's code is pushed my github repo. Any suggestions, advice are apprecaited.
https://github.com/iwatobipen/rdkit_cpp
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.