I have a directory full of 60,000 files that are named by their molid. I have a second file in CSV format that has molids in column 1 with their respective CHEMBLID in column 2. I need to match the file name molid in the directory with a molid in the CSV file. If a match is found, the chemblid is added to the file (the file is rewritten to have the chemblid included). I am also using RDKit to calculate some properties I need written to the modified file too. I need to find a way to optimize this since I have to run it on 2 million files soon.
The way I am interacting with arg parse is by listing all of the molid.sdf files in my directory with the bash parallel command.
The csv file looks like:
molid,chembl
319855,CHEMBLtest
187481,CHEMBL1527718
https://www.dropbox.com/s/6ynd9vbwwf6lqka/output_2.csv?dl=0
The file that needs to be modified looks like:
298512 from gamess16 based ATb pipeline
OpenBabel06141815083D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0 0 0 0
1 17 1 0 0 0 0
2 20 1 1 0 0 0
2 1 1 0 0 0 0
3 21 1 0 0 0 0
3 2 1 0 0 0 0
4 2 1 0 0 0 0
4 22 1 0 0 0 0
5 6 1 0 0 0 0
5 4 1 0 0 0 0
6 7 1 0 0 0 0
7 26 1 0 0 0 0
7 8 2 0 0 0 0
8 27 1 0 0 0 0
9 8 1 0 0 0 0
9 10 2 0 0 0 0
10 28 1 0 0 0 0
11 10 1 0 0 0 0
11 12 1 0 0 0 0
12 31 1 0 0 0 0
12 30 1 0 0 0 0
13 11 2 0 0 0 0
14 15 2 0 0 0 0
14 13 1 0 0 0 0
15 9 1 0 0 0 0
16 6 2 0 0 0 0
16 15 1 0 0 0 0
18 1 1 0 0 0 0
23 4 1 0 0 0 0
24 5 1 0 0 0 0
25 5 1 0 0 0 0
29 12 1 0 0 0 0
32 13 1 0 0 0 0
33 14 1 0 0 0 0
34 16 1 0 0 0 0
M END
> <molid>
298512
$$$$
https://www.dropbox.com/s/9r9kandkbahgexj/298512.sdf?dl=0
And a modified file with how the current script works looks like:
298512 from gamess16 based ATb pipeline
RDKit 3D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0
1 17 1 0
2 20 1 1
2 1 1 0
3 21 1 0
3 2 1 0
4 2 1 0
4 22 1 0
5 6 1 0
5 4 1 0
6 7 1 0
7 26 1 0
7 8 2 0
8 27 1 0
9 8 1 0
9 10 2 0
10 28 1 0
11 10 1 0
11 12 1 0
12 31 1 0
12 30 1 0
13 11 2 0
14 15 2 0
14 13 1 0
15 9 1 0
16 6 2 0
16 15 1 0
18 1 1 0
23 4 1 0
24 5 1 0
25 5 1 0
29 12 1 0
32 13 1 0
33 14 1 0
34 16 1 0
M END
> <molid> (1)
298512
> <CHEMBLID> (1)
CHEMBL3278713
> <i_user_TOTAL_CHARGE> (1)
0
> <SMILES> (1)
'[H]OC([H])(C([H])([H])[H])C([H])([H])C([H])([H])C1C([H])=C([H])C2=C([H])C(=C([H])C([H])=C2C=1[H])C([H])([H])[H]'
> <InChI> (1)
'InChI=1S/C15H18O/c1-11-3-7-15-10-13(5-4-12(2)16)6-8-14(15)9-11/h3,6-10,12,16H,4-5H2,1-2H3/t12-/m0/s1'
$$$$
https://www.dropbox.com/s/dfcmiv7d298s1fl/298512.chembl.sdf?dl=0
import os,shutil,csv
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-molid", help="molids from file names", type=str)
args = parser.parse_args()
print(args)
fn = args.molid
print(fn)
suppl = Chem.SDMolSupplier(fn, removeHs=False, sanitize=False)
ms = [x for x in suppl if x is not None] #sanity check loop to make sure the files were read
print("This is the number of entries read in")
print(len(ms))
print(len(suppl))
w=Chem.SDWriter('totaltest_with_chembl.sdf') #writes new file with all of the chemblid
new_files_with_chembl_id=os.path.splitext(fn)[0]
w=Chem.SDWriter(new_files_with_chembl_id+'.chembl.sdf')
molid_chemblid = open('output_2.csv','r')
csv_f = csv.reader(molid_chemblid)
header = next(csv_f)
molidIndex = header.index("molid")
chemblidIndex = header.index("chembl")
molid_chemblidList = []
for line in csv_f:
molid = line[molidIndex]
# print(molid)
chembl = line[chemblidIndex]
# print(chembl)
molid_chemblidList.append([molid,chembl])
for m in suppl: #molecule in MoleculeSet
print(m)
atbname=m.GetProp("_Name")
fillmein=atbname.split( )[0]
moleculeCharge=Chem.GetFormalCharge(m)
smiles_string=Chem.MolToSmiles(m)
inchi_string=Chem.MolToInchi(m)
print("molecularCharge")
print(moleculeCharge)
print("smile_string")
print(smiles_string)
print("inchi string")
print(inchi_string)
if fillmein == molid:
print(chembl)
print(chembl)
print(line)
print("this is line in molid_chemblid",line)
m.SetProp("CHEMBLID",chembl)
m.SetProp("i_user_TOTAL_CHARGE",repr(moleculeCharge))
m.SetProp("SMILES",repr(smiles_string))
m.SetProp("InChI",repr(inchi_string))
w.write(m)
The molid in the CSV file sounds like a unique key. Read the CSV file into a map/associative array where the molid is the key and the rest of the row is the value, parsed or not as you need it. Python has builtin CSV parsers with import csv
.
Then loop just once over the files, find the chemblid by looking up the molid from the file name in the map.
This reduces the overall effort to roughly k*N where N is the number of files and molids and k is a small number.
Your algorithm has a loop within a loop which makes it N*N complexity. This indeed would take some time with N=2 million :-)
2 Million files is still a lot and may take between a few hours and a few days, depending how big the files are, how fast your disk access is and all that. Running a few threads in parallel will then help, until the I/O saturates. But test first, since implementing a parallel approach can get complicated. If you have to get through this only once, it might be ok to wait a bit longer.