I am calculating the structural similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7), it works fine.
I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, and the code works only for the first 10 rows (0-9). For later rows I get this error:
s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
ValueError: BitVects must be same length
The block of code is given below
data = pd.read_csv(os.path.join(os.path.join(os.getcwd(), "dataset"), "test_ssp.csv"), index_col=None)
# Proof and make a list of SMILES and id
c_smiles = []
count = 0
for index, row in data.iterrows():
    try:
        cs = Chem.CanonSmiles(row['SMILES'])
        c_smiles.append([row['ID_Name'], cs])
    except:
        count = count + 1
        print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])
# make a list of id, smiles, and mols
ms = []
df = DataFrame(c_smiles,columns=['ID_Name','SMILES'])
for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])
# make a list of id, smiles, mols, and fingerprints (fp)
fps = []
df_fps = DataFrame(ms,columns=['ID_Name','SMILES', 'mol'])
df_fps.head
for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])
fps_2 = DataFrame(fps,columns=['ID_Name','fps'])
fps_2 = fps_2[fps_2.columns[1]]
fps_2 = fps_2.values.tolist()
# compare all fp pairwise without duplicates
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles2[n])
        ta.append(c_smiles2[n+1:][m])
        sim.append(s[m])
Can you tell me why I am getting this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?
Reproducible Data
DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
First, to install a specific version of RDKit, you can run this command:
conda install -c rdkit rdkit=2020.09.2
Coming to the original question, the error is caused by the function FingerprintMols.FingerprintMol().
For whatever internal reason, it converts the first 10 SMILES to 2048-bit vectors but the 11th SMILES to a 1024-bit vector. The older versions tolerate this mismatch, but newer versions do not. There are two options to fix this: install the older version using the command above, or replace
FingerprintMols.FingerprintMol(row['mol'])
with
FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
bitsPerHash=2, useHs=True, tgtDensity=0.0,
minSize=128)
In the replacement, all arguments other than fpSize are set to their default values and fpSize is fixed to 2048. Please note that you must pass all the arguments, not just fpSize.
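To confirm that mismatched lengths are the cause in your own data, you can group the fingerprints by length before comparing them. The sketch below is only an illustration: group_indices_by_length is a hypothetical helper name, and the demo list uses plain bit strings as stand-ins for the real fingerprints. It assumes len() reports the number of bits, which I believe is how RDKit's ExplicitBitVect behaves.

```python
from collections import defaultdict


def group_indices_by_length(fps):
    """Map each fingerprint length to the list of row indices that have it.

    Hypothetical helper: works on anything that supports len(), so it can
    be tried on plain strings here and on the fingerprint list afterwards.
    """
    groups = defaultdict(list)
    for i, fp in enumerate(fps):
        groups[len(fp)].append(i)
    return dict(groups)


# Stand-in data (an assumption for illustration): ten 2048-bit strings
# followed by one 1024-bit string, mirroring the mismatch described above.
demo = ["0" * 2048] * 10 + ["0" * 1024]

lengths = group_indices_by_length(demo)
for size, rows in lengths.items():
    print(size, "bits at rows", rows)
```

On the real data you would pass fps_2 instead of demo. If more than one length shows up, the rows listed under the minority length are the ones that make BulkTanimotoSimilarity fail, and after applying the fixed-size replacement only a single length should remain.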