ValueError: BitVects must be same length (rdkit)


I am calculating the structure similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7), it works fine.

I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, and the code works only for the first 10 rows (0-9). For later rows I get this error:

 s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:]) 
    ValueError: BitVects must be same length

The block of code is given below

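  import os

  import pandas as pd
  from pandas import DataFrame
  from rdkit import Chem, DataStructs
  from rdkit.Chem.Fingerprints import FingerprintMols

  data = pd.read_csv(os.path.join(os.getcwd(), "dataset", "test_ssp.csv"), index_col=None)

  # Check the SMILES and make a list of IDs and canonical SMILES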
  c_smiles = []
  count = 0
  for index, row in data.iterrows():
    try:
      cs = Chem.CanonSmiles(row['SMILES'])
      c_smiles.append([row['ID_Name'], cs])
    except Exception:  # CanonSmiles raises when the SMILES cannot be parsed
      count = count + 1
      print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])

  # make a list of id, smiles, and mols
  ms = []
  df = DataFrame(c_smiles,columns=['ID_Name','SMILES'])
  for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])

  # make a list of id, smiles, mols, and fingerprints (fp)
  fps = []
  df_fps = DataFrame(ms,columns=['ID_Name','SMILES', 'mol'])
  df_fps.head()

  for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])


  fps_2 = DataFrame(fps,columns=['ID_Name','fps'])
  fps_2 = fps_2[fps_2.columns[1]]
  fps_2 = fps_2.values.tolist()


  # compare all fp pairwise without duplicates
  # NOTE: c_smiles2 is not defined in the snippet shown here
  qu, ta, sim = [], [], []
  for n in range(len(fps_2)):
      s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
      for m in range(len(s)):
          qu.append(c_smiles2[n])
          ta.append(c_smiles2[n+1:][m])
          sim.append(s[m])

Can you tell me why I get this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?

Reproducible Data

DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1

Solution

  • First, to install a specific version of RDKit, you can run this command:

    conda install -c rdkit rdkit=2020.09.2
    

    As for the original question, the error comes from this function:

    FingerprintMols.FingerprintMol()
    

    For reasons internal to RDKit, it converts the first 10 SMILES to 2048-bit vectors but the 11th SMILES to a 1024-bit vector. (A likely cause: when called with no keyword arguments, the helper folds sparse fingerprints toward a target bit density, so a molecule that sets few bits can end up with a shorter vector.) Older versions tolerate this mismatch, but newer versions do not. There are two options to fix this:

    1. Downgrade RDKit to an older version using the command above.
    2. Fix the length of the vector by passing it as an argument. That is, replace the line
    FingerprintMols.FingerprintMol(row['mol'])
    

    with

    FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
                                   bitsPerHash=2, useHs=True, tgtDensity=0.0,
                                   minSize=128)
    

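    In the replacement, fpSize is fixed at 2048 and the remaining arguments are set to default values; as far as I can tell these match the defaults of Chem.RDKFingerprint, where tgtDensity=0.0 means the fingerprint is not folded to a shorter length. Note that you must pass all of the arguments, not just fpSize.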
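
To confirm that mixed fingerprint lengths really are the cause before changing anything, it can help to group the fingerprints by length. The following is a minimal sketch and not part of RDKit: group_indices_by_length is a hypothetical helper, and it assumes each fingerprint supports len() (RDKit bit vectors do as far as I know, and GetNumBits() is the explicit accessor). Plain Python lists stand in for bit vectors here so that the snippet runs without RDKit installed.

```python
from collections import defaultdict


def group_indices_by_length(fingerprints):
    """Map each fingerprint length to the list of positions that have it.

    Hypothetical helper for illustration; works on any objects with len().
    """
    groups = defaultdict(list)
    for position, fp in enumerate(fingerprints):
        groups[len(fp)].append(position)
    return dict(groups)


# Stand-in data (an assumption for the demo): ten 2048-long vectors
# followed by one 1024-long vector, mirroring the situation described above.
fake_fps = [[0] * 2048 for _ in range(10)] + [[0] * 1024]

lengths = group_indices_by_length(fake_fps)
for length, positions in lengths.items():
    print(length, positions)
```

On the data from the question, calling group_indices_by_length(fps_2) should show which rows produced the shorter vectors. If more than one length shows up, a bulk similarity call over the whole list will hit the mismatch.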