python pandas data-conversion chemistry

Converting .CIF files to a dataset (csv, xls, etc)


How are you all? Hope you're doing well!

So, get this. I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from the Cambridge Structural Database would do wonders for me.

So far, I was able to convert them using ToposPro, but not to a format that I can read with pandas.

So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.

Also, sorry for any mistakes in my writing, English isn't my main language. And thanks for your help!


Solution

  • To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from biopython to first parse the .CIF file into a dictionary. Then use pandas.DataFrame.from_dict to create a DataFrame from that dictionary. Finally, call pandas.DataFrame.transpose to turn the rows into columns (we pass orient='index' to from_dict so that entries of different lengths are padded with "missing" values instead of raising an error).

    First, install biopython by running this line in your terminal (Windows in my case):

    pip install biopython
    

    Then, you can use the code below to read a specific .CIF file :

    import pandas as pd
    from Bio.PDB.MMCIF2Dict import MMCIF2Dict
    
    dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
    df = pd.DataFrame.from_dict(dico, orient='index')
    df = df.transpose()
    
    >>> display(df)

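    To see why orient='index' followed by transpose() is used, here is a minimal sketch with a made-up dictionary (the tags and values are illustrative, not taken from a real MOF file). A parsed .CIF dictionary maps each tag to a list, and those lists usually have different lengths:

```python
import pandas as pd

# Hypothetical stand-in for a parsed CIF dictionary:
# each tag maps to a list, and the lists have unequal lengths.
cif_like = {
    "_cell_length_a": ["10.5"],
    "_atom_site_label": ["Zn1", "O1", "C1"],
}

# orient='index' makes each tag a row and pads the short rows with missing values.
wide = pd.DataFrame.from_dict(cif_like, orient="index")

# transpose() then turns the tags back into columns.
df = wide.transpose()
print(df)
```

    The single-value column is padded with missing values below its first row, which is where the "missing" values in the full DataFrame come from.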

    Now, if you need to read the whole MOF collection (~10k files) as a DataFrame, you can use this:

    from pathlib import Path
    import pandas as pd
    from Bio.PDB.MMCIF2Dict import MMCIF2Dict
    from time import time
    
    mof_collection = r"path_to_the_MOF_collection"
    
    start = time()
    
    list_of_cif = []
    for file in Path(mof_collection).glob('*.cif'):
        dico = MMCIF2Dict(file)
        temp = pd.DataFrame.from_dict(dico, orient='index')
        temp = temp.transpose()
        temp.insert(0, 'Filename', file.stem)  # keep the .CIF filename (without extension)
        list_of_cif.append(temp)
    df = pd.concat(list_of_cif)
    
    end = time()
    
    print(f'The DataFrame of the MOF Collection was created in {end-start} seconds.')
    df
    
    >>> output


    Keep in mind that the .CIF files may not all have the same columns, so pd.concat fills the gaps with NaN. Feel free to concatenate the MOF collection or keep the files separate. Last but not least, if you want to save your DataFrame as a .csv and/or an .xlsx file, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:

    df.to_csv('your_output_filename.csv', index=False)
    df.to_excel('your_output_filename.xlsx', index=False)
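
    As a rough illustration of what concatenating files with different columns produces, the sketch below uses two tiny hand-written frames (hypothetical filenames and values) and writes the result to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Two hypothetical per-file frames that share only the 'Filename' column.
a = pd.DataFrame({"Filename": ["mof_a"], "_cell_length_a": ["10.5"]})
b = pd.DataFrame({"Filename": ["mof_b"], "_cell_volume": ["1200.0"]})

# concat keeps the union of the columns and fills the gaps with NaN.
combined = pd.concat([a, b], ignore_index=True)

# Write to a text buffer here; pass a filename instead to get a real .csv file.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

    Missing values are written as empty fields in the CSV, so the file can be read back with pandas.read_csv.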
    

    EDIT :

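    To read the structure of a .CIF file as a DataFrame, you can use pymatgen's as_dataframe() method. Note that recent pymatgen releases deprecate get_structures() in favour of parse_structures(), so you may see a warning with the code below: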

    from pymatgen.io.cif import CifParser
    
    parser = CifParser("abavij_P1.cif")
    structure = parser.get_structures()[0]
    structure.as_dataframe()
    
    >>> output


    In case you need to check if a .CIF file has a valid structure, you can use :

    if len(structure)==0:
        print('The .CIF file has no structure')
    

    Or:

    try:
        structure = parser.get_structures()[0]
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print('The .CIF file has no structure')
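
    If you want to apply the try/except check to the whole collection, one possible pattern is to collect the failing files instead of printing. This is only a sketch: split_valid_invalid and fake_parse are hypothetical names, and fake_parse stands in for whatever parsing call you actually use (for instance the pymatgen lines above):

```python
from pathlib import Path


def split_valid_invalid(paths, parse):
    """Call parse(path) on every path and sort the filenames by outcome."""
    valid, invalid = [], []
    for path in paths:
        try:
            parse(path)
        except Exception as exc:
            # Keep the filename and the error message for later inspection.
            invalid.append((Path(path).name, str(exc)))
        else:
            valid.append(Path(path).name)
    return valid, invalid


def fake_parse(path):
    """Hypothetical parser: fails for any filename containing 'bad'."""
    if "bad" in Path(path).name:
        raise ValueError("no structure found")


valid, invalid = split_valid_invalid(["good.cif", "bad.cif"], fake_parse)
print(valid)
print(invalid)
```

    With ~10k files, a list of the files that failed is usually easier to work with than thousands of printed messages.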