regex python-3.x pandas glob

pandas DataFrame not printing as a table


I have been working on extracting data from a large number of files. I want to form a table of the data, with the file base name as the leftmost column and the numerical data in the next. So far, I have been testing on a folder containing 8 files, but I hope to be able to read hundreds.

I have tried adding an index, but that seemed to cause more problems. I am attaching the closest working code I have come up with, alongside the output.

In:

import re, glob
import pandas as pd

pattern = re.compile('-\d+\D\d+\skcal/mol', flags=re.S)
for file in glob.glob('*rank_*.pdb'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            Dock_energy = {file:[],result:[]}
            df = pd.DataFrame(Dock_energy)
            df.append(df)
    df = df.append(df)
    print(df)

This seems to work for extracting the data, but it is not in the form I am looking for.

Out:

Empty DataFrame
Columns: [-10.02 kcal/mol, MII_rank_8.pdb]
Index: []
Empty DataFrame
Columns: [-12.51 kcal/mol, MII_rank_5.pdb]
Index: []
Empty DataFrame
Columns: [-13.47 kcal/mol, MII_rank_4.pdb]
Index: []
Empty DataFrame
Columns: [-14.67 kcal/mol, MII_rank_2.pdb]
Index: []
Empty DataFrame
Columns: [-13.67 kcal/mol, MII_rank_3.pdb]
Index: []
Empty DataFrame
Columns: [-14.80 kcal/mol, MII_rank_1.pdb]
Index: []
Empty DataFrame
Columns: [-11.45 kcal/mol, MII_rank_7.pdb]
Index: []
Empty DataFrame
Columns: [-12.47 kcal/mol, MII_rank_6.pdb]
Index: []

What is causing the fractured table, and why are my columns in reverse order from what I am hoping? Any help is greatly appreciated.


Solution

  • This should be closer to what you intend:

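    import glob
    import re
    import pandas as pd

    # Same pattern as in the question, written as a raw string.
    pattern = re.compile(r'-\d+\D\d+\skcal/mol', flags=re.S)

    all_data = []  # one [file, result] row per match, across all files
    for file in glob.glob('*rank_*.pdb'):
        with open(file) as fp:
            file_data = []
            for result in pattern.findall(fp.read()):
                file_data.append([file, result])
        all_data.extend(file_data)
    # Build the DataFrame once, after every file has been read.
    df = pd.DataFrame(all_data, columns=['file', 'result'])
    print(df)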
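
As for the cause: in the question's code, `{file: [], result: []}` uses the file name and the matched text as column names and gives each column an empty list, so every frame has two columns and zero rows. `df.append(df)` returns a new frame instead of modifying `df` in place (and `DataFrame.append` was removed in pandas 2.0), and `print` runs once per file, which produces the fractured output. The column order is most likely because older pandas versions sorted dict keys when building columns, and `-` sorts before `M`.

If the goal is a numeric column, a minimal sketch along the following lines may help. The function name `collect_energies`, the column name `energy_kcal_mol`, and the tightened pattern (a literal decimal point, as in the output shown) are assumptions for illustration, not part of the original code.

```python
import re
from pathlib import Path

import pandas as pd

# Assumption: energies look like "-10.02 kcal/mol", as in the question's output.
ENERGY = re.compile(r'(-\d+\.\d+)\skcal/mol')


def collect_energies(folder='.', file_glob='*rank_*.pdb'):
    """Return one row per match: file base name and energy as a float."""
    rows = []
    for path in sorted(Path(folder).glob(file_glob)):
        text = path.read_text()
        for value in ENERGY.findall(text):
            # path.name is the base name; path.stem would also drop ".pdb".
            rows.append({'file': path.name, 'energy_kcal_mol': float(value)})
    # Passing columns= keeps the header even when nothing matched.
    return pd.DataFrame(rows, columns=['file', 'energy_kcal_mol'])


if __name__ == '__main__':
    print(collect_energies())
```

Collecting plain rows in a list and constructing the frame once at the end also tends to scale better to hundreds of files than growing a DataFrame inside the loop.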