python, pandas, regex, findall

Merging multiple findall outputs into a dataframe


My code is intended to read multiple files (two examples below), match the digits on each line of every file, and then combine all matches, together with the filename where each was found, into a dataframe. My first issue is that the findall outputs come back one line at a time and I'm not sure how to append them properly - the findall outputs look like:

65
45
78
etc
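
For instance, calling findall on a single line just returns a list of the digit strings found on that line (empty when there are none), so every line produces its own separate list:

import re

re.findall(r'\d+', 'trust 65')   # ['65']
re.findall(r'\d+', 'musca ')     # []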

Two file examples are below:

F1:

trust 65
musca 
linca 75
trig 
torst 50

F2:

munk 65
liki 34
grub

I want my code to generate the following final dataframe:

Filename score
F1  65
F1  75
F1  50
F2  65
F2  34

My code attempt:

import os
import re
import glob
import pandas as pd

final={}
for f in glob.glob("*.txt"):
    with open(f,"r") as In1:
        (filename,ext)=os.path.splitext(f)
        for line in In1:
            m=re.findall(r'\d+',line)
            if len(m) > 0:
                all=[]
                all.append(m)
                final[filename]=all

df=pd.DataFrame(final.items(),columns=['Filename','Score'])

Can someone point me in the right direction please?


Solution

  • Here's a way to do what your question asks:

    import pandas as pd
    from io import StringIO
    
    fileStrings = {
    'F1': '''
    trust 65
    musca 
    linca 75
    trig 
    torst 50
    ''',
    
    'F2': '''
    munk 65
    liki 34
    grub
    '''
    }
    
    # For each file: parse the space-separated text, keep the second column
    # (the score), drop the rows that have no score, then stack everything.
    res = pd.concat([
        pd.DataFrame({
            'Filename': k,
            'score': pd.read_csv(StringIO(v), header=None, sep=' ').iloc[:, 1].dropna()})
        for k, v in fileStrings.items()]).reset_index(drop=True)
    print(res)
    

    Output:

      Filename  score
    0       F1   65.0
    1       F1   75.0
    2       F1   50.0
    3       F2   65.0
    4       F2   34.0
    

    The above example uses the strings read from the two files detailed in the question. Changing the variable fileStrings to contain the names and string contents of any number of files will also work.
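
    To see why the dropna() matters: parsing one file's text on its own leaves NaN in the second column for every line that has no score, so dropping those NaNs keeps exactly the rows you want. A quick sketch of that intermediate step (using the F1 text from above):

    from io import StringIO
    import pandas as pd

    f1 = "trust 65\nmusca \nlinca 75\ntrig \ntorst 50\n"
    parsed = pd.read_csv(StringIO(f1), header=None, sep=' ')
    # column 1 holds the score, or NaN where the line had no number:
    # 65.0, NaN, 75.0, NaN, 50.0
    print(parsed.iloc[:, 1].dropna())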

    UPDATE:

    Here's a way to do what your question asks, but for an arbitrary number of files read from a directory:

    import pandas as pd
    import os

    # Collect every .txt file in the "test_dir" folder under the current working directory
    dir = os.path.join(os.getcwd(), "test_dir")
    fileNames = [file for file in os.listdir(dir) if file.endswith(".txt")]
    print(fileNames)
    # Same idea as above, but read_csv parses each file directly from disk
    res = pd.concat([
        pd.DataFrame({
            'Filename': file,
            'score': pd.read_csv(os.path.join(dir, file), header=None, sep=' ').iloc[:, 1].dropna()})
        for file in fileNames]).reset_index(drop=True)
    print(res)
    

    State of input directory:

    F1.txt:
    trust 65
    musca 
    linca 75
    trig 
    torst 50
    
    F2.txt:
    munk 65
    liki 34
    grub
    
    F3.txt:
    a 3
    b
    c 3
    
    F4.txt:
    a 4
    b
    c 4
    

    Output:

    ['F1.txt', 'F2.txt', 'F3.txt', 'F4.txt']
      Filename  score
    0   F1.txt   65.0
    1   F1.txt   75.0
    2   F1.txt   50.0
    3   F2.txt   65.0
    4   F2.txt   34.0
    5   F3.txt    3.0
    6   F3.txt    3.0
    7   F4.txt    4.0
    8   F4.txt    4.0
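
    If you'd rather stay close to the re.findall approach from your attempt, here is a minimal sketch (assuming the .txt files sit in the current working directory) that collects one (filename, score) row per match in a flat list and builds the DataFrame once at the end, instead of overwriting a per-file list inside the loop:

    import glob
    import os
    import re

    import pandas as pd

    rows = []
    for f in glob.glob("*.txt"):
        filename, _ext = os.path.splitext(f)
        with open(f, "r") as In1:
            for line in In1:
                # findall returns a list of every digit run on the line
                for m in re.findall(r'\d+', line):
                    rows.append({"Filename": filename, "score": int(m)})

    df = pd.DataFrame(rows, columns=["Filename", "score"])
    print(df)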