Search code examples
pandasdataframeloggingstringio

reading logfiles with pandas (tab/ newline separated, each row contains a column and value)


I am processing log files with pandas with the following structure, all log files have the same structure and contain data about one machine that should be reducable to 1 row:

Column1     Value1
Column2     Value2
Column3     Value3
Column4     Value4
Column5     Value5

I am using the following code:

import pandas as pd
import glob

log_files = [i for i in glob.glob(inputdir+'\**\*.log', recursive=True)]

appended_data = []

for logfile in log_files:
    fileobject = open(logfile)
    df = pd.read_csv(fileobject, sep='\t',  lineterminator='\n', names=['Column','Value'])
    df = df.pivot(columns = 'Column', values = 'Value')
    appended_data.append(df)

logdf = pd.concat(appended_data)
logdf = logdf.reset_index(drop=True)
logdf = logdf.rename_axis(columns=None)

This however creates 1 row for each column, instead of reducing all rows to one:

Column  Column1 Column2 Column3 Column4 Column5 
0   1   NaN NaN NaN NaN 
1   NaN 2   NaN NaN NaN 
2   NaN NaN 3   NaN NaN 
3   NaN NaN NaN 4   NaN 
4   NaN NaN NaN NaN 5   

The df should have the following format:

    Column1 Column2 Column3 Column4 Column5 
0   1       2       3       4       5   

Is there an effective way to solve this issue by changing the read CSV settings or by transforming the df?

The following solution works but I don't think it's particulary great.

df.sort_values(by='A',inplace=True)
df = df.fillna(method='ffill')
df.drop_duplicates(["A"],keep='last',inplace=True)

Solution

  • I think you are concatenating along rows, like this:

    import pandas as pd
    
    appended_data = [
        pd.DataFrame({"Column1": ["1"],}),
        pd.DataFrame({"Column2": ["2"],}),
        pd.DataFrame({"Column3": ["3"],}),
        pd.DataFrame({"Column4": ["4"],}),
        pd.DataFrame({"Column5": ["5"],}),
    ]
    
    logdf = pd.concat(appended_data)
    logdf = logdf.reset_index(drop=True)
    logdf = logdf.rename_axis(columns=None)
    
    print(logdf)
    # Output
      Column1 Column2 Column3 Column4 Column5
    0       1     NaN     NaN     NaN     NaN
    1     NaN       2     NaN     NaN     NaN
    2     NaN     NaN       3     NaN     NaN
    3     NaN     NaN     NaN       4     NaN
    4     NaN     NaN     NaN     NaN       5
    

    As per Pandas documentation, you can concatenate along columns by specifying axis=1, like this:

    logdf = (
        pd
        .concat(appended_data, axis=1)
        .reset_index(drop=True)
        .rename_axis(columns=None)
    )
    
    print(logdf)
    # Output
      Column1 Column2 Column3 Column4 Column5
    0       1       2       3       4       5