Search code examples
pythonpandasdataframeindexing

Python Pandas - how to read in data from list (data) and columns (separate list)


I'm running into a situation I don't know what to do:

The data is a list, no index. Sample data:

data = [
 {'fields': ['2024-10-07T21:22:01', 'USER-A', 21,  0,  0, 21]},
 {'fields': ['2024-10-07T21:18:28', 'USER-B', 20, 20,  0,  0, 0, 45]}
]

The column header is in another:

cols = ['Created On', 'Created By', 'Transaction Count (ALL)',
        'X Pending', 'X Cancelled (X)', 'X Completed (Y)']

I have tried using pandas.DataFrame as well as json_normalize, I either get a single column table with each value as a row, or I got all values as a column, and when I try with using "fields", it tells me "list indices must be integers or slices, not str" which I don't understand why I get this... what is the best way to have these info into a dataframe please?

(the number of data elements and number of column headers may not be consistent just for example sake, the real data has things aligned)


Solution

  • You could combine two DataFrame constructors:

    data = [{'fields': ['2024-10-07T21:22:01', 'USER-A', 21, 0, 0, 21]},
            {'fields': ['2024-10-07T21:18:28', 'USER-B', 20, 20, 0, 0, 0, 45]},
           ]
    
    out = pd.DataFrame(pd.DataFrame(data)['fields'].tolist())
    

    Output:

                         0       1   2   3  4   5    6     7
    0  2024-10-07T21:22:01  USER-A  21   0  0  21  NaN   NaN
    1  2024-10-07T21:18:28  USER-B  20  20  0   0  0.0  45.0
    

    If you also have a list of columns cols, you could truncate the columns:

    cols = ['Created On', 'Created By', 'Transaction Count (ALL)',
            'X Pending', 'X Cancelled (X)', 'X Completed (Y)']
    
    out = pd.DataFrame(pd.DataFrame(data)['fields'].str[:len(cols)].tolist(),
                       columns=cols)
    

    Output:

                Created On Created By  Transaction Count (ALL)  X Pending  X Cancelled (X)  X Completed (Y)
    0  2024-10-07T21:22:01     USER-A                       21          0                0               21
    1  2024-10-07T21:18:28     USER-B                       20         20                0                0
    

    Or rename to keep the extra columns:

    out = (pd.DataFrame(pd.DataFrame(data)['fields'].tolist())
             .rename(columns=dict(enumerate(cols)))
           )
    

    Output:

                Created On Created By  Transaction Count (ALL)  X Pending  X Cancelled (X)  X Completed (Y)    6     7
    0  2024-10-07T21:22:01     USER-A                       21          0                0               21  NaN   NaN
    1  2024-10-07T21:18:28     USER-B                       20         20                0                0  0.0  45.0
    

    But, honestly, better pre-process in pure python, this will be more efficient/explicit:

    # truncation
    out = pd.DataFrame((dict(zip(cols, d['fields'])) for d in data))
    
    # alternative truncation
    out = pd.DataFrame([d['fields'][:len(cols)] for d in data], columns=cols)
    
    # renaming
    out = (pd.DataFrame([d['fields'] for d in data])
             .rename(columns=dict(enumerate(cols)))
          )