I'm running into a situation I don't know what to do:
The data is a list, no index. Sample data:
data = [
{'fields': ['2024-10-07T21:22:01', 'USER-A', 21, 0, 0, 21]},
{'fields': ['2024-10-07T21:18:28', 'USER-B', 20, 20, 0, 0, 0, 45]}
]
The column header is in another:
cols = ['Created On', 'Created By', 'Transaction Count (ALL)',
'X Pending', 'X Cancelled (X)', 'X Completed (Y)']
I have tried using pandas.DataFrame
as well as json_normalize
, I either get a single column table with each value as a row, or I got all values as a column, and when I try with using "fields", it tells me "list indices must be integers or slices, not str" which I don't understand why I get this... what is the best way to have these info into a dataframe please?
(the number of data elements and number of column headers may not be consistent just for example sake, the real data has things aligned)
You could combine two DataFrame
constructors:
data = [{'fields': ['2024-10-07T21:22:01', 'USER-A', 21, 0, 0, 21]},
{'fields': ['2024-10-07T21:18:28', 'USER-B', 20, 20, 0, 0, 0, 45]},
]
out = pd.DataFrame(pd.DataFrame(data)['fields'].tolist())
Output:
0 1 2 3 4 5 6 7
0 2024-10-07T21:22:01 USER-A 21 0 0 21 NaN NaN
1 2024-10-07T21:18:28 USER-B 20 20 0 0 0.0 45.0
If you also have a list of columns cols
, you could truncate the columns:
cols = ['Created On', 'Created By', 'Transaction Count (ALL)',
'X Pending', 'X Cancelled (X)', 'X Completed (Y)']
out = pd.DataFrame(pd.DataFrame(data)['fields'].str[:len(cols)].tolist(),
columns=cols)
Output:
Created On Created By Transaction Count (ALL) X Pending X Cancelled (X) X Completed (Y)
0 2024-10-07T21:22:01 USER-A 21 0 0 21
1 2024-10-07T21:18:28 USER-B 20 20 0 0
Or rename
to keep the extra columns:
out = (pd.DataFrame(pd.DataFrame(data)['fields'].tolist())
.rename(columns=dict(enumerate(cols)))
)
Output:
Created On Created By Transaction Count (ALL) X Pending X Cancelled (X) X Completed (Y) 6 7
0 2024-10-07T21:22:01 USER-A 21 0 0 21 NaN NaN
1 2024-10-07T21:18:28 USER-B 20 20 0 0 0.0 45.0
But, honestly, better pre-process in pure python, this will be more efficient/explicit:
# truncation
out = pd.DataFrame((dict(zip(cols, d['fields'])) for d in data))
# alternative truncation
out = pd.DataFrame([d['fields'][:len(cols)] for d in data], columns=cols)
# renaming
out = (pd.DataFrame([d['fields'] for d in data])
.rename(columns=dict(enumerate(cols)))
)