Search code examples
pythonexcelpandasxlrd

In Python How Do I re-organise inconsistent columns from Excel?


I have got a bunch of reports, all containing the same information, but the columns are inconsistent in order and titles, so when I run the functions to parse the information I am not getting out what I want.

How can I use python to organise the columns in a consistent way to work with later down the line?

Currently the columns look like;

['Comment', 'Rec Date', 'Receipt Number', 'Rec Amt', 'Receipt Method', 'Status']
['Comment', 'State', 'Rec No', 'Rec Date', 'Rec Amt', 'Status']
['Receipt Number', 'Rec Date', 'Rec Amt', 'Comment', 'State', 'Status']
['Comment', 'Receipt Number', 'Rec Date', 'Rec Amt', 'State', 'Rec Method']
['Comment', 'Receipt Number', 'Rec Date', 'Rec Amt', 'Receipt Method', 'State']
['Comment', 'Receipt Number', 'Rec Date', 'Rec Amt', 'State', 'Rec Method']

I would like all of these categories to be included as columns. (Points that don't have a value for a specific column would get some placeholder or empty value.)


Solution

  • First read_excel for DataFrame:

    df = pd.read_excel('file.xlsx') 
    

    I believe you need rename columns by dict and then sort them by sort_index:

    d = {'Rec No':'Receipt Number', 'Rec Method':'Receipt Method',...}
    df = df.rename(columns=d).sort_index(axis=1)