Search code examples
pythonpandasdata-cleaning

Systematic heading-to-value data cleaning in Pandas


I have a very wide dataset which consists of hundreds of date-value column pairs - however the heading of the values column contains the reference of the site from which the data is taken. I'd like to take this header as a new "site_name" column before pivoting this data to a long format.

The data for each site is the same 2-column format, so I'd like to be able to apply a solution across the whole dataset at once.

My code below illustrates the problem on a single date-value pair

Note: I've used asterisks to mean I'm describing the column names, rather than quoting them

import pandas as pd

current = pd.DataFrame({"*unnamed_date_column*" : ["2021-10-21", "2021-10-22", "2021-10-23"],
                        "*unique_site_name*"    : [1.1, 1.2, 1.3]})

desired = pd.DataFrame({"date"   : ["2021-10-21", "2021-10-22", "2021-10-23"], 
                        "values" : [1.1, 1.2, 1.3], 
                        "site"   : ["unique_site_name", "unique_site_name", "unique_site_name"]})

Solution

  • Difficult to know how this will generalize without knowing more examples, but you could try:

    desired = (current
     .assign(site=current.columns[-1]) # arbitrarily chose to index from end
     .rename(columns=dict(zip(current.columns, ['date', 'values'])))
    )
    

    output:

             date  values                site
    0  2021-10-21     1.1  *unique_site_name*
    1  2021-10-22     1.2  *unique_site_name*
    2  2021-10-23     1.3  *unique_site_name*