Search code examples
pythonpandasdataframemelt

Cleaning Excel File Using Pandas


I have an excel file which I've read using Pandas the output is as follows:

+------------+-----+-----+-----+
| Type       | 1   | 2   | 3   |
| Category   | A   | A   | C   |
| Dates      | NaN | NaN | NaN |
| 01/01/2021 | 12  | 12  | 9   |
| 02/01/2021 | 10  | 10  | 2   |
| 03/01/2021 | 30  | 16  | NaN |
| 04/01/2021 | 15  | 23  | 4   |
| 05/01/2021 | 14  | 20  | 5   |
+------------+-----+-----+-----+

The first two rows gives information for each time series column wise. So for column 1 the Type is 1 and Category is A. I want to melt the time series and not quite sure how to approach the problem given the structure of the sheet.

Expected Output:

+------------+-------+----------+------+
|   Dates    | Price | Category | Type |
+------------+-------+----------+------+
| 01/01/2021 |    12 | A        |    1 |
| 02/01/2021 |    10 | A        |    1 |
| 03/01/2021 |    30 | A        |    1 |
| 04/01/2021 |    15 | A        |    1 |
| 05/01/2021 |    14 | A        |    1 |
| 01/01/2021 |    12 | B        |    2 |
| 02/01/2021 |    10 | B        |    2 |
| 03/01/2021 |    16 | B        |    2 |
| 04/01/2021 |    23 | B        |    2 |
| 05/01/2021 |    20 | B        |    2 |
| 01/01/2021 |     9 | C        |    3 |
| 02/01/2021 |     2 | C        |    3 |
| 04/01/2021 |     4 | C        |    3 |
| 05/01/2021 |     5 | C        |    3 |
+------------+-------+----------+------+

In case of Type 3 and Category C since value is NaN we drop that date. How can achieve the expected output?


Solution

  • Assuming the following input dataframe:

             col0 col1 col2 col3
    0        Type    1    2    3
    1    Category    A    A    C
    2       Dates  NaN  NaN  NaN
    3  01/01/2021   12   12    9
    4  02/01/2021   10   10    2
    5  03/01/2021   30   16  NaN
    6  04/01/2021   15   23    4
    7  05/01/2021   14   20    5
    

    Here is a working pipeline:

    (df.iloc[3:]
       .set_index('col0').rename_axis('Date') # set first column aside
       # next 3 lines to rename columns index
       .T
       .set_index(pd.MultiIndex.from_arrays(df.iloc[:2, 1:].values, names=df.iloc[:2, 0]))
       .T
       .stack(level=[0,1]) # columns to rows
       .rename('Price')    # rename last unnamed column
       .reset_index()      # all indexes back to columns
    )
    

    output:

              Date Type Category Price
    0   01/01/2021    1        A    12
    1   01/01/2021    2        A    12
    2   01/01/2021    3        C     9
    3   02/01/2021    1        A    10
    4   02/01/2021    2        A    10
    5   02/01/2021    3        C     2
    6   03/01/2021    1        A    30
    7   03/01/2021    2        A    16
    8   04/01/2021    1        A    15
    9   04/01/2021    2        A    23
    10  04/01/2021    3        C     4
    11  05/01/2021    1        A    14
    12  05/01/2021    2        A    20
    13  05/01/2021    3        C     5