Search code examples
pythonpandasmulti-index

Reformatting multiple rows from excel as usable (Multi)Index


I am new to pandas - I am trying to extract some useful information from a (very graphically designed) per week Excel file that represents the booking of certain rooms, 52 files for one year. It contains the name of the employee, the working hours and the project. i am mainly interested in

  • Which employee worked on which project for how many days / hours

From all i read about Panda, this is pretty difficult case, since the representation of the data is not well suited for extraction.

Basically the only relevant info in the first column is the ROOM statement, which is then followed by descriptive 4 rows of text that i dont need in this context. Starting with the ROOM string, on each date column, there are 4 lines of relevant info which i need to extract. For my usecase, i dont need to know who worked in which room, but the ROOMS are what are used as the index.

Right now i am stuck in how to reformat the first column in a way that Panda can make use of it in a meaningful way. My idea is, that i i search for any line that contains ROOM, create a new index Column from the resulting boolean statement, use these as a multiindex, etc. but i am stuck at the very beginning.

Before i continue in that direction, i would like to understand if the general approach on how to make a document like this more usable for Pandas, i wanted to check if there are some best practices on how to deal with it and if my idea of creating a multindex out of the 4 row room description is the right approach... Like i said i am very new to pandas, so please forgive if there are general misunderstandings on how to work with this - or IF what i want to do is doable in Pandas at all...

import pandas as pd
from pathlib import Path
# Assign spreadsheet filename to `file`
file = './data/Week02-2019.xlsx'                                                                   
# Load spreadsheet
xl = pd.ExcelFile(file)

# Remove all Sorts of crap on loading
df1 = xl.parse(sheet_name='Booking', header=1, parse_dates=[1], na_values=['xxx',''], skiprows=[2,3,4,5,6,7,8,9,10,19,28,37,46,64], usecols="A:H")

print (df1.iloc[:,0].str.contains("ROOM"),"NewIndex")

+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
|        2018         |   01.01.2018   |   02.08.2018   |   03.08.2018   | 04.08.2018 | 05.08.2018 | 06.08.2018 | 07.08.2018 |
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
| ROOM 01 (Morning)   | John Doe       | Jane Doe       | Donny Doe      |            |            |            |            |
| Very Nice           | Project#       | Project#       | Project#       |            |            |            |            |
| Good Projector      | Project Title  | Project Title  | Project Title  |            |            |            |            |
| Telephone 1234      | 9:30-17.00     | 8-13.00        | 12-14.00       |            |            |            |            |
| ---------           | ----           | ----           | ----           | ---        | ----       | ---        | ---        |
| ROOM 01 (Afternoon) | Alan Smithee   | Susi  Smithee  | Donald Smithee |            |            |            |            |
| Very Nice           | Project#       | Project#       | Project#       |            |            |            |            |
| Good Projector      | Project Title  | Project Title  | Project Title  |            |            |            |            |
| Telephone 1234      | 17:30-21.00    | 13.15-16.00    | 14.15-16.00    |            |            |            |            |
| -----               | ----------     | ----------     | ---------      | ----       | ---        | ---        |            |
| ROOM 02 (Morning)   | Jimmy Doe      | Duffy Duck     | Benny Blanco   |            |            |            |            |
| Not So Nice         | Project#       | Project#       | Project#       |            |            |            |            |
| Whiteboard          | Project Title  | Project Title  | Project Title  |            |            |            |            |
| Telephone 5678      | 9:30-17.00     | 8-13.00        | 12-14.00       |            |            |            |            |
| ---------           | ----           | ----           | ----           | ---        | ----       | ---        | ---        |
| ROOM 02 (Afternoon) | Doris Day      | Teddy Kaczinsky| Ru Paul        |            |            |            |
| Not so Nice         | Project#       | Project#       | Project#       |            |            |            |            |
| Whiteboard          | Project Title  | Project Title  | Project Title  |            |            |            |            |
| Telephone 5678      | 17:30-21.00    | 13.15-16.00    | 14.15-16.00    |            |            |            |            |

+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+

Solution

  • Use DataFrame.set_index with DataFrame.stack and Series.unstack for reshape, output get MultiIndex:

    first = df.columns[0]
    
    #repeat only ROOM data in first column
    df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
    #create helper columns
    df['group'] = df.index % 4
    #new columns names
    d = {0:'name', 1:'project', 2:'project title', 3: 'time'}
    
    
    df1 = (df.set_index([first, 'group'])
             .rename(columns = lambda x: pd.to_datetime(x, format='%d.%m.%Y'))
             .stack()
             .unstack(1)
             .rename(columns=d)
             .swaplevel(1,0)
             .sort_index()
             .rename_axis(index=['date', 'room'], columns=None)
            )
    

    print (df1)
                                               name   project  project title  \
    date       room                                                            
    2018-01-01 ROOM 01 (Afternoon)     Alan Smithee  Project#  Project Title   
               ROOM 01 (Morning)           John Doe  Project#  Project Title   
               ROOM 02 (Afternoon)        Doris Day  Project#  Project Title   
               ROOM 02 (Morning)          Jimmy Doe  Project#  Project Title   
    2018-08-02 ROOM 01 (Afternoon)    Susi  Smithee  Project#  Project Title   
               ROOM 01 (Morning)           Jane Doe  Project#  Project Title   
               ROOM 02 (Afternoon)  Teddy Kaczinsky  Project#  Project Title   
               ROOM 02 (Morning)         Duffy Duck  Project#  Project Title   
    2018-08-03 ROOM 01 (Afternoon)   Donald Smithee  Project#  Project Title   
               ROOM 01 (Morning)          Donny Doe  Project#  Project Title   
               ROOM 02 (Afternoon)          Ru Paul  Project#  Project Title   
               ROOM 02 (Morning)       Benny Blanco  Project#  Project Title   
    
                                           time  
    date       room                              
    2018-01-01 ROOM 01 (Afternoon)  17:30-21.00  
               ROOM 01 (Morning)     9:30-17.00  
               ROOM 02 (Afternoon)  17:30-21.00  
               ROOM 02 (Morning)     9:30-17.00  
    2018-08-02 ROOM 01 (Afternoon)  13.15-16.00  
               ROOM 01 (Morning)        8-13.00  
               ROOM 02 (Afternoon)  13.15-16.00  
               ROOM 02 (Morning)        8-13.00  
    2018-08-03 ROOM 01 (Afternoon)  14.15-16.00  
               ROOM 01 (Morning)       12-14.00  
               ROOM 02 (Afternoon)  14.15-16.00  
               ROOM 02 (Morning)       12-14.00  
    

    EDIT: Error means there are some ROOM values duplicated, so need handle it by GroupBy.cumcount for new level in MultiIndex:

    print (df)
                       2018     01.01.2018       02.08.2018      03.08.2018
    0     ROOM 01 (Morning)       John Doe         Jane Doe       Donny Doe
    1             Very Nice       Project#         Project#        Project#
    2        Good Projector  Project Title    Project Title   Project Title
    3        Telephone 1234     9:30-17.00          8-13.00        12-14.00
    4     ROOM 01 (Morning)   Alan Smithee    Susi  Smithee  Donald Smithee
    5             Very Nice       Project#         Project#        Project#
    6        Good Projector  Project Title    Project Title   Project Title
    7        Telephone 1234    17:30-21.00      13.15-16.00     14.15-16.00
    8     ROOM 02 (Morning)      Jimmy Doe       Duffy Duck    Benny Blanco
    9           Not So Nice       Project#         Project#        Project#
    10           Whiteboard  Project Title    Project Title   Project Title
    11       Telephone 5678     9:30-17.00          8-13.00        12-14.00
    12  ROOM 02 (Afternoon)      Doris Day  Teddy Kaczinsky         Ru Paul
    13          Not so Nice       Project#         Project#        Project#
    14           Whiteboard  Project Title    Project Title   Project Title
    15       Telephone 5678    17:30-21.00      13.15-16.00     14.15-16.00
    

    first = df.columns[0]
    
    df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
    df['group'] = df.index % 4
    d = {0:'name', 1:'project', 2:'project title', 3: 'time'}
    
    df1 = (df.set_index([first, 'group'])
             .rename(columns = lambda x: pd.to_datetime(x, format='%d.%m.%Y'))
             .stack()
             .to_frame())
    g = df1.groupby(level=[0,1]).cumcount()
    
    df1 = (df1.set_index(g, append=True)[0]
              .unstack(1)
             .rename(columns=d)
             .swaplevel(1,0)
             .sort_index()
             .rename_axis(index=['date', 'room', 'tmp'], columns=None)
            )
    

    print (df1)
                                                   name   project  project title  \
    date       room                tmp                                             
    2018-01-01 ROOM 01 (Morning)   0           John Doe  Project#  Project Title   
                                   3       Alan Smithee  Project#  Project Title   
               ROOM 02 (Afternoon) 0          Doris Day  Project#  Project Title   
               ROOM 02 (Morning)   0          Jimmy Doe  Project#  Project Title   
    2018-08-02 ROOM 01 (Morning)   1           Jane Doe  Project#  Project Title   
                                   4      Susi  Smithee  Project#  Project Title   
               ROOM 02 (Afternoon) 1    Teddy Kaczinsky  Project#  Project Title   
               ROOM 02 (Morning)   1         Duffy Duck  Project#  Project Title   
    2018-08-03 ROOM 01 (Morning)   2          Donny Doe  Project#  Project Title   
                                   5     Donald Smithee  Project#  Project Title   
               ROOM 02 (Afternoon) 2            Ru Paul  Project#  Project Title   
               ROOM 02 (Morning)   2       Benny Blanco  Project#  Project Title   
    
                                               time  
    date       room                tmp               
    2018-01-01 ROOM 01 (Morning)   0     9:30-17.00  
                                   3    17:30-21.00  
               ROOM 02 (Afternoon) 0    17:30-21.00  
               ROOM 02 (Morning)   0     9:30-17.00  
    2018-08-02 ROOM 01 (Morning)   1        8-13.00  
                                   4    13.15-16.00  
               ROOM 02 (Afternoon) 1    13.15-16.00  
               ROOM 02 (Morning)   1        8-13.00  
    2018-08-03 ROOM 01 (Morning)   2       12-14.00  
                                   5    14.15-16.00  
               ROOM 02 (Afternoon) 2    14.15-16.00  
               ROOM 02 (Morning)   2       12-14.00