Search code examples
pythonpandascsvdatemedian

splitting a data frame by date and computing median for all rows with each date


I am attempting to roughly estimate that amount of work than could have been done by staff for a given month.

I've got a csv that looks roughly like this (although it's a lot bigger):

+--------+-------+---------------+
|  Date  | Name  | Units of Work |
+--------+-------+---------------+
| 1/1/17 | Bob   |           450 |
| 2/1/17 | Alice |           300 |
| 2/1/17 | Bob   |           450 |
| 2/1/17 | Larry |            50 |
| 3/1/17 | Alice |           400 |
| 3/1/17 | Bob   |            11 |
| 3/1/17 | Larry |           100 |
| 4/1/17 | Alice |          1000 |
| 4/1/17 | Bob   |           240 |
| 4/1/17 | Larry |            33 |
+--------+-------+---------------+

I want to:

  1. Compute the median 'Units of work' for each 'Date'
  2. Determine if any 'Name' did less than 20% of the median 'Units of work' for that 'Date'
  3. Remove 'Name' if it did fewer than 20% of the median
  4. Multiply the count of 'Name' left for the 'Date' by the median 'Units of work' for that 'Date'
  5. Output a new csv that has each 'Date', appearing only once, on its own row, and that date's median 'Units of work' multiplied by the remaining 'Name' for that 'Date'

I can't even meet requirement 1, much less 2 through 5. I'm getting one file for each date. And instead of a column with the median, I get a new column called 'NewColumn' filled with the word 'median', like so:

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('date_trunc').assign(NewColumn='median')
df.median(axis=None, skipna=None, level=None, numeric_only=None)
for i, g in df.groupby('date_trunc'):
    g.to_csv('{}.csv'.format(i), header=True, index_label=False, index=False)
    +---------+-------+---------------+-----------+
    |  Date   | Name  | Units of work | NewColumn |
    +---------+-------+---------------+-----------+
    | 12/1/16 | Alice |          6222 | median    |
    | 12/1/16 | Bob   |         14530 | median    |
    | 12/1/16 | Larry |         16887 | median    |
    +---------+-------+---------------+-----------+

I know I'm probably doing a lot wrong here, but I'd really appreciate some guidance.

What I want to end up with is a single csv with this:

+---------+--------+
|  Date   | Median |
+---------+--------+
| 12/1/16 |   1110 |
| 1/1/17  |   1400 |
| 2/1/17  |   1200 |
+---------+--------+

Solution

  • I hope the following steps get you closer to your desired CSV output.

    First, here's a clean rendition of the input DataFrame for anyone else looking to copy-paste into pd.read_clipboard():

         Date     Name     Units of Work
    0   Jan-17    Bob               450.0
    1   Feb-17    Alice             300.0
    2   Feb-17    Bob               450.0
    3   Feb-17    Larry              50.0
    4   Mar-17    Alice             400.0
    5   Mar-17    Bob                11.0
    6   Mar-17    Larry             100.0
    7   Apr-17    Alice            1000.0
    8   Apr-17    Bob               240.0
    9   Apr-17    Larry              33.0
    

    0. Cast dates to python datetime (for sensible sort order)

    # Docs on Python datetime format strings: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
    df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: x.strip()), format='%b-%y')
    

    1. For each date, find the median units of work

    meds = df.groupby('Date')[['Units of Work']].median()
    meds
        Units of Work
    Date    
    2017-01-01  450.0
    2017-02-01  300.0
    2017-03-01  100.0
    2017-04-01  240.0
    

    2, 3. Drop rows where units of work < 20% of that date's median units of work

    # Set an index on which to merge the medians
    df2 = df.set_index('Date')
    # Pandas is smart enough to merge the 4-row meds DataFrame onto the 10-row df2 DataFrame based on matching index values
    df2['Median'] = meds 
    
    # Build a boolean mask to pick out "hard workers" and "slackers"
    mask = df2['Units of Work'] >= 0.2 * df2['Median']
    
    # "Hard workers," where units of work >= 20% of that date's median
    df2[mask]
                   Name  Units of Work  Median
    Date                                      
    2017-01-01   Bob             450.0   450.0
    2017-02-01   Alice           300.0   300.0
    2017-02-01   Bob             450.0   300.0
    2017-03-01   Alice           400.0   100.0
    2017-03-01   Larry           100.0   100.0
    2017-04-01   Alice          1000.0   240.0
    2017-04-01   Bob             240.0   240.0
    
    # Bonus: "slackers," where units of work < 20% of that date's median
    df2[~mask]
                   Name  Units of Work  Median
    Date                                      
    2017-02-01   Larry            50.0   300.0
    2017-03-01   Bob              11.0   100.0
    2017-04-01   Larry            33.0   240.0
    

    4. For each date, multiply the number of "hard workers" by the median units of work

    df2[mask].groupby('Date').size().mul(meds['Units of Work'])
    2017-01-01    450.0
    2017-02-01    600.0
    2017-03-01    200.0
    2017-04-01    480.0