I am attempting to roughly estimate that amount of work than could have been done by staff for a given month.
I've got a csv that looks roughly like this (although it's a lot bigger):
+--------+-------+---------------+
| Date | Name | Units of Work |
+--------+-------+---------------+
| 1/1/17 | Bob | 450 |
| 2/1/17 | Alice | 300 |
| 2/1/17 | Bob | 450 |
| 2/1/17 | Larry | 50 |
| 3/1/17 | Alice | 400 |
| 3/1/17 | Bob | 11 |
| 3/1/17 | Larry | 100 |
| 4/1/17 | Alice | 1000 |
| 4/1/17 | Bob | 240 |
| 4/1/17 | Larry | 33 |
+--------+-------+---------------+
I want to:
I can't even meet requirement 1, much less 2 through 5. I'm getting one file for each date. And instead of a column with the median, I get a new column called 'NewColumn' filled with the word 'median', like so:
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('date_trunc').assign(NewColumn='median')
df.median(axis=None, skipna=None, level=None, numeric_only=None)
for i, g in df.groupby('date_trunc'):
g.to_csv('{}.csv'.format(i), header=True, index_label=False, index=False)
+---------+-------+---------------+-----------+
| Date | Name | Units of work | NewColumn |
+---------+-------+---------------+-----------+
| 12/1/16 | Alice | 6222 | median |
| 12/1/16 | Bob | 14530 | median |
| 12/1/16 | Larry | 16887 | median |
+---------+-------+---------------+-----------+
I know I'm probably doing a lot wrong here, but I'd really appreciate some guidance.
What I want to end up with is a single csv with this:
+---------+--------+
| Date | Median |
+---------+--------+
| 12/1/16 | 1110 |
| 1/1/17 | 1400 |
| 2/1/17 | 1200 |
+---------+--------+
I hope the following steps get you closer to your desired CSV output.
First, here's a clean rendition of the input DataFrame for anyone else looking to copy-paste into pd.read_clipboard()
:
Date Name Units of Work
0 Jan-17 Bob 450.0
1 Feb-17 Alice 300.0
2 Feb-17 Bob 450.0
3 Feb-17 Larry 50.0
4 Mar-17 Alice 400.0
5 Mar-17 Bob 11.0
6 Mar-17 Larry 100.0
7 Apr-17 Alice 1000.0
8 Apr-17 Bob 240.0
9 Apr-17 Larry 33.0
# Docs on Python datetime format strings: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: x.strip()), format='%b-%y')
meds = df.groupby('Date')[['Units of Work']].median()
meds
Units of Work
Date
2017-01-01 450.0
2017-02-01 300.0
2017-03-01 100.0
2017-04-01 240.0
# Set an index on which to merge the medians
df2 = df.set_index('Date')
# Pandas is smart enough to merge the 4-row meds DataFrame onto the 10-row df2 DataFrame based on matching index values
df2['Median'] = meds
# Build a boolean mask to pick out "hard workers" and "slackers"
mask = df2['Units of Work'] >= 0.2 * df2['Median']
# "Hard workers," where units of work >= 20% of that date's median
df2[mask]
Name Units of Work Median
Date
2017-01-01 Bob 450.0 450.0
2017-02-01 Alice 300.0 300.0
2017-02-01 Bob 450.0 300.0
2017-03-01 Alice 400.0 100.0
2017-03-01 Larry 100.0 100.0
2017-04-01 Alice 1000.0 240.0
2017-04-01 Bob 240.0 240.0
# Bonus: "slackers," where units of work < 20% of that date's median
df2[~mask]
Name Units of Work Median
Date
2017-02-01 Larry 50.0 300.0
2017-03-01 Bob 11.0 100.0
2017-04-01 Larry 33.0 240.0
df2[mask].groupby('Date').size().mul(meds['Units of Work'])
2017-01-01 450.0
2017-02-01 600.0
2017-03-01 200.0
2017-04-01 480.0