I'm a newbie at Pandas but have a very strong SQL background. I actually know how to do this both in Excel and SQL but would like to see how to do this in Pandas to help me learn. I tried reading ahead in 'Python for Data Analysis' and I'm 1/2 a Coursera course into Pandas but this is well beyond my skill set.
What I would like to see how to do in Pandas is a Year over Year variation of how much time it took us to make certain parts for our customers.
The output would look like this:
Year Shipped_Part_No Shipped_Part_Rev Labor_Hours Gross Margin
Min Max Avg Sdev Min Max Avg Sdev
2013 Widget1 AM 0 3 2.1 1.8 -.18 .1 .18 .18
2014 Widget1 AM 2 7 2.9 1.2 -.1 .2 .28 .11
2015 Widget1 AM 1 2 2.2 1.4 -.13 .3 .81 .12
As far as I can tell I need to set the index to be both ['Shipped_Part_No'] and ['Shipped_Part_Rev'] and then do some grouping and run some aggregate functions.
I'm open to any pointers on how to think about this or display it for better readability - my mind is so SQL focused I'm having a hard time seeing how to use a data-frame to accomplish it.
I have a CSV output file from a SQL query I wrote, the file is here:
https://drive.google.com/open?id=0B4xdnV0LFZI1czBUT19YanVPNFk
Usually I would post some code of what I've at least tried but it all keeps failing, so I think it's unhelpful.
I'm running Anaconda with Python 3.5 on a Windows PC.
Try this:
In [168]: (df.groupby([df.Ship_Date.dt.year,'Shipped_Part_No', 'Shipped_Part_Rev'])
...: [['Labor_Hours','Gross_Margin']]
...: .agg(['min','max','mean','std']))
...:
Out[168]:
Labor_Hours Gross_Margin
min max mean std min max mean std
Ship_Date Shipped_Part_No Shipped_Part_Rev
2013 Widget1 AM 9 68 38.5 41.719300 -0.11 -0.11 -0.11 0.0
2014 Widget1 AM 2 6 4.0 2.828427 -0.11 -0.11 -0.11 0.0
2015 Widget1 AM 1 43 10.6 18.174157 -0.11 -0.11 -0.11 0.0
PS the solution above assumes that the Ship_Date
column is of datetime-like dtype