Search code examples
pythonpandasanalytics

Python Pandas Dataframe year over year comparison


I'm a newbie at Pandas but have a very strong SQL background. I actually know how to do this both in Excel and SQL but would like to see how to do this in Pandas to help me learn. I tried reading ahead in 'Python for Data Analysis' and I'm 1/2 a Coursera course into Pandas but this is well beyond my skill set.

What I would like to see how to do in Pandas is a Year over Year variation of how much time it took us to make certain parts for our customers.

The output would look like this:

Year  Shipped_Part_No  Shipped_Part_Rev  Labor_Hours       Gross Margin
                                        Min Max Avg Sdev   Min Max Avg Sdev
2013  Widget1           AM              0   3   2.1  1.8  -.18  .1  .18  .18
2014  Widget1           AM              2   7   2.9  1.2  -.1   .2  .28  .11
2015  Widget1           AM              1   2   2.2  1.4  -.13  .3  .81  .12

As far as I can tell I need to set the index to be both ['Shipped_Part_No'] and ['Shipped_Part_Rev'] and then do some grouping and run some aggregate functions.

I'm open to any pointers on how to think about this or display it for better readability - my mind is so SQL focused I'm having a hard time seeing how to use a data-frame to accomplish it.

I have a CSV output file from a SQL query I wrote, the file is here:

https://drive.google.com/open?id=0B4xdnV0LFZI1czBUT19YanVPNFk

Usually I would post some code of what I've at least tried but it all keeps failing, so I think it's unhelpful.

I'm running Anaconda with Python 3.5 on a Windows PC.


Solution

  • Try this:

    In [168]: (df.groupby([df.Ship_Date.dt.year,'Shipped_Part_No', 'Shipped_Part_Rev'])
         ...:     [['Labor_Hours','Gross_Margin']]
         ...:    .agg(['min','max','mean','std']))
         ...:
    Out[168]:
                                               Labor_Hours                      Gross_Margin
                                                       min max  mean        std          min   max  mean  std
    Ship_Date Shipped_Part_No Shipped_Part_Rev
    2013      Widget1         AM                         9  68  38.5  41.719300        -0.11 -0.11 -0.11  0.0
    2014      Widget1         AM                         2   6   4.0   2.828427        -0.11 -0.11 -0.11  0.0
    2015      Widget1         AM                         1  43  10.6  18.174157        -0.11 -0.11 -0.11  0.0
    

    PS the solution above assumes that the Ship_Date column is of datetime-like dtype