Search code examples
pythonpandasdataframecomparison

Pandas: Simple Analysis of growth (comparatively) and with Fillna


Below is the basic data I'm provided with every month. There are many department related files I get and the job gets very monotonous and repetitive.

Month,year,sales,  
January,2017,34400,  
February,2017,35530,  
March,2017,34920,  
April,2017,35950,  
May,2017,36230,  
June,2017,36820,  
July,2017,34590,  
August,2017,36500,  
September,2017,36600,  
October,2017,37140,  
November,2017,36790,  
December,2017,43500,  
January,2018,34900,  
February,2018,37700,  
March,2018,37900,  
April,2018,38100,  
May,2018,37800,  
June,2018,38500,  
July,2018,39400,  
August,2018,39700,  
September,2018,39980,  
October,2018,40600,  
November,2018,39100,  
December,2018,46600,  
January,2019,42500,  

I've tried to use certain functions like value_count(sadly, giving only summary) in order to achieve this output. And failed. (See output below.)

I need to autofill the 3rd and 4th columns (with fillna=True/False)

  • the third column is just telling if it is P/L compared to previous month (like if April is greater than March, then it is Profit.)
  • The fourth column is showing the sequence of P/L achieved, i.e. 2 months or 5months profit(/loss) in a row. (I mean continuously, as it results in certain awards/recognition for teams.)
  • The fifth column is the max sales achieved in the last 'n' number of months.

They only allow Apache OpenOffice for our job, and hence no Excel. But we have the permission by IT to install Python.

The solution in this Link is not helping me as they are grouping-by two columns. The columns in my output are inter-dependent.

import pandas as pd
df = pd.read_csv("Test_1.csv", "a")
df['comparative_position'] = df['sales'].diff().fillna=True
df.loc[df['comparative_position'] > 0.0, 'comparative_position'] = "Profit" 
df.loc[df['comparative_position'] < 0.0, 'comparative_position'] = "Loss" 

Month,Year,Sales,comparative_position,Months_in_P(or)L,Highest_in_12Months  
January,2016,34400,NaN,NaN,NaN  
February,2016,35530,Profit,1,NaN  
March,2016,34920,Loss,1,NaN  
April,2016,35950,Profit,1,NaN  
May,2016,36230,Profit,2,NaN  
June,2016,36820,Profit,3,NaN  
July,2016,34590,Loss,1,NaN  
August,2016,36500,Profit,1,NaN  
September,2016,36600,Profit,2,NaN  
October,2016,37140,Profit,3,NaN  
November,2016,36790,Loss,1,NaN  
December,2016,43500,Profit,1,43500  
January,2017,34900,Loss,1,43500  
February,2017,37700,Profit,1,43500  
March,2017,37900,Profit,2,43500  
April,2017,38100,Profit,3,43500  
May,2017,37800,Loss,1,43500  
June,2017,38500,Profit,1,43500  
July,2017,39400,Profit,2,43500  
August,2017,39700,Profit,3,43500  
September,2017,39980,Profit,4,43500  
October,2017,40600,Profit,5,43500  
November,2017,39100,Loss,1,43500  
December,2017,46600,Profit,1,46600  
January,2018,42500,Loss,1,46600 

Solution

  • AFAIU this should work for you:

    # Get difference from previous as True / False
    df['P/L'] = df.sales > df.sales.shift()
    # Add column counting 'streaks' of P or L
    df['streak'] = df['P/L'].groupby(df['P/L'].ne(df['P/L'].shift()).cumsum()).cumcount()
    # map True/False to string of Profit/Loss
    df['P/L'] = df['P/L'].map({True:'Profit', False:'Loss'})
    # max of last n months where n is 12, as in your example, you can change it to any int
    df['12_max'] = df.sales.rolling(12).max()
    

    Output:

        Month      year     sales       P/L  streak  12_max
    0   January     2017    34400       False   0   NaN
    1   February    2017    35530       True    0   NaN
    2   March       2017    34920       False   0   NaN
    3   April       2017    35950       True    0   NaN
    4   May         2017    36230       True    1   NaN
    5   June        2017    36820       True    2   NaN
    6   July        2017    34590       False   0   NaN
    7   August      2017    36500       True    0   NaN
    8   September   2017    36600       True    1   NaN
    9   October     2017    37140       True    2   NaN
    10  November    2017    36790       False   0   NaN
    11  December    2017    43500       True    0   43500.0
    12  January     2018    34900       False   0   43500.0
    13  February    2018    37700       True    0   43500.0
    14  March       2018    37900       True    1   43500.0
    15  April       2018    38100       True    2   43500.0
    16  May         2018    37800       False   0   43500.0
    17  June        2018    38500       True    0   43500.0
    18  July        2018    39400       True    1   43500.0
    19  August      2018    39700       True    2   43500.0
    20  September   2018    39980       True    3   43500.0
    21  October     2018    40600       True    4   43500.0
    22  November    2018    39100       False   0   43500.0
    23  December    2018    46600       True    0   46600.0
    24  January     2019    42500       False   0   46600.0