Search code examples
dataframefunctionfor-loopif-statementpandas-groupby

Sort and verify many values simultaneously from DataFrame


I have this dataframe containing specific information from many firms. Each row represents the addition of a new public information that can be associated with any of the firms (there are around 4000 in the sample I believe). I have a column called 'company_ID' (specific to each firm) and another one 'date' which contains the dates of publication of the new info specified in the row. I want to verify if for each firm (each 'company_ID'), the dataframe contains 'date' going from 2012 to 2014). In other words, each firm should have information from 2012, 2013 and 2014 respectively in order to then calculate representative statistics.

I tried creating a dictionary, keys being 'company_ID' or groupby, but I find myself stuck every time.

Any idea how to solve this? Thank you!


Solution

  • Please try to provide a minimal reproducible example next time and the desired output next time

    I created a small df with the information you provided, in case someone has a better answer:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'Company_ID': np.random.randint(0,4,24),
                       'Year' : np.random.randint(2011,2015,24),
                       'Stock_Price': np.random.rand(24),
                       'Employees' : np.random.randint(700,800,24)})
    df.dtypes
    df['Company_ID']=df['Company_ID'].astype(str)
    df=df.replace({'Company_ID' : { '0' : 'FB', '1' : 'CNN', '2' : 'Utube','3':'CricCroc' }})
    

    So now back to your answer:

    First I created a new df with the selected years:

    df_select=df[df['Year'].isin([2012,2014])]
    

    Now I am able to apply the groupby per company ID and per year and .agg can give me some statistics using numpy or other functions.

    df_select.groupby(['Company_ID','Year']).agg({'Stock_Price' : (np.mean,np.size), 'Employees' : (np.min,np.max)})
    

    ___________________________Stock_Price      Employees     
                               mean size      amin amax
        Company_ID Year                                
        CNN        2014    0.422794    2       722  734
        CricCroc   2012    0.308549    1       736  736
        FB         2012    0.822580    2       754  772
        Utube      2012    0.324149    2       713  799
                   2014    0.466509    2       733  781

    Please note that I used random functions to generate the df. Values may vary when using my code.