dataframe function for-loop if-statement pandas-groupby

Sort and verify many values simultaneously from DataFrame

I have this dataframe containing specific information from many firms. Each row represents the addition of a new public information that can be associated with any of the firms (there are around 4000 in the sample I believe). I have a column called 'company_ID' (specific to each firm) and another one 'date' which contains the dates of publication of the new info specified in the row. I want to verify if for each firm (each 'company_ID'), the dataframe contains 'date' going from 2012 to 2014). In other words, each firm should have information from 2012, 2013 and 2014 respectively in order to then calculate representative statistics.

I tried creating a dictionary, keys being 'company_ID' or groupby, but I find myself stuck every time.

Any idea how to solve this? Thank you!

Solution

Please try to provide a minimal reproducible example next time and the desired output next time

I created a small df with the information you provided, in case someone has a better answer:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Company_ID': np.random.randint(0,4,24),
                   'Year' : np.random.randint(2011,2015,24),
                   'Stock_Price': np.random.rand(24),
                   'Employees' : np.random.randint(700,800,24)})
df.dtypes
df['Company_ID']=df['Company_ID'].astype(str)
df=df.replace({'Company_ID' : { '0' : 'FB', '1' : 'CNN', '2' : 'Utube','3':'CricCroc' }})

So now back to your answer:

First I created a new df with the selected years:

df_select=df[df['Year'].isin([2012,2014])]

Now I am able to apply the groupby per company ID and per year and .agg can give me some statistics using numpy or other functions.

df_select.groupby(['Company_ID','Year']).agg({'Stock_Price' : (np.mean,np.size), 'Employees' : (np.min,np.max)})

___________________________Stock_Price      Employees     
                           mean size      amin amax
    Company_ID Year                                
    CNN        2014    0.422794    2       722  734
    CricCroc   2012    0.308549    1       736  736
    FB         2012    0.822580    2       754  772
    Utube      2012    0.324149    2       713  799
               2014    0.466509    2       733  781

Please note that I used random functions to generate the df. Values may vary when using my code.