I have this dataframe containing specific information from many firms. Each row represents the addition of a new public information that can be associated with any of the firms (there are around 4000 in the sample I believe). I have a column called 'company_ID' (specific to each firm) and another one 'date' which contains the dates of publication of the new info specified in the row. I want to verify if for each firm (each 'company_ID'), the dataframe contains 'date' going from 2012 to 2014). In other words, each firm should have information from 2012, 2013 and 2014 respectively in order to then calculate representative statistics.
I tried creating a dictionary, keys being 'company_ID' or groupby, but I find myself stuck every time.
Any idea how to solve this? Thank you!
Please try to provide a minimal reproducible example next time and the desired output next time
I created a small df with the information you provided, in case someone has a better answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Company_ID': np.random.randint(0,4,24),
'Year' : np.random.randint(2011,2015,24),
'Stock_Price': np.random.rand(24),
'Employees' : np.random.randint(700,800,24)})
df.dtypes
df['Company_ID']=df['Company_ID'].astype(str)
df=df.replace({'Company_ID' : { '0' : 'FB', '1' : 'CNN', '2' : 'Utube','3':'CricCroc' }})
So now back to your answer:
First I created a new df with the selected years:
df_select=df[df['Year'].isin([2012,2014])]
Now I am able to apply the groupby
per company ID and per year and .agg
can give me some statistics using numpy or other functions.
df_select.groupby(['Company_ID','Year']).agg({'Stock_Price' : (np.mean,np.size), 'Employees' : (np.min,np.max)})
___________________________Stock_Price Employees
mean size amin amax
Company_ID Year
CNN 2014 0.422794 2 722 734
CricCroc 2012 0.308549 1 736 736
FB 2012 0.822580 2 754 772
Utube 2012 0.324149 2 713 799
2014 0.466509 2 733 781
Please note that I used random functions to generate the df. Values may vary when using my code.