I've been doing this in VBA but would really appreciate advice on how to do it pythonically.
I've got an Excel sheet with 8000+ rows and >25 columns. I need to mark certain rows as requiring close review by a human. Basically, for each individual in the 'Name' column, we need to randomly flag 10% of that person's rows as needing review. We don't want to lose/delete/suppress the other rows; they need to remain available--that's why I'm thinking df.sample
wouldn't work. In the actual data, there will be 20-30 unique Names, each with 300-400 rows.
So far I've used pandas to read the data into a dataframe, done some stuff which isn't germane to this question, and written the dataframe back into an Excel file. As part of this, I've also created a 'Random_No.' column in the dataframe, thinking this would be a useful step towards my goal (''based on how I was doing it in VBA)...maybe invoking something like this for each Name.
There's probably a million ways to do this, and I've been fiddling with various approaches, but I'd really appreciate some advice on what you think the most efficient way would be. I seem to be creating a lot of 'helper' dataframes and series and ArrayOfObjects...and in general making everything more complicated than it probably needs to be. Is there a simple way to do this within the dataframe as opposed to making a newbie-Python mess?
Here's a simplified schema of the data; the 'Needs_review' represents the kind of column I'm trying to create--again, 10% for each Name. As always thanks for any advice/direction.
Using side effect
df = pd.DataFrame({
'Name': [f"Name_{i}" for i in np.random.randint(0,10,10000)],
'Col1': np.random.randn(10000)})
need_review = []
df.groupby(['Name']).agg(
lambda x: need_review.extend(
np.random.choice(
x.index, int(0.1*len(x.index)), replace=False).tolist())).unstack()
df['Needs_Review'] = False
df.loc[need_review, 'Needs_Review'] = True
print (df.groupby(['Name', 'Needs_Review'])['Needs_Review'].count())
Output:
Name Needs_Review
Name_0 False 871
True 96
Name_1 False 925
True 102
Name_2 False 895
True 99
Name_3 False 890
True 98
Name_4 False 842
True 93
Name_5 False 932
True 103
Name_6 False 911
True 101
Name_7 False 932
True 103
Name_8 False 909
True 101
Name_9 False 898
True 99