Search code examples
pythonpandasdataframesample

Randomly sample from dataframe based on condition without losing data


I've been doing this in VBA but would really appreciate advice on how to do it pythonically. I've got an Excel sheet with 8000+ rows and >25 columns. I need to mark certain rows as requiring close review by a human. Basically, for each individual in the 'Name' column, we need to randomly flag 10% of that person's rows as needing review. We don't want to lose/delete/suppress the other rows; they need to remain available--that's why I'm thinking df.sample wouldn't work. In the actual data, there will be 20-30 unique Names, each with 300-400 rows.

So far I've used pandas to read the data into a dataframe, done some stuff which isn't germane to this question, and written the dataframe back into an Excel file. As part of this, I've also created a 'Random_No.' column in the dataframe, thinking this would be a useful step towards my goal (''based on how I was doing it in VBA)...maybe invoking something like this for each Name.

There's probably a million ways to do this, and I've been fiddling with various approaches, but I'd really appreciate some advice on what you think the most efficient way would be. I seem to be creating a lot of 'helper' dataframes and series and ArrayOfObjects...and in general making everything more complicated than it probably needs to be. Is there a simple way to do this within the dataframe as opposed to making a newbie-Python mess?

Here's a simplified schema of the data; the 'Needs_review' represents the kind of column I'm trying to create--again, 10% for each Name. As always thanks for any advice/direction.

![simplified view of data


Solution

  • Using side effect

    df = pd.DataFrame({
        'Name': [f"Name_{i}" for i in np.random.randint(0,10,10000)],
        'Col1': np.random.randn(10000)})
    
    need_review = []
    df.groupby(['Name']).agg(
        lambda x: need_review.extend(
            np.random.choice(
                x.index, int(0.1*len(x.index)), replace=False).tolist())).unstack()
    df['Needs_Review'] = False
    df.loc[need_review, 'Needs_Review'] = True
    
    print (df.groupby(['Name', 'Needs_Review'])['Needs_Review'].count())
    

    Output:

    Name    Needs_Review
    Name_0  False           871
            True             96
    Name_1  False           925
            True            102
    Name_2  False           895
            True             99
    Name_3  False           890
            True             98
    Name_4  False           842
            True             93
    Name_5  False           932
            True            103
    Name_6  False           911
            True            101
    Name_7  False           932
            True            103
    Name_8  False           909
            True            101
    Name_9  False           898
            True             99