Tags: python, pandas, dataframe, sample

How to randomly sample and keep only n values of repeating IDs?


I have a data frame that looks like this:

user_id tweet_id tweet
user123 7658j dogs are super
user245 66721 yes dogs are super
user245 6d343 yes cats are also super
<...> <...> <...>
user245 541238 well I developed allergy on cates

When I check the value counts for each user, I get the following results:

id count
user245 456
user123 115
user427 2

I want to subset the data so that I keep all rows for IDs with fewer than 100 rows, and only 100 randomly sampled rows for IDs with more than 100 rows. How can I do this?


Solution

  • You can try:

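    # For each user, sample at most 100 rows; min() keeps users with
    # fewer than 100 rows whole instead of raising an error.
    (df.groupby('user_id', group_keys=False)
       .apply(lambda g: g.sample(n=min(len(g), 100)))
    )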
    

    Example (with n=3):

    import pandas as pd

    df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
    (df.groupby('id', group_keys=False)
       .apply(lambda g: g.sample(n=min(len(g), 3)))
    )
    

    Output (the sampled rows differ between runs, since no seed is set):

       id  col
    0   A    0
    4   A    4
    3   A    3
    7   B    7
    6   B    6
    8   C    8
    12  D   12
    11  D   11
    9   D    9
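
A possible alternative, shown only as a sketch on the same toy data: shuffle all rows once and then keep the first n rows of each group with `groupby(...).head(n)`. This gives the same kind of result (at most n random rows per id) without `groupby.apply`, which recent pandas releases (2.2+) warn about when the applied function operates on the grouping column. The names `n`, `sampled` and `counts`, and the fixed `random_state=0`, are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd

# Same toy frame as in the example above.
df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})

n = 3  # maximum number of rows to keep per id (100 in the question)

# Shuffle every row once, then take the first n rows of each group.
# random_state is fixed here only so that the run is repeatable.
sampled = (df.sample(frac=1, random_state=0)
             .groupby('id')
             .head(n)
             .sort_index())  # restore the original row order

# Rows kept per id: never more than n, and all rows for small groups.
counts = sampled['id'].value_counts()
print(sampled)
print(counts)
```

For the data frame in the question, the same idea would use `'user_id'` as the grouping column and `n = 100`. Dropping `random_state` gives a different sample on each run.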