python, pandas, sample

How to randomly sample by the values of a column in pandas


I know how to randomly sample a few rows from a pandas data frame using the sample method:

df_sample = df.sample(n=10)

However, what I need is to sample randomly by the values of a column (i.e. Village) in the data frame below.

Dummy Data:

For example: I want to randomly select 3 villages, say Village A, B and C, and get the entire data (all rows) for those 3 villages as output.

Likewise: [image: example of the expected output]

Here is my code

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.read_excel("/home/Study.xlsx")
>>> df=df.sample(n=3)
>>> df
    Sr.No  ...  Village
16     17  ...        I
33     34  ...        Q
36     37  ...        S

So if villages I, Q and S are randomly selected, then I need the entire data for those 3 villages.

Thanks.


Solution

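  • Use numpy.random.choice on the unique values to pick 3 random villages, then filter with Series.isin and boolean indexing. Pass replace=False, because the default samples with replacement and can draw the same village twice:

    vil = np.random.choice(df['Village'].unique(), 3, replace=False)
    df = df[df['Village'].isin(vil)]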
    

    A pandas-only solution with Series.drop_duplicates and Series.sample (which samples without replacement by default):

    vil = df['Village'].drop_duplicates().sample(3)
    df = df[df['Village'].isin(vil)]
    

    As a reusable function:

    def random_vil(df, x):
        # pick x distinct villages, then keep every row that belongs to them
        vil = np.random.choice(df['Village'].unique(), x, replace=False)
        return df[df['Village'].isin(vil)]

    df = random_vil(df, 3)
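For anyone who wants to try this without the Excel file, here is a minimal self-contained sketch of the pandas-only approach. The small data frame is made up to stand in for the one in the question, and the helper name sample_villages and its seed parameter are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Made-up stand-in for the spreadsheet in the question (assumption)
df = pd.DataFrame({
    "Sr.No": range(1, 13),
    "Village": list("AAABBBCCDDEE"),
})

def sample_villages(frame, n, seed=None):
    """Return every row of n randomly chosen distinct villages.

    Hypothetical helper: seed is passed to Series.sample as random_state
    so that a run can be repeated.
    """
    # One entry per village, then draw n of them without replacement
    chosen = frame["Village"].drop_duplicates().sample(n=n, random_state=seed)
    # Keep all rows whose village is among the chosen ones
    return frame[frame["Village"].isin(chosen)]

subset = sample_villages(df, 3, seed=0)
print(subset)
```

Passing the same seed should give the same villages on each run; leaving it as None draws a fresh selection every time.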