python pandas dataframe datetime sampling

Extract subset from pandas dataframes ensuring no overlap?


Suppose I have two pandas DataFrames: df with 297232 x 122 dimensions and df_raw with 840380 x 122 dimensions. df is already a subset of df_raw, and both DataFrames use a DateTime index. I would like to sample 70% of the rows from df and 30% of the rows from df_raw (randomly, if needed), while ensuring that the two sampled subsets do not overlap in their indexes.

To be more precise, df_subset should contain 70% of the rows of df, selected at random, and df_raw_subset should contain 30% of the rows of df_raw, selected at random. No row should be sampled into both df_subset and df_raw_subset, i.e. the two subsets should share no DateTime index values.


Solution

  • First, sample from df. Because df is the smaller frame, dropping its sampled rows from the larger df_raw afterwards still leaves enough rows for the second sample (about 208,000 rows removed from 840,380 leaves far more than the roughly 252,000 needed).

    df_sub=df.sample(frac=0.7, replace=False)
    

    Then drop the rows of df_raw whose index labels appear in df_sub, and sample 30% of df_raw's original row count from what remains:

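    # target size: 30% of the original df_raw, computed before dropping anything
    n = int(len(df_raw) * 0.3)
    # remove the rows already taken for df_sub, then sample without replacement
    df_raw_sub = df_raw.drop(df_sub.index).sample(n, replace=False)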
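
To check the approach end to end, here is a minimal sketch on made-up data. The frame sizes, column names, the way df is carved out of df_raw, and the random_state values are all assumptions for illustration, not taken from the question. It also assumes the DateTime index of df_raw is unique; drop removes every row carrying a matching label, so duplicate timestamps would remove more rows than were sampled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: a small df_raw with a unique DateTime index,
# and df built as a subset of its rows (sizes and columns are made up).
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=1000, freq="min")
df_raw = pd.DataFrame(rng.normal(size=(1000, 3)), index=idx, columns=["a", "b", "c"])
df = df_raw.iloc[::3]  # every third row, so df is a subset of df_raw

# Step 1: sample 70% of the rows of the smaller frame.
df_sub = df.sample(frac=0.7, replace=False, random_state=0)

# Step 2: drop those rows from df_raw, then sample 30% of df_raw's
# original row count from the remaining rows.
n = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n, replace=False, random_state=0)

# Sanity check: the two samples share no index labels.
overlap = df_sub.index.intersection(df_raw_sub.index)
assert len(overlap) == 0
print(len(df_sub), len(df_raw_sub), len(overlap))
```

If n ever exceeds the number of rows left after the drop, sample raises a ValueError when replace=False, so it may be worth comparing n against len(df_raw) - len(df_sub) before sampling.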