Suppose I have two pandas DataFrames: df, with dimensions 297232 x 122, and df_raw, with dimensions 840380 x 122. df is already a subset of df_raw. Both DataFrames have a DateTime index. I would like to sample 70% of the rows from df and 30% of the rows from df_raw (randomly, if need be), while ensuring that the two sampled subsets do not overlap in terms of their indexes.
To be more precise, df_subset will have 70% of the rows of df, selected at random, and df_raw_subset will have 30% of the rows of df_raw, selected at random, but df_subset and df_raw_subset should not share any sampled rows, i.e. they should have unique DateTime indices.
So first we sample from df. Since it is the smaller frame, dropping its sampled rows from the bigger df_raw afterwards will not leave us with too few data points to sample from.
df_sub = df.sample(frac=0.7, replace=False)
Then we drop the rows of df_raw whose index appears in df_sub, and sample 30% of the original df_raw row count from what is left:
n = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n=n, replace=False)
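To check that the two steps really give non-overlapping samples, here is a small self-contained sketch on toy data. The frame sizes, column names and random_state seeds are made up for illustration, and it assumes the DateTime index has no duplicate timestamps (drop removes every row that carries a dropped label, so duplicates would shrink df_raw by more than len(df_sub)).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real frames: df_raw has 1000 rows
# on a unique DatetimeIndex, and df is a 400-row subset of it.
idx = pd.date_range("2021-01-01", periods=1000, freq="D")
rng = np.random.default_rng(0)
df_raw = pd.DataFrame(rng.normal(size=(1000, 3)), index=idx, columns=["a", "b", "c"])
df = df_raw.sample(n=400, random_state=0)

# Step 1: sample 70% of the smaller frame.
df_sub = df.sample(frac=0.7, replace=False, random_state=1)

# Step 2: remove those timestamps from df_raw, then draw 30% of the
# original df_raw row count from the rows that remain.
n = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n=n, replace=False, random_state=2)

# The two samples should share no index labels.
overlap = df_sub.index.intersection(df_raw_sub.index)
print(len(df_sub), len(df_raw_sub), len(overlap))
```

If the overlap count printed at the end is 0, the two subsets have disjoint indices. Passing random_state is optional; it only makes the draw reproducible.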