python pandas statistics-bootstrap

How to increase the sample size by bootstrapping in python?


I am trying to fit Poisson regressions on a dataset in Python. After that, I would like to bootstrap the dataset to increase the sample size. But when I use the bootstrap function from SciPy, I get an error that says:

Percentiles must be in the range [0, 100].

Can anyone help me perform the bootstrap on this dataset? Here is my code:

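import numpy as np
import pandas as pd
from scipy.stats import bootstrap

# Convert the whitespace-separated file to CSV, then reload it with explicit column names
df = pd.read_csv('/content/ships.txt', sep=r'\s+')
df.to_csv('/content/ship2.csv', index=None)
data = pd.read_csv('/content/ship2.csv', skiprows=1, sep=',', names=['type', 'construction', 'operation', 'months', 'damage'])
dat = pd.get_dummies(data)  # one-hot encode the categorical columns
data_boot = bootstrap(dat, np.mean, n_resamples=100)  # this call raises the error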

# ValueError: Percentiles must be in the range [0, 100]

Solution

  • As far as I understand, you would like to increase the volume of your data by duplicating a subsample of it. This is usually not the best approach in data science; you should consider an oversampling method such as SMOTE instead. Since your question is about duplication, I suggest sampling your dataset and concatenating the result to the initial DataFrame. Here's the code to do it:

    import pandas as pd

    data1 = pd.read_csv('/content/ship2.csv', skiprows=1, sep=',', names=['type', 'construction', 'operation', 'months', 'damage'])
    data2 = data1.sample(frac=0.1)  # samples 10% of the rows; use "n" instead of "frac" for a fixed number of rows
    data = pd.concat([data1, data2], axis=0, ignore_index=True)  # axis=0 appends the sampled rows below the originals
    data = data.sample(frac=1)  # if you want to shuffle the increased dataset
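
For comparison, here is a minimal sketch of what bootstrapping usually means: drawing rows with replacement to estimate the variability of a statistic, rather than to enlarge the dataset. It assumes SciPy 1.7 or later. The `toy` frame is a hypothetical stand-in with placeholder values, not the real ships data. Note that `scipy.stats.bootstrap` expects its data argument as a sequence of samples, so a single column is wrapped in a one-element tuple, and it returns a confidence interval, not a resampled DataFrame.

```python
import numpy as np
import pandas as pd
from scipy.stats import bootstrap

# Hypothetical toy data; the values are placeholders, not the ships dataset.
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B", "C", "C"],
    "months": [127, 63, 1095, 1095, 1512, 3353],
    "damage": [0, 0, 3, 4, 6, 18],
})

# One bootstrap resample: draw len(toy) rows with replacement.
resample = toy.sample(frac=1, replace=True, random_state=0).reset_index(drop=True)

# Bootstrap confidence interval for the mean of one numeric column.
# The data argument is a sequence of samples, hence the one-element tuple.
result = bootstrap(
    (toy["damage"].to_numpy(),),
    np.mean,
    n_resamples=1000,
    method="percentile",
)
ci_low, ci_high = result.confidence_interval

print(resample.shape)
print(ci_low, ci_high)
```

The resampled frame has the same number of rows as the original, so it can be refit many times to see how much the regression coefficients vary. The interval bounds change from run to run because the resampling inside `bootstrap` is not seeded here.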