Tags: python, pandas, dataframe, random, sample

Sampling rows with sample size greater than length of DataFrame


I've been asked to generate a new variable from an existing one. Specifically, I need to draw values at random from the original variable (using the random function) so that the new variable has at least 10 times as many observations as the original, and then save the result as a new variable.

This is my dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

The variable I want to work with is area.

This is my attempt, but it gives me a "'module' object is not callable" error:

import pandas as pd
import random as rand

dataFrame = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")

area = dataFrame['area']

random_area = rand(area)

print(random_area)

Solution

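  • You can use the sample function with replace=True. Sampling with replacement is required here because n is larger than the number of rows (df below is the DataFrame from the question):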

    df = df.sample(n=len(df) * 10, replace=True)
    

    Or, to sample only the area column, use:

    area = df.area.sample(n=len(df) * 10, replace=True)
    

    Another option uses np.random.choice (after import numpy as np):

    df = df.iloc[np.random.choice(len(df), len(df) * 10)]
    

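    The idea is to generate random row positions between 0 and len(df) - 1. The first argument, len(df), is the exclusive upper bound: positions are drawn from np.arange(len(df)). The second, len(df) * 10, is the number of positions to generate. np.random.choice samples with replacement by default, so repeats are allowed. We then use the generated positions to index into df.

    If you only want the area column, this is sufficient: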

    area = df.iloc[np.random.choice(len(df), len(df) * 10), df.columns.get_loc('area')]
    

    Index.get_loc converts the "area" label to its integer position, which iloc requires.

    A small demonstration of both approaches:


    df = pd.DataFrame({'A': list('aab'), 'B': list('123')})
    df
       A  B
    0  a  1
    1  a  2
    2  b  3
    
    # Sample 3 times the original size
    df.sample(n=len(df) * 3, replace=True)
    
       A  B
    2  b  3
    1  a  2
    1  a  2
    2  b  3
    1  a  2
    0  a  1
    0  a  1
    2  b  3
    2  b  3
    
    df.iloc[np.random.choice(len(df), len(df) * 3)]
    
       A  B
    0  a  1
    1  a  2
    1  a  2
    0  a  1
    2  b  3
    0  a  1
    0  a  1
    0  a  1
    2  b  3
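
Putting the pieces together for the area column from the question, here is one possible sketch (not the only way to do it). It uses a small made-up Series as a stand-in for dataFrame['area']; the values and the names random_area, random_area_np and random_area_py are illustrative assumptions. The seeds are fixed so the draws are repeatable. It also shows the standard-library route with random.choices, since the original error came from calling the random module itself rather than a function inside it.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical stand-in for dataFrame['area']; the real column comes from forestfires.csv.
area = pd.Series([0.0, 0.0, 0.36, 0.43, 6.57], name="area")
n = len(area) * 10  # ten times as many observations as the original

# 1) pandas: sample with replacement; random_state makes the draw repeatable.
random_area = area.sample(n=n, replace=True, random_state=0).reset_index(drop=True)

# 2) NumPy: draw positions in [0, len(area)) and index by position.
rng = np.random.default_rng(0)
random_area_np = area.iloc[rng.integers(0, len(area), size=n)].reset_index(drop=True)

# 3) Standard library: random is a module, so call a function on it.
#    rand(area) in the question fails because the module itself is not callable.
random.seed(0)
random_area_py = pd.Series(random.choices(area.tolist(), k=n), name="area")

print(len(random_area), len(random_area_np), len(random_area_py))  # 50 50 50
```

With the real data, replace the stand-in with area = dataFrame['area']. The reset_index(drop=True) calls only renumber the sampled rows from 0; drop them if you want to keep the original row labels.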