I have a very large Arrow dataset (181 GB, 30 million rows) from the Hugging Face datasets library I've been using. I want to randomly sample 100 rows with replacement, 20 times, but after looking around I cannot find a clear way to do this. I've tried converting to a pd.DataFrame so that I can use df.sample(), but Python crashes every time (presumably because the full dataset doesn't fit in memory). So I'm looking for something built into pyarrow.
from datasets import Dataset

ds = Dataset.from_file("embeddings_job/combined_embeddings_small/data-00000-of-00001.arrow")
df = ds.to_pandas()  # crashes at this line: materialises all 30 million rows in memory
random_sample = df.sample(n=100)
Some ideas I've tried. I'm not sure whether this first one samples with replacement:
import numpy as np

# Draw 100 random row indices
random_indices = np.random.randint(0, len(ds), size=100)
# Take the sampled rows from the dataset
sampled_dataset = ds.select(random_indices)
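np.random.randint draws each index independently, so duplicates can occur, which I believe makes this sampling with replacement. A rough sketch of how I imagine repeating the draw 20 times with a seeded generator (the seed and loop structure are just illustrative):

import numpy as np
from datasets import Dataset

ds = Dataset.from_file("embeddings_job/combined_embeddings_small/data-00000-of-00001.arrow")
rng = np.random.default_rng(42)  # arbitrary seed, just for reproducibility

samples = []
for _ in range(20):
    # 100 indices drawn independently, so the same row can be picked more than once
    idx = rng.integers(0, len(ds), size=100)
    # select() should only build an indices mapping over the memory-mapped table,
    # so the full 181 GB file is never loaded into RAM
    samples.append(ds.select(idx))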
Using the Hugging Face shuffle() method:
sample_size = 100
# Shuffle the dataset
shuffled_dataset = ds.shuffle()
# Select the first 100 rows of the shuffled dataset
sampled_dataset = shuffled_dataset.select(range(sample_size))
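If I went this route, I assume I would need to reshuffle for each of the 20 samples, e.g. with a different seed each time (seeds are arbitrary); note that each individual sample would then be drawn without replacement, since shuffling only permutes the rows:

# ds is the Dataset loaded with Dataset.from_file(...) above
samples = [ds.shuffle(seed=i).select(range(100)) for i in range(20)]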
Or is the only other way through terminal commands? Would this be correct for 20 samples of 100 rows each (shuf -r repeats lines, i.e. samples with replacement):
for i in {1..20}; do shuf -r -n 100 file > sampled_$i.txt; done
After getting each sample, the plan is to run it through a random forest algorithm. What is the best way to go about this?
Also, whatever solution I use should preserve the original row indices in the output subset (i.e., the indices should not get reset).
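To make the end goal concrete, this is roughly the pipeline I have in mind, sketched with scikit-learn; "embedding" and "label" are made-up column names standing in for whatever the real features and target are, and each 100-row sample is small enough to convert to pandas safely:

import numpy as np
from datasets import Dataset
from sklearn.ensemble import RandomForestClassifier

ds = Dataset.from_file("embeddings_job/combined_embeddings_small/data-00000-of-00001.arrow")
rng = np.random.default_rng(0)  # arbitrary seed

for i in range(20):
    idx = rng.integers(0, len(ds), size=100)  # 100 indices, drawn with replacement
    sample = ds.select(idx).to_pandas()       # only 100 rows, so this fits in memory
    sample.index = idx                        # keep the original row indices
    # placeholder column names -- adapt to the real schema
    X = np.stack(sample["embedding"].to_numpy())
    y = sample["label"].to_numpy()
    model = RandomForestClassifier().fit(X, y)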
A bit late, but I just had to write a function to randomly sample a pyarrow Table. It produces the sample directly from the Table, without converting to a pandas DataFrame.
import random
import pyarrow as pa

def sample_table(table: pa.Table, n_sample_rows: int = None) -> pa.Table:
    if n_sample_rows is None or n_sample_rows >= table.num_rows:
        return table
    # random.sample draws unique indices, i.e. this samples without replacement
    indices = random.sample(range(table.num_rows), k=n_sample_rows)
    return table.take(indices)
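Since the question asks for sampling with replacement, here is a sketch of a variant of the same idea, swapping random.sample for random.choices so that duplicate indices are allowed:

import random
import pyarrow as pa

def sample_table_with_replacement(table: pa.Table, n_sample_rows: int) -> pa.Table:
    # random.choices draws each index independently, so rows can repeat
    indices = random.choices(range(table.num_rows), k=n_sample_rows)
    return table.take(indices)

For 20 samples of 100 rows each, something like [sample_table_with_replacement(table, 100) for _ in range(20)] should work; take() accepts duplicate indices without complaint.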