I have a large DataFrame (~35 million rows) and I am trying to build a new one by randomly sampling two rows from each unique cluster ID (~1.8 million unique cluster IDs): one row with label 0 and one row with label 1. Sometimes only one label is present, so I must first check whether the cluster contains both. For reference, my dataset has 3 main columns: 'embeddings', 'cluster_ID', 'label'.

This process is taking far longer than I anticipated (more than 15 hours so far), and I want to know whether there is a way to optimize my code.

Please NOTE: my original df only has 2 labels, 1 and 0, and I should end up with more than 1.8 million rows. I want to sample 2 rows per cluster (one with label 1, one with label 0). Some clusters contain rows with both labels, but some contain only rows with label 1. In the former case I want 2 rows from that cluster (one per label); in the latter case I only want one row from that cluster. I should therefore end up with between 1.8 and 3.6 million rows. A toy example is below.
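For illustration (toy data, values made up):

    cluster_ID  label
    A           0
    A           1
    A           1
    B           1
    B           1

Cluster A has both labels, so it should contribute two rows (one random label-0 row and one random label-1 row); cluster B only has label 1, so it should contribute a single random label-1 row. Three rows in total.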
import pandas as pd

# Create an empty list to store the selected rows
selected_rows = []

# Iterate over unique cluster IDs
for cluster_id in result_df_30['cluster_ID'].unique():
    # Filter the DataFrame for the current cluster ID
    df_cluster = result_df_30[result_df_30['cluster_ID'] == cluster_id]

    # Split the cluster into label-0 and label-1 rows
    df_label_0 = df_cluster[df_cluster['label'] == 0]
    df_label_1 = df_cluster[df_cluster['label'] == 1]

    # Sample one row per label, if that label is present
    if not df_label_0.empty:
        sample_label_0 = df_label_0.sample(n=1, random_state=42)
        selected_rows.append(sample_label_0)
    if not df_label_1.empty:
        sample_label_1 = df_label_1.sample(n=1, random_state=42)
        selected_rows.append(sample_label_1)

# Concatenate the selected rows into a single DataFrame
selected_rows_df = pd.concat(selected_rows)
selected_rows_df
Your loop scans all 35 million rows once for every cluster ID, which is roughly 1.8 million full passes over the data; that is where the time goes. You can do the whole thing in one shot instead: shuffle the dataframe once, then keep the first row of each (cluster_ID, label) group:

sample = (
    df.sample(frac=1)
    .groupby(["cluster_ID", "label"], as_index=False)
    .first()
)

Because the grouping is over the (cluster_ID, label) pairs that actually occur, a cluster that only has label 1 automatically contributes a single row, which is exactly what you want.
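If you are on pandas 1.1 or newer, GroupBy.sample expresses the same idea directly (performance on a frame this size may differ from the shuffle-then-first trick, so it is worth timing both):

# one random row per (cluster_ID, label) pair that actually exists;
# clusters with a single label contribute a single row
sample = df.groupby(["cluster_ID", "label"]).sample(n=1, random_state=42)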
A different approach, shown below, extracts row positions with a pivot table and then selects them with iloc. It requires that your dataframe be sequentially indexed (0, 1, ..., n-1), which is easy to ensure with df = df.reset_index(drop=True).
import numpy as np
import pandas as pd

# Some test data
n = 35_000_000
df = pd.DataFrame(
    {
        "embeddings": np.random.rand(n),
        "cluster_ID": np.random.randint(0, 1_800_000, n),
        "label": np.random.randint(0, 2, n),
    }
)
tmp = (
    # you only care about rows with label 0 or 1
    df[df["label"].isin([0, 1])]
    # shuffle the rows; this is where the randomness comes from
    .sample(frac=1)
    # turn the (sequential) index into a column of row positions
    .reset_index()
    # for each (cluster_ID, label) pair, grab the first row position
    .pivot_table(index="cluster_ID", columns="label", values="index", aggfunc="first")
)
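After the pivot, tmp has one row per cluster_ID and one column per label; each cell holds the position of one randomly chosen row, or NaN where the cluster has no rows with that label. Schematically (positions made up):

    label              0         1
    cluster_ID
    0           411233.0   87101.0
    1                NaN   90457.0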
# Every non-NaN cell is the position of a sampled row. Dropping only the NaNs
# keeps clusters that have just one label, which contribute a single row each
idx = tmp.to_numpy().flatten()
idx = idx[~np.isnan(idx)].astype(int)
# and your random sample
sample = df.iloc[idx]
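As a quick check (illustrative, assuming the sample frame built above), every (cluster_ID, label) pair should appear at most once, so each sampled cluster contributes one or two rows:

# each (cluster_ID, label) pair appears exactly once in the sample ...
per_pair = sample.groupby(["cluster_ID", "label"]).size()
assert per_pair.max() == 1
# ... so each sampled cluster contributes one or two rows
assert sample.groupby("cluster_ID").size().isin([1, 2]).all()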