Suppose I have a dataset which contains labels, filenames, and potentially other columns of metadata. The dataset may have as many as 200,000 examples. I've provided a snippet below that simulates this setup.
import pandas as pd
import numpy as np
import IPython.display as ipd
size = 20000
df = []
rng = np.random.default_rng(0)
for i in range(size):
    l = rng.choice(('cat', 'dog', 'mouse', 'bird', 'horse', 'lion', 'rabbit'))
    fp = str(rng.integers(1e5)).zfill(6) + '.jpg'
    df.append((l, fp))
df = pd.DataFrame(df, columns=['label', 'filepath'])
ipd.display(df)
I would like to efficiently produce N randomly generated pairs of data, with the condition that the dataset is balanced between positive and negative pairs, e.g.,
# df_out would be of size "N"
df_out = pd.DataFrame([], columns=['label_1', 'label_2', 'filepath_1', 'filepath_2'])
Here I am defining a positive pair as one where label_1 equals label_2, and a negative pair as one where the two labels are not equal. So the goal is for df_out to contain roughly 50% positive and 50% negative pairs.
The first approach I tried works by sampling 2N rows from the DataFrame, then collapsing every two consecutive rows into one pair.
N = 20
ii = rng.permutation(np.arange(N*2)%len(df))
func = lambda x: x.dropna().astype(str).str.cat(sep=',')
df_out = df.iloc[ii].reset_index(drop=True) # subsample
df_out = df_out.groupby(df_out.index//2) # collapse every two rows into one row
df_out = df_out.agg(func).reset_index(drop=True) # use `func` to combine rows
for k in df.columns:
    df_out[[f'{k}_1',f'{k}_2']] = df_out[k].str.split(',', expand=True)
    del df_out[k]
This does produce pairs of rows, but it does not control whether a pair is positive or negative.
# as one would expect, this percentage is not equal to 50%
print(sum(df_out.eval('label_1==label_2')) / N)
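To see why the naive pairing is far from 50%, here is a rough sketch of the expected positive rate. It assumes the two rows of a pair are drawn independently, which is only an approximation of the sampling above, and the helper name `expected_positive_rate` is hypothetical. Under that assumption the chance that both rows share a label is the sum of the squared label frequencies.

```python
import numpy as np
import pandas as pd


def expected_positive_rate(labels):
    """Approximate P(label_1 == label_2) for two independently drawn rows."""
    # Relative frequency p_i of each label
    p = pd.Series(labels).value_counts(normalize=True).to_numpy()
    # Both rows carry label i with probability p_i ** 2; sum over all labels
    return float(np.sum(p ** 2))


# Seven equally likely labels, as in the simulated dataset
balanced = ['cat', 'dog', 'mouse', 'bird', 'horse', 'lion', 'rabbit']
rate = expected_positive_rate(balanced)
print(rate)  # close to 1/7, i.e. about 0.143
```

So with seven roughly equally frequent labels, only about one pair in seven would be positive unless the sampling is stratified.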
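Here is an approach that shuffles the data and groups the rows into pairs, either within a label (positive pairs) or across labels (negative pairs). It then pivots each pair into a single row and randomly samples N/2 rows of each kind.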
N = 100 # number of rows to pick (half positive, half negative)
#### positive pairs
# shuffle, then number the rows within each label:
# n2 identifies a pair inside a label, col is the slot (1 or 2) inside that pair
df2 = (df.sample(frac=1)
         .assign(n=lambda d: d.groupby('label').cumcount(),
                 n2=lambda d: d['n'].floordiv(2),
                 col=lambda d: d['n'].mod(2).add(1),
                )
      )
# keep only complete pairs, then spread each pair over one row
positives = (df2[df2.duplicated(['label', 'n2'], keep=False)]
             .reset_index()
             .pivot(index=['n2', 'label'], columns='col', values=['label', 'filepath', 'index'])
             .sample(n=N//2)
             .reset_index(drop=True)
            )
# original row labels used by the positive pairs
positive_idx = positives.pop('index').stack().values
#### negative pairs
# rows sharing the same rank n come from different labels, so pairing them gives negatives
negatives = (
    df.drop(positive_idx) # comment out the "drop" if rows picked above may be reused
      .sample(frac=1)
      .assign(n=lambda d: d.groupby('label').cumcount(),
              g=lambda d: d.groupby('n').cumcount().floordiv(2),
              col=lambda d: d.groupby('n').cumcount().mod(2).add(1),
             )
      .pivot(index=['n', 'g'], columns='col', values=['label', 'filepath'])
      .dropna().sample(n=N//2) # dropna removes incomplete pairs
      .reset_index(drop=True)
)
out = pd.concat({'positives': positives, 'negatives': negatives})
print(out)
Output:
                 label            filepath
col                  1       2           1           2
positives 0       bird    bird  095459.jpg  026617.jpg
          1      horse   horse  062451.jpg  027905.jpg
          2     rabbit  rabbit  067629.jpg  065238.jpg
          3      horse   horse  024818.jpg  026751.jpg
          4        cat     cat  007291.jpg  048994.jpg
...                ...     ...         ...         ...
negatives 45    rabbit     cat  010290.jpg  044769.jpg
          46     mouse    bird  016260.jpg  098423.jpg
          47     mouse   horse  044362.jpg  065754.jpg
          48       dog     cat  085628.jpg  058504.jpg
          49     horse    bird  061706.jpg  025309.jpg

[100 rows x 4 columns]
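If rows may be reused across pairs, a different technique that avoids the pivot is to draw index pairs directly with NumPy. The following is only a sketch under stated assumptions, not the method used above: the function name `sample_balanced_pairs` is hypothetical, labels must not contain missing values, at least two distinct labels must exist, and at least one label must have two or more rows. Unlike the pivot approach, the same row can appear in several pairs.

```python
import numpy as np
import pandas as pd


def sample_balanced_pairs(df, n_pairs, rng, label_col="label"):
    """Draw n_pairs row pairs: half with equal labels, half with different labels."""
    n = len(df)
    # Sort row positions by label so each label occupies one contiguous block
    codes, _ = pd.factorize(df[label_col])
    order = np.argsort(codes, kind="stable")
    sorted_codes = codes[order]
    sizes = np.bincount(codes)            # number of rows per label
    starts = np.cumsum(sizes) - sizes     # first sorted position of each label

    n_pos = n_pairs // 2
    n_neg = n_pairs - n_pos

    # Anchor rows, expressed as positions in the sorted order.
    # Positive anchors must belong to a label with at least two rows.
    eligible = np.flatnonzero(sizes[sorted_codes] >= 2)
    a_pos = rng.choice(eligible, size=n_pos)
    a_neg = rng.integers(n, size=n_neg)

    # Positive partner: nonzero cyclic shift inside the anchor's label block
    c = sorted_codes[a_pos]
    shift = rng.integers(1, sizes[c])     # 1 .. size - 1, so never the anchor itself
    b_pos = starts[c] + (a_pos - starts[c] + shift) % sizes[c]

    # Negative partner: uniform draw among the n - size rows outside the block
    c = sorted_codes[a_neg]
    k = rng.integers(0, n - sizes[c])
    b_neg = np.where(k >= starts[c], k + sizes[c], k)  # skip over the anchor's block

    first = order[np.concatenate([a_pos, a_neg])]
    second = order[np.concatenate([b_pos, b_neg])]
    left = df.iloc[first].reset_index(drop=True).add_suffix("_1")
    right = df.iloc[second].reset_index(drop=True).add_suffix("_2")
    return pd.concat([left, right], axis=1)


# Small demo on simulated data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "label": rng.choice(["cat", "dog", "mouse", "bird"], size=1000),
    "filepath": [f"{i:06d}.jpg" for i in range(1000)],
})
pairs = sample_balanced_pairs(demo, 100, rng)
print((pairs["label_1"] == pairs["label_2"]).mean())
```

The first half of the returned rows are positive pairs and the second half negative, so shuffle the result (for example with `pairs.sample(frac=1)`) if the order matters.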