python pandas numpy gpu string-comparison

String comparison on elements in large arrays (> 11 mill)

Trying to compare different environments on id's (string). The largest array is 11,5 mill rows.

What I want to do is to combine all id's from the different arrays and then tell what environment each id is in

What I'm trying to create:

id (all ids of all arrays(df)	prod	preprod	test	dev
first string	True	True	True	False
second string	False	True	False	True
21gdfwg23fge	True	True	False	False
adf23 dfg a2-5	True	False	True	True

Example 1

%%time
dfs = [prod,preprod,test,dev]
arrs = [np.array(df).flatten() for df in dfs]
all_ids = np.unique([item for sublist in arrs for item in sublist])
n_ids = len(all_ids)
n_arrs = len(arrs)
result = np.zeros((n_ids, n_arrs))
for i in range(n_arrs):
    result[:, i] = [1 if all_ids[j] in arrs[i] else 0 for j in range(n_ids)]
result_df = pd.DataFrame(result, columns=['dev', 'test', 'preprod', 'prod'], index = all_ids)
print(result_df)

Example 2: (easier to read)

final = pd.concat([prod, preprod, dev, test]).drop_duplicates()
final = final.dropna()

all = final

prod_num = prod.to_numpy()
preprod_num = prod.to_numpy()
dev_num = dev.to_numpy()
test_num = test.to_numpy()

prod_isin = np.isin(all,prod_num)
preprod_isin = np.isin(all,preprod_num)
dev_isin = np.isin(all, dev_num)
test_isin = np.isin(all, test_num)

final["prod"] = prod_isin
final["preprod"] = preprod_isin
final["dev"] = dev_isin
final["test"] = test_isin
print(final)

The code works but it's all too slow.

I've also tried GPU (without getting it to work (got an Nvidia 1080 TI)). Think what I should do is to sort the "main" list and then use some kind of sorthing algorithm to make the match.

All help will be appriciated

Solution

You can use pandas concatenation with index matching, it's very fast.

prod = "aa|b|c|d|e".split('|')
preprod = "c|d|e".split('|')
test = "b|d|e|f".split('|')
dev = "aa|e|g".split('|')

df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
print(df)

>>> 
   id  prod  preprod  test  dev
0  aa   1.0      0.0   0.0  1.0
1   b   1.0      0.0   1.0  0.0
2   c   1.0      1.0   0.0  0.0
3   d   1.0      1.0   1.0  0.0
4   e   1.0      1.0   1.0  1.0
5   f   0.0      0.0   1.0  0.0
6   g   0.0      0.0   0.0  1.0

and for speed;

prod = np.random.randint(10000000, size=10000000).astype(str)
preprod = np.random.randint(10000000, size=1000000).astype(str)
test = np.random.randint(10000000, size=1000000).astype(str)
dev = np.random.randint(10000000, size=100000).astype(str)


%%time
df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})

>>> Wall time: 32.3 s

on my humble laptop.