Trying to compare different environments on id's (string). The largest array is 11,5 mill rows.
What I want to do is to combine all id's from the different arrays and then tell what environment each id is in
What I'm trying to create:
id (all ids of all arrays(df) | prod | preprod | test | dev |
---|---|---|---|---|
first string | True | True | True | False |
second string | False | True | False | True |
21gdfwg23fge | True | True | False | False |
adf23 dfg a2-5 | True | False | True | True |
%%time
dfs = [prod,preprod,test,dev]
arrs = [np.array(df).flatten() for df in dfs]
all_ids = np.unique([item for sublist in arrs for item in sublist])
n_ids = len(all_ids)
n_arrs = len(arrs)
result = np.zeros((n_ids, n_arrs))
for i in range(n_arrs):
result[:, i] = [1 if all_ids[j] in arrs[i] else 0 for j in range(n_ids)]
result_df = pd.DataFrame(result, columns=['dev', 'test', 'preprod', 'prod'], index = all_ids)
print(result_df)
Example 2: (easier to read)
final = pd.concat([prod, preprod, dev, test]).drop_duplicates()
final = final.dropna()
all = final
prod_num = prod.to_numpy()
preprod_num = prod.to_numpy()
dev_num = dev.to_numpy()
test_num = test.to_numpy()
prod_isin = np.isin(all,prod_num)
preprod_isin = np.isin(all,preprod_num)
dev_isin = np.isin(all, dev_num)
test_isin = np.isin(all, test_num)
final["prod"] = prod_isin
final["preprod"] = preprod_isin
final["dev"] = dev_isin
final["test"] = test_isin
print(final)
The code works but it's all too slow.
I've also tried GPU (without getting it to work (got an Nvidia 1080 TI)). Think what I should do is to sort the "main" list and then use some kind of sorthing algorithm to make the match.
All help will be appriciated
You can use pandas concatenation with index matching, it's very fast.
prod = "aa|b|c|d|e".split('|')
preprod = "c|d|e".split('|')
test = "b|d|e|f".split('|')
dev = "aa|e|g".split('|')
df = pd.concat([
pd.DataFrame({'prod': 1}, index=np.unique(prod)),
pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
pd.DataFrame({'test': 1}, index=np.unique(test)),
pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
print(df)
>>>
id prod preprod test dev
0 aa 1.0 0.0 0.0 1.0
1 b 1.0 0.0 1.0 0.0
2 c 1.0 1.0 0.0 0.0
3 d 1.0 1.0 1.0 0.0
4 e 1.0 1.0 1.0 1.0
5 f 0.0 0.0 1.0 0.0
6 g 0.0 0.0 0.0 1.0
and for speed;
prod = np.random.randint(10000000, size=10000000).astype(str)
preprod = np.random.randint(10000000, size=1000000).astype(str)
test = np.random.randint(10000000, size=1000000).astype(str)
dev = np.random.randint(10000000, size=100000).astype(str)
%%time
df = pd.concat([
pd.DataFrame({'prod': 1}, index=np.unique(prod)),
pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
pd.DataFrame({'test': 1}, index=np.unique(test)),
pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
>>> Wall time: 32.3 s
on my humble laptop.