Search code examples
pythonpandasnumpygpustring-comparison

String comparison on elements in large arrays (> 11 mill)


Trying to compare different environments on id's (string). The largest array is 11,5 mill rows.

What I want to do is to combine all id's from the different arrays and then tell what environment each id is in

What I'm trying to create:

id (all ids of all arrays(df) prod preprod test dev
first string True True True False
second string False True False True
21gdfwg23fge True True False False
adf23 dfg a2-5 True False True True
Example 1
%%time
dfs = [prod,preprod,test,dev]
arrs = [np.array(df).flatten() for df in dfs]
all_ids = np.unique([item for sublist in arrs for item in sublist])
n_ids = len(all_ids)
n_arrs = len(arrs)
result = np.zeros((n_ids, n_arrs))
for i in range(n_arrs):
    result[:, i] = [1 if all_ids[j] in arrs[i] else 0 for j in range(n_ids)]
result_df = pd.DataFrame(result, columns=['dev', 'test', 'preprod', 'prod'], index = all_ids)
print(result_df)

Example 2: (easier to read)

final = pd.concat([prod, preprod, dev, test]).drop_duplicates()
final = final.dropna()

all = final

prod_num = prod.to_numpy()
preprod_num = prod.to_numpy()
dev_num = dev.to_numpy()
test_num = test.to_numpy()

prod_isin = np.isin(all,prod_num)
preprod_isin = np.isin(all,preprod_num)
dev_isin = np.isin(all, dev_num)
test_isin = np.isin(all, test_num)

final["prod"] = prod_isin
final["preprod"] = preprod_isin
final["dev"] = dev_isin
final["test"] = test_isin
print(final)

The code works but it's all too slow.

I've also tried GPU (without getting it to work (got an Nvidia 1080 TI)). Think what I should do is to sort the "main" list and then use some kind of sorthing algorithm to make the match.

All help will be appriciated


Solution

  • You can use pandas concatenation with index matching, it's very fast.

    prod = "aa|b|c|d|e".split('|')
    preprod = "c|d|e".split('|')
    test = "b|d|e|f".split('|')
    dev = "aa|e|g".split('|')
    
    df = pd.concat([
        pd.DataFrame({'prod': 1}, index=np.unique(prod)),
        pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
        pd.DataFrame({'test': 1}, index=np.unique(test)),
        pd.DataFrame({'dev': 1}, index=np.unique(dev))
    ], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
    print(df)
    
    >>> 
       id  prod  preprod  test  dev
    0  aa   1.0      0.0   0.0  1.0
    1   b   1.0      0.0   1.0  0.0
    2   c   1.0      1.0   0.0  0.0
    3   d   1.0      1.0   1.0  0.0
    4   e   1.0      1.0   1.0  1.0
    5   f   0.0      0.0   1.0  0.0
    6   g   0.0      0.0   0.0  1.0
    
    

    and for speed;

    prod = np.random.randint(10000000, size=10000000).astype(str)
    preprod = np.random.randint(10000000, size=1000000).astype(str)
    test = np.random.randint(10000000, size=1000000).astype(str)
    dev = np.random.randint(10000000, size=100000).astype(str)
    
    
    %%time
    df = pd.concat([
        pd.DataFrame({'prod': 1}, index=np.unique(prod)),
        pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
        pd.DataFrame({'test': 1}, index=np.unique(test)),
        pd.DataFrame({'dev': 1}, index=np.unique(dev))
    ], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
    
    >>> Wall time: 32.3 s
    
    

    on my humble laptop.