Search code examples
pythonpandassubstring

seach for substring with minimum characters match pandas


I have 1st dataFrame with column 'X' as :

X
A468593-3
A697269-2
A561044-2
A239882 04

2nd dataFrame with column 'Y' as :

Y
000A561044          
000A872220   

I would like to match the part of substrings from both columns with minimum no. of characters(example 7 chars only alphanumeric to be considered for matching. all special chars to be excluded). so, my output DataFrame should be like this

X
A561044-2
   

Any possible solution would highly appreciate.

Thanks in advance.


Solution

  • IIUC and assuming that the first three values of Y start with 0, you can slice Y by [3:] to remove the first three zero values. Then, you can join these values by |. Finally, you can create your mask using contains that checks whether a series contains a specified value (in your case you would have something like 'A|B' and check whether a value contains 'A' or 'B'). Then, this mask can be used to query your other data frame.

    Code:

    import pandas as pd
    
    df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
    df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
    
    mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
    df1.loc[mask]
    

    Output:

        X
    2   A561044-2
    

    If you have values in Y that do not start with three zeros, you can use this function to reduce your columns and remove all first n zeros.

    def remove_first_numerics(s):
      counter = 0
      while s[counter].isnumeric():
        counter +=1
      return s[counter:]
    
    df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
    df_test["A"].apply(lambda s: remove_first_numerics(s))
    

    Output:

    0     Abd3Dc
    1     Adv3Dc
    2    d31oVgZ
    3      dZb1B
    4    CCcDx10
    Name: A, dtype: object