Search code examples
python-3.xstringdataframesearchstring-concatenation

String Search in two different Dataframes


I have a list of strings in a dataframe df1

temp=df1.loc[df1['quartier']=='quartier1', 'Titre']

The output is:

108      T/42322/C/
108      T/37666/C/
108      T/31411/C/
108    T/118487/01/
108    T/118489/01/
           ...     
108    T/108034/01/
108    T/108045/01/
108    T/108047/01/
108    T/108032/01/
108    T/108045/01/

Shortlist=','.join(temp)

T/42322/C/,T/37666/C/,T/31411/C/,....

and I would like to know how many are present in df2:

temp_ilot=df2.loc[df2['quartier']=='quartier1', 'Titre_test']

The output of df2 is:

4665                                                  NaN
4666                                                  NaN
4667                                                  NaN
4668                                                  NaN
4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
                              ...                        
4800                                                  NaN
4804                                                  NaN
4888                                                  NaN
4890                                                  NaN

Any idea would be helpful.

I tried this code but I get a false response when I am sure that the information is there.

temp_ilot.str.contains(Shortlist)

4665      NaN
4666      NaN
4667      NaN
4668      NaN
4669    False
        ...  
4800      NaN
4804      NaN
4888      NaN
4890      NaN
4907      NaN
Name: Titre_Appart, Length: 84, dtype: object

Solution

  • When you use pd.Series.str.contains, you're asking if the series temp_ilot contains that string as is. In your question you ask:

    and I would like to know how many are present in df2

    If you want to see how many are present

    You'll need to treat each part of the string as an individual rather than one single string.

    from io import StringIO
    
    import numpy as np
    import pandas as pd
    
    df = pd.read_csv(
        StringIO("""108      T/42322/C/
        108      T/37666/C/
        108      T/31411/C/
        108    T/118487/01/
        108    T/118489/01/
        108    T/108034/01/
        108    T/108045/01/
        108    T/108047/01/
        108    T/108032/01/
        108    T/108045/01/"""),
        sep=r"\s+",
        index_col=0,
        names=["Titre"],
        header=None,
    )
    
    ser = df.Titre
    
    df2 = pd.read_csv(
        StringIO("""4665                                                  NaN
    4666                                                  NaN
    4667                                                  NaN
    4668                                                  NaN
    4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
    4800                                                  NaN
    4804                                                  NaN
    4888                                                  NaN
    4890                                                  NaN"""),
        sep=r"\s+",
        names=["Titre_test"],
        index_col=[0],
        header=None,
    )
    
    ser2 = df2.Titre_test
    
    # drop NaN, split on comma, write to list, and convert to array
    arr = np.array(ser2.dropna().str.split(",").to_list())
    
    # check if each element is in `arr` 
    # and sum to determine *how many*
    ser.apply(lambda x: np.isin(x, arr)).sum()
    
    # an example
    eg = pd.Series(["T/61701/01/", "abc", "test"])
    
    out = eg.apply(lambda x: np.isin(x, arr))
    print(out)
    # print(out.sum())  # >>> 1
    
    0     True
    1    False
    2    False
    dtype: bool
    

    If you want to see if any are present

    You could re-use your code but join with a pipe instead of a comma. This allows the regex "or" operator to be used.

    from io import StringIO
    
    import pandas as pd
    
    df = pd.read_csv(
        StringIO("""108      T/42322/C/
        108      T/37666/C/
        108      T/31411/C/
        108    T/118487/01/
        108    T/118489/01/
        108    T/108034/01/
        108    T/108045/01/
        108    T/108047/01/
        108    T/108032/01/
        108    T/108045/01/"""),
        sep=r"\s+",
        index_col=0,
        names=["Titre"],
        header=None,
    )
    
    ser = df.Titre
    
    # use a pipe (|) instead of a comma for regex "or"
    shortlist = "|".join(ser)
    
    df2 = pd.read_csv(
        StringIO("""4665                                                  NaN
    4666                                                  NaN
    4667                                                  NaN
    4668                                                  NaN
    4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
    4800                                                  NaN
    4804                                                  NaN
    4888                                                  NaN
    4890                                                  NaN"""),
        sep=r"\s+",
        names=["Titre_test"],
        index_col=[0],
        header=None,
    )
    
    ser2 = df2.Titre_test
    
    # regex is True by default
    ser2.str.contains(shortlist, regex=True)
    
    # as an example...
    out = ser2.str.contains(
        "T/61701/01/|abc|test",
        regex=True
    )
    print(out)
    
    4665     NaN
    4666     NaN
    4667     NaN
    4668     NaN
    4669    True
    4800     NaN
    4804     NaN
    4888     NaN
    4890     NaN
    Name: Titre_test, dtype: object