python-3.x string dataframe search string-concatenation

String Search in two different Dataframes

I have a list of strings in a dataframe df1

temp=df1.loc[df1['quartier']=='quartier1', 'Titre']

The output is:

108      T/42322/C/
108      T/37666/C/
108      T/31411/C/
108    T/118487/01/
108    T/118489/01/
           ...     
108    T/108034/01/
108    T/108045/01/
108    T/108047/01/
108    T/108032/01/
108    T/108045/01/

Shortlist=','.join(temp)

T/42322/C/,T/37666/C/,T/31411/C/,....

and I would like to know how many are present in df2:

temp_ilot=df2.loc[df2['quartier']=='quartier1', 'Titre_test']

The output of df2 is:

4665                                                  NaN
4666                                                  NaN
4667                                                  NaN
4668                                                  NaN
4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
                              ...                        
4800                                                  NaN
4804                                                  NaN
4888                                                  NaN
4890                                                  NaN

Any idea would be helpful.

I tried this code but I get a false response when I am sure that the information is there.

temp_ilot.str.contains(Shortlist)

4665      NaN
4666      NaN
4667      NaN
4668      NaN
4669    False
        ...  
4800      NaN
4804      NaN
4888      NaN
4890      NaN
4907      NaN
Name: Titre_Appart, Length: 84, dtype: object

Solution

When you use pd.Series.str.contains, you're asking if the series temp_ilot contains that string as is. In your question you ask:

and I would like to know how many are present in df2

If you want to see how many are present

You'll need to treat each part of the string as an individual rather than one single string.

from io import StringIO

import numpy as np
import pandas as pd

df = pd.read_csv(
    StringIO("""108      T/42322/C/
    108      T/37666/C/
    108      T/31411/C/
    108    T/118487/01/
    108    T/118489/01/
    108    T/108034/01/
    108    T/108045/01/
    108    T/108047/01/
    108    T/108032/01/
    108    T/108045/01/"""),
    sep=r"\s+",
    index_col=0,
    names=["Titre"],
    header=None,
)

ser = df.Titre

df2 = pd.read_csv(
    StringIO("""4665                                                  NaN
4666                                                  NaN
4667                                                  NaN
4668                                                  NaN
4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
4800                                                  NaN
4804                                                  NaN
4888                                                  NaN
4890                                                  NaN"""),
    sep=r"\s+",
    names=["Titre_test"],
    index_col=[0],
    header=None,
)

ser2 = df2.Titre_test

# drop NaN, split on comma, write to list, and convert to array
arr = np.array(ser2.dropna().str.split(",").to_list())

# check if each element is in `arr` 
# and sum to determine *how many*
ser.apply(lambda x: np.isin(x, arr)).sum()

# an example
eg = pd.Series(["T/61701/01/", "abc", "test"])

out = eg.apply(lambda x: np.isin(x, arr))
print(out)
# print(out.sum())  # >>> 1

0     True
1    False
2    False
dtype: bool

If you want to see if any are present

You could re-use your code but join with a pipe instead of a comma. This allows the regex "or" operator to be used.

from io import StringIO

import pandas as pd

df = pd.read_csv(
    StringIO("""108      T/42322/C/
    108      T/37666/C/
    108      T/31411/C/
    108    T/118487/01/
    108    T/118489/01/
    108    T/108034/01/
    108    T/108045/01/
    108    T/108047/01/
    108    T/108032/01/
    108    T/108045/01/"""),
    sep=r"\s+",
    index_col=0,
    names=["Titre"],
    header=None,
)

ser = df.Titre

# use a pipe (|) instead of a comma for regex "or"
shortlist = "|".join(ser)

df2 = pd.read_csv(
    StringIO("""4665                                                  NaN
4666                                                  NaN
4667                                                  NaN
4668                                                  NaN
4669    T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
4800                                                  NaN
4804                                                  NaN
4888                                                  NaN
4890                                                  NaN"""),
    sep=r"\s+",
    names=["Titre_test"],
    index_col=[0],
    header=None,
)

ser2 = df2.Titre_test

# regex is True by default
ser2.str.contains(shortlist, regex=True)

# as an example...
out = ser2.str.contains(
    "T/61701/01/|abc|test",
    regex=True
)
print(out)

4665     NaN
4666     NaN
4667     NaN
4668     NaN
4669    True
4800     NaN
4804     NaN
4888     NaN
4890     NaN
Name: Titre_test, dtype: object