I have a list of strings in a dataframe df1
temp=df1.loc[df1['quartier']=='quartier1', 'Titre']
The output is:
108 T/42322/C/
108 T/37666/C/
108 T/31411/C/
108 T/118487/01/
108 T/118489/01/
...
108 T/108034/01/
108 T/108045/01/
108 T/108047/01/
108 T/108032/01/
108 T/108045/01/
Shortlist=','.join(temp)
T/42322/C/,T/37666/C/,T/31411/C/,....
and I would like to know how many are present in df2
:
temp_ilot=df2.loc[df2['quartier']=='quartier1', 'Titre_test']
The output of df2
is:
4665 NaN
4666 NaN
4667 NaN
4668 NaN
4669 T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
...
4800 NaN
4804 NaN
4888 NaN
4890 NaN
Any idea would be helpful.
I tried this code but I get a false response when I am sure that the information is there.
temp_ilot.str.contains(Shortlist)
4665 NaN
4666 NaN
4667 NaN
4668 NaN
4669 False
...
4800 NaN
4804 NaN
4888 NaN
4890 NaN
4907 NaN
Name: Titre_Appart, Length: 84, dtype: object
When you use pd.Series.str.contains
, you're asking if the series temp_ilot
contains that string as is. In your question you ask:
and I would like to know how many are present in
df2
You'll need to treat each part of the string as an individual rather than one single string.
from io import StringIO
import numpy as np
import pandas as pd
df = pd.read_csv(
StringIO("""108 T/42322/C/
108 T/37666/C/
108 T/31411/C/
108 T/118487/01/
108 T/118489/01/
108 T/108034/01/
108 T/108045/01/
108 T/108047/01/
108 T/108032/01/
108 T/108045/01/"""),
sep=r"\s+",
index_col=0,
names=["Titre"],
header=None,
)
ser = df.Titre
df2 = pd.read_csv(
StringIO("""4665 NaN
4666 NaN
4667 NaN
4668 NaN
4669 T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
4800 NaN
4804 NaN
4888 NaN
4890 NaN"""),
sep=r"\s+",
names=["Titre_test"],
index_col=[0],
header=None,
)
ser2 = df2.Titre_test
# drop NaN, split on comma, write to list, and convert to array
arr = np.array(ser2.dropna().str.split(",").to_list())
# check if each element is in `arr`
# and sum to determine *how many*
ser.apply(lambda x: np.isin(x, arr)).sum()
# an example
eg = pd.Series(["T/61701/01/", "abc", "test"])
out = eg.apply(lambda x: np.isin(x, arr))
print(out)
# print(out.sum()) # >>> 1
0 True
1 False
2 False
dtype: bool
You could re-use your code but join with a pipe instead of a comma. This allows the regex "or" operator to be used.
from io import StringIO
import pandas as pd
df = pd.read_csv(
StringIO("""108 T/42322/C/
108 T/37666/C/
108 T/31411/C/
108 T/118487/01/
108 T/118489/01/
108 T/108034/01/
108 T/108045/01/
108 T/108047/01/
108 T/108032/01/
108 T/108045/01/"""),
sep=r"\s+",
index_col=0,
names=["Titre"],
header=None,
)
ser = df.Titre
# use a pipe (|) instead of a comma for regex "or"
shortlist = "|".join(ser)
df2 = pd.read_csv(
StringIO("""4665 NaN
4666 NaN
4667 NaN
4668 NaN
4669 T/61701/01/,T/61703/01/,T/61700/01/,T/61706/01...
4800 NaN
4804 NaN
4888 NaN
4890 NaN"""),
sep=r"\s+",
names=["Titre_test"],
index_col=[0],
header=None,
)
ser2 = df2.Titre_test
# regex is True by default
ser2.str.contains(shortlist, regex=True)
# as an example...
out = ser2.str.contains(
"T/61701/01/|abc|test",
regex=True
)
print(out)
4665 NaN
4666 NaN
4667 NaN
4668 NaN
4669 True
4800 NaN
4804 NaN
4888 NaN
4890 NaN
Name: Titre_test, dtype: object