So I'm quite new to Python, and I was just wondering if it is possible for me to use it in order to search for text across multiple rows. Here is a screenshot of my dataframe:
https://i.sstatic.net/jeqpv.png
To make it clearer, what I would like to do is search for phrases or expressions containing more than one word, such as 'New Jersey,' however, each word makes up a separate row so I do not know how to go about including more than one row in the query. I would also, if possible, like to create a new column which will label any matches with 'M' and those without 'N.' All help is appreciated to make this easier for me!
The idea is to join all rows to be able to search multiple continuous words.
For example, we want to find the phrase "she wants to" in whole dataframe:
>>> df
subtitle
0 She # <- start here (1)
1 wants #
2 to # <- end here (1)
3 sing
4 she # <- start here (2)
5 wants #
6 to # <- end here (2)
7 act
8 she # <- start here (3)
9 wants #
10 to # <- end here (3)
11 dance
import re
search = "she wants to"
text = " ".join(df["subtitle"])
# index of start / end position of the word in text
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False
# find all iteration of the search text
for match in re.finditer(search, text, re.IGNORECASE):
idx1 = df[df["start"] == match.start()].index[0]
idx2 = df[df["end"] == match.end()].index[0]
df.loc[idx1:idx2, "match"] = True
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
Update: search for multiple terms:
Change only:
# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
12 If 54 56 True
13 you 57 60 True
14 sing 61 65 False
15 I 66 67 True
16 will 68 72 True
17 smile 73 78 False
18 if 79 81 True
19 you 82 85 True
20 laugh 86 91 False
21 I 92 93 True
22 will 94 98 True
23 smile 99 104 False
24 if 105 107 True
25 you 108 111 True
26 love 112 116 False
27 I 117 118 True
28 will 119 123 True
29 smile 124 129 False
Update 2: terms into text file:
$ cat terms.txt
she wants to
if you
I will
search = [term.strip() for term in open("terms.txt").readlines()]
search = fr"({'|'.join(search)})"