Search code examples
pythonpandasrows

How to search for text across multiple rows in a pandas dataframe?


So I'm quite new to Python, and I was just wondering if it is possible for me to use it in order to search for text across multiple rows. Here is a screenshot of my dataframe:

https://i.sstatic.net/jeqpv.png

To make it clearer, what I would like to do is search for phrases or expressions containing more than one word, such as 'New Jersey,' however, each word makes up a separate row so I do not know how to go about including more than one row in the query. I would also, if possible, like to create a new column which will label any matches with 'M' and those without 'N.' All help is appreciated to make this easier for me!


Solution

  • The idea is to join all rows to be able to search multiple continuous words.

    For example, we want to find the phrase "she wants to" in whole dataframe:

    >>> df
       subtitle
    0       She  # <- start here (1)
    1     wants  #
    2        to  # <- end here (1)
    3      sing
    4       she  # <- start here (2)
    5     wants  #
    6        to  # <- end here (2)
    7       act
    8       she  # <- start here (3)
    9     wants  # 
    10       to  # <- end here (3)
    11    dance
    
    import re
    
    search = "she wants to"
    text = " ".join(df["subtitle"])
    
    # index of start / end position of the word in text
    end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
    start = end.shift(fill_value=-1) + 1
    
    # create additional columns
    df["start"] = start.tolist()
    df["end"] = end.tolist()
    df["match"] = False
    
    # find all iteration of the search text
    for match in re.finditer(search, text, re.IGNORECASE):
        idx1 = df[df["start"] == match.start()].index[0]
        idx2 = df[df["end"] == match.end()].index[0]
        df.loc[idx1:idx2, "match"] = True
    
    >>> df
       subtitle  start  end  match
    0       She      0    3   True
    1     wants      4    9   True
    2        to     10   12   True
    3      sing     13   17  False
    4       she     18   21   True
    5     wants     22   27   True
    6        to     28   30   True
    7       act     31   34  False
    8       she     35   38   True
    9     wants     39   44   True
    10       to     45   47   True
    11    dance     48   53  False
    

    Update: search for multiple terms:

    Change only:

    # search = "she wants to"
    search = ["she wants to", "if you", "I will"]
    search = fr"({'|'.join(search)})"
    
    # df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
    >>> df
       subtitle  start  end  match
    0       She      0    3   True
    1     wants      4    9   True
    2        to     10   12   True
    3      sing     13   17  False
    4       she     18   21   True
    5     wants     22   27   True
    6        to     28   30   True
    7       act     31   34  False
    8       she     35   38   True
    9     wants     39   44   True
    10       to     45   47   True
    11    dance     48   53  False
    12       If     54   56   True
    13      you     57   60   True
    14     sing     61   65  False
    15        I     66   67   True
    16     will     68   72   True
    17    smile     73   78  False
    18       if     79   81   True
    19      you     82   85   True
    20    laugh     86   91  False
    21        I     92   93   True
    22     will     94   98   True
    23    smile     99  104  False
    24       if    105  107   True
    25      you    108  111   True
    26     love    112  116  False
    27        I    117  118   True
    28     will    119  123   True
    29    smile    124  129  False
    

    Update 2: terms into text file:

    $ cat terms.txt
    she wants to
    if you
    I will
    
    search = [term.strip() for term in open("terms.txt").readlines()]
    search = fr"({'|'.join(search)})"