Python code to return elements in a Series


I am currently putting together a script for topic modelling scraped Tweets, but I am having a couple of issues. I want to search for a word and return every instance of it, plus the words before and after, to give better context for how the word is used.

I have tokenised all the tweets, and added them to a Series where the relative index position is used to identify surrounding words.

The code I currently have is:

    myseries = pd.Series(["it", 'was', 'a', 'bright', 'cold', 'day', 'in', 'april'], 
                          index= [0,1,2,3,4,5,6,7])

    def phrase(w):
        search_word= myseries[myseries == w].index[0]
        before = myseries[[search_word- 1]].index[0]
        after = myseries[[search_word+ 1]].index[0]
        print(myseries[before], myseries[search_word], myseries[after])

The code mostly works, but will return an error if the first or last word is searched, as it falls outside the index range of the Series. Is there a way to ignore out of range indexes and simply return what is within range?
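
For reference, the error comes from label lookups such as `myseries[[search_word - 1]]`, which in recent pandas versions raise a `KeyError` when the label does not exist. Positional slicing with `.iloc` behaves differently: a stop past the end is silently truncated, so only the start needs clamping (a negative start would count from the end instead). A minimal sketch, assuming the default 0..n-1 index from the example; the helper name `window` is made up for illustration:

```python
import pandas as pd

myseries = pd.Series(["it", "was", "a", "bright", "cold", "day", "in", "april"])

def window(series, pos, n=1):
    """Return up to n words either side of position pos (hypothetical helper)."""
    start = max(pos - n, 0)   # clamp: a negative start would wrap around
    stop = pos + n + 1        # a stop past the end is truncated by the slice
    return series.iloc[start:stop].tolist()

print(window(myseries, 0))  # first word: nothing before it
print(window(myseries, 7))  # last word: nothing after it
```

With the example data this should print `['it', 'was']` and then `['in', 'april']`, with no exception at either end.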

The current code also only returns the single word before and after the searched word. I want to pass a number into the function and have it return that many words before and after, but the offsets in my current code are hard-coded. Is there a way to have it return a designated range of elements?

I am also having issues creating a loop to search the entire series. Depending on what I write it either returns the first element and nothing else, or repeatedly prints the first element over and over again rather than continuing on with the search. The offending bit of code that keeps repeating the first element is:

    def ws(word):
        for element in tokened_df:
            if word == element:
                search_word = tokened_df[tokened_df == word].index[0]
                before = tokened_df[[search_word - 1]].index[0]
                after = tokened_df[[search_word + 1]].index[0]
                print(tokened_df[before], word, tokened_df[after])

There is obviously something simple I've overlooked, but I can't for the life of me figure out what it is. How can I modify the code so that if the same word is repeated in the series, it will return each instance of the word, plus the surrounding words? The way I want it to work follows the logic of 'if condition is true, execute the phrase function; if not true, continue down the series'.
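
A likely cause of the repetition is that `tokened_df[tokened_df == word].index[0]` searches the whole Series again on every pass and always takes the first match, so the loop position is never used. The sketch below uses `enumerate` to carry the position along instead. It assumes `tokened_df` is a Series of tokens (a small stand-in called `tokens` is used here), and it returns the results rather than printing them:

```python
import pandas as pd

# Assumed stand-in for tokened_df: a Series of tokens with a default index.
tokens = pd.Series(["a", "bright", "day", "a", "bright", "night"])

def ws(word, series):
    """Collect (before, word, after) for every occurrence of word."""
    hits = []
    last = len(series) - 1
    for pos, element in enumerate(series):
        if element == word:
            # Use the loop position itself rather than re-searching the Series.
            before = series.iloc[pos - 1] if pos > 0 else None
            after = series.iloc[pos + 1] if pos < last else None
            hits.append((before, element, after))
    return hits

print(ws("bright", tokens))
# [('a', 'bright', 'day'), ('a', 'bright', 'night')]
```

Words at either end of the Series get `None` for the missing neighbour instead of raising an error.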


Solution

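  • Something like this? I have added a repeated word ("bright") to your example, plus two parameters, n_before and n_after, that set how many surrounding words are returned. The function looks up every matching index rather than only the first, and clamps each window to the bounds of the Series.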

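    import pandas as pd

    myseries = pd.Series(["it", "was", "a", "bright", "bright", "cold", "day", "in", "april"],
                         index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

    def phrase(w, n_before=1, n_after=1):
        # Index labels of every occurrence of w, not just the first
        search_words = myseries[myseries == w].index

        for index in search_words:
            # Clamp the window so it never runs past either end of the Series.
            # Note: the labels are used as positions for .iloc, which is only
            # valid because the index here is the default 0..n-1.
            start_index = max(index - n_before, 0)
            end_index = min(index + n_after + 1, myseries.shape[0])
            print(myseries.iloc[start_index:end_index])

    phrase("bright", n_before=2, n_after=3)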

    This gives:

    1       was
    2         a
    3    bright
    4    bright
    5      cold
    6       day
    dtype: object
    2         a
    3    bright
    4    bright
    5      cold
    6       day
    7        in
    dtype: object
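
If the contexts are needed for further processing rather than for display, one possible variant returns them as strings. This is only a sketch: the function name `contexts` is hypothetical, and it finds matches by integer position with `numpy.flatnonzero`, so it does not depend on the Series having a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

def contexts(series, w, n_before=1, n_after=1):
    """Return one space-joined context string per occurrence of w."""
    # Integer positions of the matches, independent of the index labels.
    positions = np.flatnonzero((series == w).to_numpy())
    out = []
    for pos in positions:
        start = max(pos - n_before, 0)
        stop = min(pos + n_after + 1, len(series))
        out.append(" ".join(series.iloc[start:stop]))
    return out

words = ["it", "was", "a", "bright", "bright", "cold", "day", "in", "april"]
# Non-default labels, to show the lookup does not rely on them.
myseries = pd.Series(words, index=range(100, 109))

print(contexts(myseries, "bright", n_before=2, n_after=3))
# ['was a bright bright cold day', 'a bright bright cold day in']
```

A word that does not occur in the Series yields an empty list.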