Finding last word in tweepy tweet response python


I am receiving a stream of tweets with Python and would like to extract the last word of each tweet, or at least know how to reference it.

For example, given

NC don’t like working together www.linktowtweet.org

I would like to get back

together

Solution

  • I am not familiar with tweepy, so I am presuming you have the data in a Python string; there may be a better, tweepy-specific answer.

    However, given a string in Python, it is simple to extract the last word.

    Solution 1

    Use str.rfind(' '). The idea here is to find the space preceding the last word. Here is an example.

    text = "NC don’t like working together"
    text = text.rstrip() # Remove any whitespace at the end that would otherwise confuse the algorithm.
    last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
    print(last_word)
    

    Note: If a string is given with no words, last_word will be a blank string.
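
    As a small sketch of why the blank and single-word cases work: str.rfind returns -1 when no space is found, so the slice starts at index 0. The helper name last_word_rfind is made up for this illustration.

```python
def last_word_rfind(text):
    # Hypothetical helper name; same logic as the snippet above.
    text = text.rstrip()
    # rfind returns -1 when there is no space, so the slice starts at 0
    # and the whole (stripped) string is returned.
    return text[text.rfind(' ') + 1:]

print("together".rfind(' '))               # -1: no space found
print(repr(last_word_rfind("together")))   # 'together'
print(repr(last_word_rfind("   ")))        # ''
print(repr(last_word_rfind("get back ")))  # 'back'
```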

    Now this presumes that all of the words are separated by spaces. To handle newlines and tabs as well, use str.replace to turn them into spaces. The whitespace characters in Python are the space plus \t\n\x0b\x0c\r, but I presume only spaces, newlines and tabs will be found in Twitter messages.

    Also see: string.whitespace

    So a complete example (wrapped as a function) would be

    def last_word(text):
        text = text.replace('\n', ' ') # Replace newlines with spaces.
        text = text.replace('\t', ' ') # Replace tabs with spaces.
        text = text.rstrip(' ') # Remove trailing spaces.
        return text[text.rfind(' ')+1:]
    
    print(last_word("NC don’t like working together")) # Outputs "together".
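
    If other whitespace characters might turn up after all, the same idea can be extended to everything in string.whitespace. This is only a sketch: the name last_word_any_whitespace is hypothetical, and str.translate is swapped in for the chained str.replace calls.

```python
import string

# Map every ASCII whitespace character to a plain space.
_TO_SPACE = str.maketrans({ch: ' ' for ch in string.whitespace})

def last_word_any_whitespace(text):
    text = text.translate(_TO_SPACE)   # Normalise all whitespace to spaces.
    text = text.rstrip(' ')            # Remove trailing spaces.
    return text[text.rfind(' ') + 1:]

print(last_word_any_whitespace("working\ttogether\r\n"))  # together
```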
    

    This may still be the best solution for basic parsing, but there is something better for larger problems.

    Solution 2

    Regular Expressions

    These are a way to handle strings in Python that is a lot more flexible. Regexes, as they are often called, use their own language to specify a portion of text.

    For example, .*\s(\S+) specifies the last word in a string.

    Here it is again with a longer explanation.

    .*               # Match as many characters as possible.
    \s               # Until a whitespace ("\t\n\x0b\x0c\r ")
    (                # Remember the next section for the answer.
    \S+              # Match as many non-whitespace characters (a word) as possible.
    )                # End saved section.
    

    In Python, you would use this as follows.

    import re # Import the REGEX library.
    
    # Compile the pattern (DOTALL makes . match \n).
    LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL) 
    
    def last_word(text):
        m = LAST_WORD_PATTERN.match(text)
        if not m: # If there was not a last word to this text.
            return ''
        return m.group(1) # Otherwise return the last word.
    
    print(last_word("NC don’t like working together")) # Outputs "together".
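
    The annotated pattern can also be written directly in Python using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The following is a sketch; the names LAST_WORD_VERBOSE and last_word_verbose are made up here.

```python
import re

# Same pattern as above, written with inline comments via re.VERBOSE.
LAST_WORD_VERBOSE = re.compile(r"""
    .*      # Match as many characters as possible...
    \s      # ...up to a whitespace character.
    (       # Start the saved group.
    \S+     # One or more non-whitespace characters (the word).
    )       # End the saved group.
""", re.DOTALL | re.VERBOSE)

def last_word_verbose(text):
    m = LAST_WORD_VERBOSE.match(text)
    return m.group(1) if m else ''

print(last_word_verbose("NC don’t like working together"))  # together
print(repr(last_word_verbose("together")))                   # ''
```

    Note that the pattern requires a whitespace character before the last word, so a text consisting of a single word gives an empty string here, unlike the str.rfind version.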
    

    Now, even though this method is a lot less obvious, it has a couple of advantages. First, it is a lot more customizable. For example, if you wanted to match the final word but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word but ignore a link if that was the last thing in the text.

    Example:

    import re # Import the REGEX library.
    
    # Compile the pattern (DOTALL makes . match \n).
    LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
    
    def last_word(text):
        m = LAST_WORD_PATTERN.match(text)
        if not m: # If there was not a last word to this text.
            return ''
        return m.group(1) # Otherwise return the last word.
    
    print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
    

    The second advantage to this method is speed.

    As you can see in the Try it online! demo, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that the regex executes 0.2 µsec faster on my machine than in the demo.)

    Either way, the regex execution is extremely fast, even in the simple case, and there is no question that the regex is faster than any more complex string algorithm implemented in Python. So using the regex can also speed up the code.
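
    To check the timing on your own machine, a rough comparison with timeit might look like the sketch below. The function names last_word_str and last_word_re are chosen here for illustration, and the numbers will vary by machine and Python version.

```python
import re
import timeit

PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
TEXT = "NC don’t like working together"

def last_word_str(text):
    # String-method version (Solution 1).
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

def last_word_re(text):
    # Regex version (Solution 2).
    m = PATTERN.match(text)
    return m.group(1) if m else ''

# Time 100,000 calls of each; absolute numbers depend on the machine.
t_str = timeit.timeit(lambda: last_word_str(TEXT), number=100_000)
t_re = timeit.timeit(lambda: last_word_re(TEXT), number=100_000)
print(f"str methods: {t_str:.3f}s  regex: {t_re:.3f}s")
```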


    EDIT: Changed the URL-avoiding regex from

    re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
    

    to

    re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
    

    So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.

    To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
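
    The difference between the two patterns can be reproduced with a short side-by-side check. This is a sketch; OLD_PATTERN, NEW_PATTERN and last_word_with are names invented for the comparison.

```python
import re

OLD_PATTERN = re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
NEW_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

def last_word_with(pattern, text):
    # Return the captured last word, or '' if the pattern does not match.
    m = pattern.match(text)
    return m.group(1) if m else ''

tweet = "NC don’t like working together http://www.linktowtweet.org"
print(last_word_with(OLD_PATTERN, tweet))  # http://
print(last_word_with(NEW_PATTERN, tweet))  # together
```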