python regex pandas string-operations

Pandas string operations (extract and findall)


Here are two examples of string methods from the Python Data Science Handbook that I am having trouble understanding.

  1. str.extract()
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte.str.extract('([A-Za-z]+)')

This operation returns the first name of each element in the Series. I don't understand the expression passed to the extract function.

  2. str.findall()
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

This operation returns the original element if it starts and ends with a consonant, and an empty list otherwise. I figure that the ^ operator negates the vowels, and that the * operator combines the upper-case and lower-case vowel cases. Yet I do not understand the rest of the operators.

Please help me understand these input expressions. Thanks in advance.


Solution

  • The first ^ anchors the match at the beginning of the string, whereas $ anchors it at the end of the string. Here is an example:

    >>> import re
    >>> s = 'a123a'
    >>> re.findall('^a', s)
    ['a']

    This returns only one a because the ^ sign restricts the match to the beginning of the string.

    The same goes for $, which only matches at the end of the string. Here is an example:

    >>> import re
    >>> s = 'a123a'
    >>> re.findall('a$', s)
    ['a']

    Edited:

    The r prefix marks a raw string. A raw string is taken exactly as it is written. For example, a backslash \ does not start an escape sequence; it stays a regular backslash.
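As a rough sketch of how these pieces fit together, the snippet below applies both patterns from the question with the standard re module on a plain list instead of a pandas Series. The names names, first_names, pattern and matches are illustrative and not from the original post. The pandas methods should behave the same way element by element: str.extract returns the capture group of the first match, and str.findall returns every match.

```python
import re

# Plain-list stand-in for the Series in the question (illustrative only).
names = ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
         'Eric Idle', 'Terry Jones', 'Michael Palin']

# Raw string: the backslash is kept as a literal character.
print(len('\n'))    # 1: a single newline character
print(len(r'\n'))   # 2: a backslash followed by the letter n

# '([A-Za-z]+)'
#   [A-Za-z]  one ASCII letter, upper or lower case
#   +         one or more of the preceding item
#   ( ... )   capture group: the part that gets returned
# The first run of letters stops at the space, so this yields the first name.
first_names = [re.search(r'([A-Za-z]+)', name).group(1) for name in names]
print(first_names)
# ['Graham', 'John', 'Terry', 'Eric', 'Terry', 'Michael']

# '^[^AEIOU].*[^aeiou]$'
#   ^         outside brackets: start of the string
#   [^AEIOU]  ^ inside brackets negates: one character that is NOT an upper-case vowel
#   .*        any character, zero or more times
#   [^aeiou]  one character that is NOT a lower-case vowel
#   $         end of the string
pattern = r'^[^AEIOU].*[^aeiou]$'
matches = [re.findall(pattern, name) for name in names]
print(matches)
# [['Graham Chapman'], [], ['Terry Gilliam'], [], ['Terry Jones'], ['Michael Palin']]
```

Note that ^ has two different meanings here: at the start of the pattern it is an anchor, while as the first character inside square brackets it negates the character set. The * does not combine upper and lower case; it repeats the preceding ".", which matches any character.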