regex, regex-group

How can I extract non-digit characters and the digit characters at the end of a string?


I have a string that has the following structure:

digit-word(s)-digit.

For example:

2029 AG.IZTAPALAPA 2

I want to extract the word(s) in the middle, and the digit at the end of the string.

I want to extract AG.IZTAPALAPA and 2 together in a single capture group, so that the result looks like:

AG.IZTAPALAPA 2

I managed to capture them as individual capture groups, but not as a single one:

town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)

town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)

Thank you for your help!


Solution

  • You can use a single capture group. For the example string, it matches a single "word" that consists of uppercase chars A-Z, optionally with dots in the middle (a dot cannot be at the start or the end), followed by a space and 1 or more digits.

    \b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
    

    Explanation

    • \b A word boundary
    • \d+ Match 1+ digits, followed by a space
    • ( Capture group 1
      • [A-Z]+ Match 1+ occurrences of an uppercase char A-Z
      • (?:\.[A-Z]+)* \d+ Repeat 0+ times a dot followed by 1+ chars A-Z, then match a space and 1+ digits
    • ) Close group 1
    • \b A word boundary
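
As a quick check of the pattern above, here is a minimal sketch using Python's built-in re module. The helper name extract_name_and_number is made up for this illustration and is not part of the question or of pandas.

```python
import re

# Pattern from the answer: leading digits and a space, then one capture
# group holding the dotted uppercase word, a space, and the trailing digits.
pattern = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")


def extract_name_and_number(text):
    """Hypothetical helper: return capture group 1, or None if no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_name_and_number("2029 AG.IZTAPALAPA 2"))  # AG.IZTAPALAPA 2
```

With pandas, the same pattern string should work in the call from the question, for example town_state['Town'].str.extract(r'\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b', expand=False), since str.extract returns the first capture group of the first match.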

    Or you can make the pattern a bit broader, matching one or more space-separated "words" that consist of word characters or dots:

    \b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
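
To see where the two patterns differ, the sketch below runs both against the example string and against a made-up multi-word input ("15 SAN JUAN 7" is an assumption for illustration, not data from the question). The names strict, broad and first_group are hypothetical.

```python
import re

# Both patterns as given in the answer.
strict = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")
broad = re.compile(r"\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b")


def first_group(pattern, text):
    """Hypothetical helper: return capture group 1, or None if no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# The second sample is invented to show a name made of several words.
for sample in ["2029 AG.IZTAPALAPA 2", "15 SAN JUAN 7"]:
    print(sample, "->", first_group(strict, sample), "|", first_group(broad, sample))
```

The stricter pattern rejects the multi-word sample, while the broader one captures it. Note that \w also matches digits and underscores, so the broader pattern accepts digits inside the name as well.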
    
