Search code examples
pythonpandassplitextract

Pandas: how to split or extract multiple values from column


This is my first Stack post and I'm new in the worlds of Python and Pandas. I'm trying to get 3 separate values from a column. I tried both .str.split and .str.extract but can't get the result. The df I have:

     wow
0    1000100011-DT000111-1111 Hellostreet 45  Town
1    1000100012  DT000122-1222-Hellostrasse 56 Place
2    1000100013-DT000111-1133
3    1000106789 DT000111-1144 Street 45
4    DT000111-1441 Hellostreet 100

The result I want:

     number1      number2          street3
0    1000100011   DT000111-1111    Hellostreet 45  Town
1    1000100012   DT000122-1222    Hellostrasse 56 Place
2    1000100013   DT000111-1133    None
3    1000106789   DT000111-1144    Street 45
4    None         DT000111-1441    Hellostreet 100

[number1] - numbers are random but always 10 digits [number2] - numbers are random but always 6 first '-' and then 4 digits. Characters are always 'DT'[street3] can consist of a street name, number and/or town. Those do not have to be separated.

Thank you in advance!


Solution

  • To solve it, you can use the .str.extract method, splitting the string into three groups using regular expressions.

    import pandas as pd
    
    
    df = pd.DataFrame({'wow': ['1000100011-DT000111-1111 Hellostreet 45  Town',
                               '1000100012  DT000122-1222-Hellostrasse 56 Place',
                               '1000100013-DT000111-1133',
                               '1000106789 DT000111-1144 Street 45',
                               'DT000111-1441 Hellostreet 100']})
    
    df = df['wow'].str.extract(
        r"(?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)")
    df.replace('', None, inplace=True)
    print(df)
    
          number1        number2                street3
    0  1000100011  DT000111-1111   Hellostreet 45  Town
    1  1000100012  DT000122-1222  Hellostrasse 56 Place
    2  1000100013  DT000111-1133                   None
    3  1000106789  DT000111-1144              Street 45
    4        None  DT000111-1441        Hellostreet 100
    
    
    "
    (?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)
    "
    g
    Named Capture Group number1 (?P<number1>\d{0,10})
    \d matches a digit (equivalent to [0-9])
    {0,10} matches the previous token between 0 and 10 times, as many times as possible, giving back as needed (greedy)
    . matches any character (except for line terminators)
    *? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    Named Capture Group number2 (?P<number2>DT\d{6}-\d{4})
    DT matches the characters DT literally (case sensitive)
    \d matches a digit (equivalent to [0-9])
    {6} matches the previous token exactly 6 times
    - matches the character - with index 4510 (2D16 or 558) literally (case sensitive)
    \d matches a digit (equivalent to [0-9])
    {4} matches the previous token exactly 4 times
    . matches any character (except for line terminators)
    *? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    Named Capture Group street3 (?P<street3>\w.*|$)
    1st Alternative \w.*
    \w matches any word character (equivalent to [a-zA-Z0-9_])
    . matches any character (except for line terminators)
    * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    2nd Alternative $
    $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
    Global pattern flags 
    g modifier: global. All matches (don't return after first match)