Search code examples
pythonpandasregexextract

Pandas: extracting values via regex in 1 column


I'm trying to extract some values from 1 column. Some rows have 3 values I want to with different setups.

My dataset:

     col1
0    1001100100 / hello street 2 a town1
1    1001100102 ;hello 3 towns2
2    STRZ19-0072 DT001001-0100 location1 town4
3    1001100103_hello street 3, town5
4    DT001002-0100 street  78
5    1001100107 DT001002-0102 street 6a town7

I need:

I need 3 new columns with different values focused on the 10 digit number starting with "100", the 'DT'number with 6 digits-4 digits and the street + town combined. Other values like 'STRZ19-0072' are not relevant.

I tried this:

df2 = df['col1'].str.extract(r"(?<col2>\d{10})|(?<col3>PR\d{6}-\d{4})|(?<col4>\w.*)")

This does not get me the DT-number from row 3.

Expected result:

     col2        col3           address 
0    1001100100  NaN            hello street 2 a town1
1    1001100102  NaN            hello 3 towns2
2    NaN         DT001001-0100  location1 town4
3    1001100103  NaN            hello street 3, town5
4    NaN         DT001002-0100  street  78
5    1001100107  DT001002-0102  street 6a town7

Appreciate the help and effort. Thank you!


Solution

  • I'd do it in 3 separate steps:

    df["col2"] = df["col1"].str.extract(r"(100\d{7})")
    df["col3"] = df["col1"].str.extract(r"(DT\d{6}-\d{4})")
    df["col4"] = df["col1"].str.extract(r"([A-Za-z][A-Za-z\d\s,]+)$")
    print(df)
    

    Prints:

                                            col1        col2           col3                    col4
    0        1001100100 / hello street 2 a town1  1001100100            NaN  hello street 2 a town1
    1                 1001100102 ;hello 3 towns2  1001100102            NaN          hello 3 towns2
    2  STRZ19-0072 DT001001-0100 location1 town4         NaN  DT001001-0100         location1 town4
    3           1001100103_hello street 3, town5  1001100103            NaN   hello street 3, town5
    4                    DT001002-0100 street 78         NaN  DT001002-0100               street 78
    5   1001100107 DT001002-0102 street 6a town7  1001100107  DT001002-0102         street 6a town7