I'm trying to extract some values from 1 column. Some rows have 3 values I want to with different setups.
My dataset:
col1
0 1001100100 / hello street 2 a town1
1 1001100102 ;hello 3 towns2
2 STRZ19-0072 DT001001-0100 location1 town4
3 1001100103_hello street 3, town5
4 DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7
I need:
I need 3 new columns with different values focused on the 10 digit number starting with "100", the 'DT'number with 6 digits-4 digits and the street + town combined. Other values like 'STRZ19-0072' are not relevant.
I tried this:
df2 = df['col1'].str.extract(r"(?<col2>\d{10})|(?<col3>PR\d{6}-\d{4})|(?<col4>\w.*)")
This does not get me the DT-number from row 3.
Expected result:
col2 col3 address
0 1001100100 NaN hello street 2 a town1
1 1001100102 NaN hello 3 towns2
2 NaN DT001001-0100 location1 town4
3 1001100103 NaN hello street 3, town5
4 NaN DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7
Appreciate the help and effort. Thank you!
I'd do it in 3 separate steps:
df["col2"] = df["col1"].str.extract(r"(100\d{7})")
df["col3"] = df["col1"].str.extract(r"(DT\d{6}-\d{4})")
df["col4"] = df["col1"].str.extract(r"([A-Za-z][A-Za-z\d\s,]+)$")
print(df)
Prints:
col1 col2 col3 col4
0 1001100100 / hello street 2 a town1 1001100100 NaN hello street 2 a town1
1 1001100102 ;hello 3 towns2 1001100102 NaN hello 3 towns2
2 STRZ19-0072 DT001001-0100 location1 town4 NaN DT001001-0100 location1 town4
3 1001100103_hello street 3, town5 1001100103 NaN hello street 3, town5
4 DT001002-0100 street 78 NaN DT001002-0100 street 78
5 1001100107 DT001002-0102 street 6a town7 1001100107 DT001002-0102 street 6a town7