This is my first Stack post and I'm new in the worlds of Python and Pandas. I'm trying to get 3 separate values from a column. I tried both .str.split and .str.extract but can't get the result. The df I have:
wow
0 1000100011-DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222-Hellostrasse 56 Place
2 1000100013-DT000111-1133
3 1000106789 DT000111-1144 Street 45
4 DT000111-1441 Hellostreet 100
The result I want:
number1 number2 street3
0 1000100011 DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222 Hellostrasse 56 Place
2 1000100013 DT000111-1133 None
3 1000106789 DT000111-1144 Street 45
4 None DT000111-1441 Hellostreet 100
[number1] - numbers are random but always 10 digits [number2] - numbers are random but always 6 first '-' and then 4 digits. Characters are always 'DT'[street3] can consist of a street name, number and/or town. Those do not have to be separated.
Thank you in advance!
To solve it, you can use the .str.extract method, splitting the string into three groups using regular expressions.
import pandas as pd
df = pd.DataFrame({'wow': ['1000100011-DT000111-1111 Hellostreet 45 Town',
'1000100012 DT000122-1222-Hellostrasse 56 Place',
'1000100013-DT000111-1133',
'1000106789 DT000111-1144 Street 45',
'DT000111-1441 Hellostreet 100']})
df = df['wow'].str.extract(
r"(?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)")
df.replace('', None, inplace=True)
print(df)
number1 number2 street3
0 1000100011 DT000111-1111 Hellostreet 45 Town
1 1000100012 DT000122-1222 Hellostrasse 56 Place
2 1000100013 DT000111-1133 None
3 1000106789 DT000111-1144 Street 45
4 None DT000111-1441 Hellostreet 100
"
(?P<number1>\d{0,10}).*?(?P<number2>DT\d{6}-\d{4}).*?(?P<street3>\w.*|$)
"
g
Named Capture Group number1 (?P<number1>\d{0,10})
\d matches a digit (equivalent to [0-9])
{0,10} matches the previous token between 0 and 10 times, as many times as possible, giving back as needed (greedy)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group number2 (?P<number2>DT\d{6}-\d{4})
DT matches the characters DT literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{6} matches the previous token exactly 6 times
- matches the character - with index 4510 (2D16 or 558) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
{4} matches the previous token exactly 4 times
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Named Capture Group street3 (?P<street3>\w.*|$)
1st Alternative \w.*
\w matches any word character (equivalent to [a-zA-Z0-9_])
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative $
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
Global pattern flags
g modifier: global. All matches (don't return after first match)