I'm learning Python and want to split a string that has no delimiters. The string is in a column of a pandas DataFrame, and I want to split it into multiple columns.
What is the best way to do this?
Data:
"Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"
Expected output:
Naam | Omschrijving | IBAN | Kenmerk | MachtigingID | IncassantID | Valutadatum |
---|---|---|---|---|---|---|
TEST B.V. | Factuur 20-01-2024, klantnummer 1234567890 | NL41INGB0000467598 | 000011292292967 | M10024815057 | NL39KPN271247010001 | 24-01-2024 |
You can use a regex that takes the word before ':' as the key (a key contains no spaces, except when it ends in ' ID'):
import re
import pandas as pd
data = "Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"
# re.findall returns (key, value) tuples; dict() turns them into a single row
out = pd.DataFrame([dict(re.findall(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)', data))])
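To make the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE so that each part carries a comment. The shortened `sample` string and the names `PATTERN` and `pairs` are only for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# In verbose mode unescaped whitespace is ignored, so a literal space is written as [ ].
PATTERN = re.compile(r"""
    (\S+(?:[ ]ID)?)            # key: non-space characters, optionally followed by ' ID'
    :[ ]                       # the colon and space separating key and value
    ([^:]+?)                   # value: as few non-colon characters as possible
    [ ]*                       # spaces after the value are matched but not captured
    (?=                        # stop, without consuming, at either
        $                      #   the end of the string, or
      | \b[^:\s]+(?:[ ]ID)?:   #   the next key followed by a colon
    )
""", re.VERBOSE)

# Shortened sample built from the data in the question
sample = "Naam: TEST B.V. IBAN: NL41INGB0000467598 Machtiging ID: M10024815057"
pairs = PATTERN.findall(sample)
print(pairs)
# [('Naam', 'TEST B.V.'), ('IBAN', 'NL41INGB0000467598'), ('Machtiging ID', 'M10024815057')]
```

The lazy `[^:]+?` together with the lookahead is what ends each value right before the next key.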
Note that the value for Incassant ID contains everything up to the next key (here "NL39KPN271247010001 Doorlopende incasso"), so you might need to post-process it if you really just need the first word.
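One possible way to do that post-processing, assuming the identifier is always the first whitespace-separated token of the value, is to split the column and keep the first piece. The shortened `data` string here is only to keep the sketch small.

```python
import re
import pandas as pd

# Shortened version of the string from the question
data = "Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"
out = pd.DataFrame([dict(re.findall(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)', data))])

# Keep only the first whitespace-separated token of each value
# (assumption: the ID itself never contains a space)
out["Incassant ID"] = out["Incassant ID"].str.split().str[0]
print(out)
```

If other columns can carry trailing free text as well, the same `.str.split().str[0]` step could be applied to them.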
Output:
| | Naam | Omschrijving | IBAN | Kenmerk | Machtiging ID | Incassant ID | Valutadatum |
|---|---|---|---|---|---|---|---|
| 0 | TEST B.V. | Factuur 20-01-2024, klantnummer 1234567890. | NL41INGB0000467598 | 000011292292967 | M10024815057 | NL39KPN271247010001 Doorlopende incasso | 24-01-2024 |
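Since the question is about a DataFrame column rather than a single string, here is a minimal sketch of applying the same regex row by row. The column name `text` and the second row are made up for illustration; adapt them to your data.

```python
import re
import pandas as pd

PATTERN = re.compile(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)')

# Hypothetical input frame: the column name "text" and the second row are assumptions
df = pd.DataFrame({"text": [
    "Naam: TEST B.V. IBAN: NL41INGB0000467598 Valutadatum: 24-01-2024",
    "Naam: OTHER B.V. Kenmerk: 000011292292967",
]})

# Build one dict of key/value pairs per row; the DataFrame constructor aligns
# the keys into columns and fills keys missing from a row with NaN
parsed = pd.DataFrame([dict(PATTERN.findall(s)) for s in df["text"]], index=df.index)

# Attach the new columns to the original frame
result = df.join(parsed)
print(result)
```

Passing `index=df.index` keeps the parsed rows aligned with the original ones, so the join works even if the frame does not have a default 0..n-1 index.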