Tags: python, pandas, string, dataframe, split

Split a string in a pandas DataFrame column without delimiters


I'm learning Python and want to split a string that has no consistent delimiter. The string is stored in a pandas DataFrame column, and I want to split it into multiple columns, one per field.

What is the best way?

Data:

"Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"

Expected output:

Naam       Omschrijving                                IBAN                Kenmerk          MachtigingID  IncassantID          Valutadatum
TEST B.V.  Factuur 20-01-2024, klantnummer 1234567890  NL41INGB0000467598  000011292292967  M10024815057  NL39KPN271247010001  24-01-2024

Solution

  • Use a regex to extract the word before each ':' as the key (a single word without spaces, unless it ends in ' ID'):

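    import re

    import pandas as pd
    
    data = "Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"
    
    # Each match is a (key, value) pair: the key is one word, optionally followed by ' ID';
    # the value runs up to the next 'key:' or to the end of the string.
    out = pd.DataFrame([dict(re.findall(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)', data))])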
    

    Note that the value for 'Incassant ID' runs up to the next key, so it also includes 'Doorlopende incasso'. You may need to post-process it if you only want the first word.

    Output:

            Naam                                 Omschrijving                IBAN          Kenmerk Machtiging ID                             Incassant ID Valutadatum
    0  TEST B.V.  Factuur 20-01-2024, klantnummer 1234567890.  NL41INGB0000467598  000011292292967  M10024815057  NL39KPN271247010001 Doorlopende incasso  24-01-2024
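The post-processing mentioned in the note could look roughly like the sketch below. It reuses the regex from the answer and cleans the parsed dict before it is handed to pandas. The helper name `parse_fields` is hypothetical. The cleanup rules are assumptions based on the expected output in the question: an 'Incassant ID' contains no spaces, the trailing period in 'Omschrijving' is unwanted, and the column names should have no spaces.

```python
import re

# Same pattern as in the answer: key, colon, value up to the next key.
PATTERN = re.compile(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)')


def parse_fields(text):
    """Parse one 'Key: value ...' string into a dict and tidy the values."""
    fields = dict(PATTERN.findall(text))
    # Assumption: an 'Incassant ID' never contains spaces, so keep only
    # the first whitespace-separated token and drop the text after it.
    if fields.get("Incassant ID"):
        fields["Incassant ID"] = fields["Incassant ID"].split()[0]
    # Assumption: a trailing period in 'Omschrijving' is not wanted.
    if "Omschrijving" in fields:
        fields["Omschrijving"] = fields["Omschrijving"].rstrip(".")
    # Remove spaces from the keys so they match the requested column names.
    return {key.replace(" ", ""): value for key, value in fields.items()}


data = "Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024"

record = parse_fields(data)
print(record["IncassantID"])
```

The resulting dict can be passed to `pd.DataFrame([record])` in the same way as the unprocessed one in the answer.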
    

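The question mentions that the strings live in a DataFrame column, while the answer parses a single string. A minimal sketch of applying the same regex to every row follows. The frame `df` and the column name `text` are assumptions made for illustration, since the question does not show the actual frame.

```python
import re

import pandas as pd

# Same pattern as in the answer: key, colon, value up to the next key.
PATTERN = re.compile(r'(\S+(?: ID)?): ([^:]+?) *(?=$|\b[^:\s]+(?: ID)?:)')

# Hypothetical input frame; the column name "text" is an assumption.
df = pd.DataFrame({"text": [
    "Naam: TEST B.V. Omschrijving: Factuur 20-01-2024, klantnummer 1234567890. IBAN: NL41INGB0000467598 Kenmerk: 000011292292967 Machtiging ID: M10024815057 Incassant ID: NL39KPN271247010001 Doorlopende incasso Valutadatum: 24-01-2024",
]})

# Build one dict per row, then one column per key; reuse the original index
# so the parsed columns line up with the source rows.
parsed = pd.DataFrame(
    [dict(PATTERN.findall(text)) for text in df["text"]],
    index=df.index,
)

# Attach the parsed columns to the original frame.
result = df.join(parsed)
print(result.drop(columns="text"))
```

If some rows lack a key, pandas fills the corresponding cell with NaN when it builds the frame from the list of dicts.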