Search code examples
pythonstringdataframepython-re

How to convert text into structured data, taking into account missing fields, in Python?


First, apologies if this sounds too basic. I have the following semi-structured data in text format, I need to parse these into a structured format: example:

Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black

**Note that Alex has two colors, while Bob does not have an address. **

I want something that looks like this:

example output

I think the right way is using regular expressions, but I'm struggling to split the text properly since some fields may be missing. What's a proper clean way to do this?

text='Name\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\nName\nBob\nColor\nblack'

profiles=re.split('(Name\n)', text, flags=re.IGNORECASE)
for profile in profiles:
    #get name
    name=re.split('(Name\n)|(Address\n)|(Color\n)', profile.strip(), flags=re.IGNORECASE)[0]
    print(name)
    
    #get address

    #get color

Solution

  • Try:

    s = """\
    Name
    Alex
    
    Address
    14 high street
    London
    
    Color
    blue
    red
    
    Name
    
    Bob
    Color
    black"""
    
    
    import pandas as pd
    from itertools import groupby
    
    colnames = ["Name", "Address", "Color"]
    
    
    col1, col2 = [], []
    for k, g in groupby(
        (l for l in s.splitlines() if l.strip()), lambda l: l in colnames
    ):
        (col2, col1)[k].append(" ".join(g))
    
    df = pd.DataFrame({"col1": col1, "col2": col2})
    df = df.assign(col3=df.col1.eq("Name").cumsum()).pivot(
        index="col3", columns="col1", values="col2"
    )
    df.index.name, df.columns.name = None, None
    
    
    df["Color"] = df["Color"].str.split()
    df = df.explode("Color").fillna("")
    
    print(df[colnames])
    

    Prints:

       Name                Address  Color
    1  Alex  14 high street London   blue
    1  Alex  14 high street London    red
    2   Bob                         black