First, apologies if this sounds too basic. I have the following semi-structured data in text format, I need to parse these into a structured format: example:
Name
Alex
Address
14 high street
London
Color
blue
red
Name
Bob
Color
black
**Note that Alex has two colors, while Bob does not have an address. **
I want something that looks like this:
I think the right way is using regular expressions, but I'm struggling to split the text properly since some fields may be missing. What's a proper clean way to do this?
text='Name\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\nName\nBob\nColor\nblack'
profiles=re.split('(Name\n)', text, flags=re.IGNORECASE)
for profile in profiles:
#get name
name=re.split('(Name\n)|(Address\n)|(Color\n)', profile.strip(), flags=re.IGNORECASE)[0]
print(name)
#get address
#get color
Try:
s = """\
Name
Alex
Address
14 high street
London
Color
blue
red
Name
Bob
Color
black"""
import pandas as pd
from itertools import groupby
colnames = ["Name", "Address", "Color"]
col1, col2 = [], []
for k, g in groupby(
(l for l in s.splitlines() if l.strip()), lambda l: l in colnames
):
(col2, col1)[k].append(" ".join(g))
df = pd.DataFrame({"col1": col1, "col2": col2})
df = df.assign(col3=df.col1.eq("Name").cumsum()).pivot(
index="col3", columns="col1", values="col2"
)
df.index.name, df.columns.name = None, None
df["Color"] = df["Color"].str.split()
df = df.explode("Color").fillna("")
print(df[colnames])
Prints:
Name Address Color
1 Alex 14 high street London blue
1 Alex 14 high street London red
2 Bob black