Search code examples
pythonpandasdataframesubstringjson-normalize

Pandas substring DataFrame column


I have a pandas DataFrame, with a column called positions, that includes string values with the syntax of the following examples:

[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]
[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]
[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]

I want to create four new columns in my pandas DataFrame, y_start, x_start, y_end, x_end, that are extractions of only the numbers.

E.g. for the example of the first row, my new columns would have the following values:

y_start = 49
x_start = 44
y_end = 78
x_end = 31

To summarise, I am looking to extract just the first, second, third, and four occurrence of numbers and save these to individual columns.


Solution

  • import pandas as pd
    from ast import literal_eval
    
    # dataframe
    data = {'data': ["[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]", "[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]", "[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]"]}
    
    df = pd.DataFrame(data)
    
    # convert the strings in the data column to dicts
    df.data = df.data.apply(literal_eval)
    
    # separate the strings into separate columns
    df[['start', 'end']] = pd.DataFrame(df.data.tolist(), index=df.index)
    
    # use json_normalize to convert the dicts to separate columns and join the dataframes with concat
    cleaned = pd.concat([pd.json_normalize(df.start).rename(lambda x: f'{x}_start', axis=1), pd.json_normalize(df.end).rename(lambda x: f'{x}_end', axis=1)], axis=1)
    
    # display(cleaned)
       y_start  x_start  y_end  x_end
    0       49       44     78     31
    1        1       63      0     23
    2       54        9     78      3