Search code examples
pythonpandasmultiline

Multiline string containing numbers


I have a dataframe that looks like:

 data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"}, 
    {'price2022': "22018", 'company': "apple",'price2023': "22018"}, 
    {'price2022': "32020", 'company': "general electric",'price2023': "31020"}, 
    {'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
   ]
 df1 = pd.DataFrame(data1)

The first value is a multiline string, which also contains the redundant string ''. Instead of the multiline string "12014\n205", I would like to have a single line number that is the sum of the two lines. (12014+205=12219).

I could try something like this:

dfa[['b', 'c']] = df1["price2022"].apply(lambda x: pd.Series(str(x).split("\n")))
dfa['c'] = dfa['c'].map(lambda x: str(x)[:-4])   #gets rid of the ****, probably not the smartest method
dfa['b']=  dfa['b'].astype('int')
dfa['c'].replace('', 0, inplace=True)
dfa['c']=  dfa['c'].astype('int')
dfa['d']=dfa['b']+dfa['c']

However, this seems incredibly inefficient. Not to mention that I have several 'price' columns I need to run through. Creating new variables for each seems like a bad way to deal with this. Is there a more efficient way to do this without creating multiple new columns? How would I extend this such that I don't have to go have a look which columns have these multi lines and which don't, but the code just runs through all?


Solution

  • You can loop over you columns and apply a function that will split/sum the prices:

    import pandas as pd
    
    data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"}, 
        {'price2022': "22018", 'company': "apple",'price2023': "22018"}, 
        {'price2022': "32020", 'company': "general electric",'price2023': "31020"}, 
        {'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
       ]
    df1 = pd.DataFrame(data1)
    
    for col in df1.columns:
        if col.startswith('price'):
            df1[f'{col}_sum'] = df1[col].apply(lambda x: sum(map(int, x.strip('****').split('\n'))))
    
    print(df1)
    

    Output:

            price2022           company       price2023  price2022_sum  price2023_sum
    0  12014\n205****            toyota  10014\n180****          12219          10194
    1           22018             apple           22018          22018          22018
    2           32020  general electric           31020          32020          31020
    3           80170           alibaba           83170          80170          83170