I have a dataframe that looks like:
data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"},
{'price2022': "22018", 'company': "apple",'price2023': "22018"},
{'price2022': "32020", 'company': "general electric",'price2023': "31020"},
{'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
]
df1 = pd.DataFrame(data1)
The first value is a multiline string, which also contains the redundant string ''. Instead of the multiline string "12014\n205", I would like to have a single line number that is the sum of the two lines. (12014+205=12219).
I could try something like this:
dfa[['b', 'c']] = df1["price2022"].apply(lambda x: pd.Series(str(x).split("\n")))
dfa['c'] = dfa['c'].map(lambda x: str(x)[:-4]) #gets rid of the ****, probably not the smartest method
dfa['b']= dfa['b'].astype('int')
dfa['c'].replace('', 0, inplace=True)
dfa['c']= dfa['c'].astype('int')
dfa['d']=dfa['b']+dfa['c']
However, this seems incredibly inefficient. Not to mention that I have several 'price' columns I need to run through. Creating new variables for each seems like a bad way to deal with this. Is there a more efficient way to do this without creating multiple new columns? How would I extend this such that I don't have to go have a look which columns have these multi lines and which don't, but the code just runs through all?
You can loop over you columns and apply a function that will split/sum the prices:
import pandas as pd
data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"},
{'price2022': "22018", 'company': "apple",'price2023': "22018"},
{'price2022': "32020", 'company': "general electric",'price2023': "31020"},
{'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
]
df1 = pd.DataFrame(data1)
for col in df1.columns:
if col.startswith('price'):
df1[f'{col}_sum'] = df1[col].apply(lambda x: sum(map(int, x.strip('****').split('\n'))))
print(df1)
Output:
price2022 company price2023 price2022_sum price2023_sum
0 12014\n205**** toyota 10014\n180**** 12219 10194
1 22018 apple 22018 22018 22018
2 32020 general electric 31020 32020 31020
3 80170 alibaba 83170 80170 83170