Let me first sketch the problem I am trying to solve. I am trying to replace a value '-1' with another value in the same column, based on two other values in the row containing the '-1'. To be more clear, here is an example. In the dataframe underneath, there are two missing values in the 'owner' column. What I want is to replace each '-1' value with a value in the 'owner' column that has the same 'price' value and which is the first one that occurs earlier in 'time' than the '-1' value. So, in this example, the first -1
value is found in line 3. The corresponding 'price' and 'time' is cheap
and 2011-01-01 13:30:00
. So now, I would like to replace the -1
with the name of the owner, that has a cheap
car, which is the first in time before the one under consideration, so the first time before 2011-01-01 13:30:00
. In this case, that would be the one in line 1, with owner name Jane
. This should also automatically be done for any following -1
values (e.g. for the Bmw).
brand price time owner
0 Honda cheap 2008-01-01 13:30:00 Marc
1 Toyota cheap 2009-01-01 13:30:00 Jane
2 Ford alot 2010-01-01 13:30:00 Phil
3 Audi cheap 2011-01-01 13:30:00 -1
4 Volvo cheap 2012-01-01 13:30:00 Jane
5 Bmw alot 2013-01-01 13:30:00 -1
The way I wanted to solve this problem, was by first locating the -1, then saving the corresponding price and time and then locating the first corresponding price in time and replace the owner value. I wanted to use the Pandas Loc method for this in the following way (I included the code to make the dataframe as well).
import pandas as pd
from datetime import datetime
cars = {'brand': ['Honda','Toyota','Ford','Audi','Volvo','Bmw'],
'price': ['cheap','cheap','alot','cheap','cheap','alot'],
'time': [datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2009 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2010 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2011 1:30 PM', '%m/%d/%Y %I:%M %p'),
datetime.strptime('1/1/2012 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2013 1:30 PM', '%m/%d/%Y %I:%M %p')],
'owner': ['Marc', 'Jane','Phil','-1','Jane','-1']}
df = pd.DataFrame(cars, columns = ['brand', 'price','time','owner'])
P_T = df.loc[df.owner == '-1',['price','time']
df.loc[df.owner == '-1', 'owner'] = df.loc[(df.price == P_T.price)&(df.time < P_T.time), 'owner']
As you can see in the final line, this is essentially a loc within a loc, whereas the conditions on the right hand side of the equation are both based on the P_T loc. However, here comes the problem, as I keep getting this error:
ValueError: Can only compare identically-labeled Series objects
I think I am doing something wrong and maybe doing things not as efficient as possible... So I would really appreciate some help with this matter.
I think you've over complicated this - what essentially you need to do is fill -1
values with the last corresenponding value grouped by the price column? if so a forward fill will do ffill
import numpy as np
s = df.replace('-1',np.nan).sort_values('time').groupby(['price'])['owner'].ffill()
df['owner'] = df.index.map(s)
print(df)
brand price time owner
0 Honda cheap 2008-01-01 13:30:00 Marc
1 Toyota cheap 2009-01-01 13:30:00 Jane
2 Ford alot 2010-01-01 13:30:00 Phil
3 Audi cheap 2011-01-01 13:30:00 Jane
4 Volvo cheap 2012-01-01 13:30:00 Jane
5 Bmw alot 2013-01-01 13:30:00 Phil