Search code examples
pythonpandasreplacevalueerrorpandas-loc

Is it possible to use a loc inside a loc in Pandas to replace a value?


Let me first sketch the problem I am trying to solve. I am trying to replace a value '-1' with another value in the same column, based on two other values in the row containing the '-1'. To be more clear, here is an example. In the dataframe underneath, there are two missing values in the 'owner' column. What I want is to replace each '-1' value with a value in the 'owner' column that has the same 'price' value and which is the first one that occurs earlier in 'time' than the '-1' value. So, in this example, the first -1 value is found in line 3. The corresponding 'price' and 'time' is cheap and 2011-01-01 13:30:00. So now, I would like to replace the -1 with the name of the owner, that has a cheapcar, which is the first in time before the one under consideration, so the first time before 2011-01-01 13:30:00. In this case, that would be the one in line 1, with owner name Jane. This should also automatically be done for any following -1 values (e.g. for the Bmw).

   brand   price time                 owner
0   Honda  cheap 2008-01-01 13:30:00  Marc
1  Toyota  cheap 2009-01-01 13:30:00  Jane
2    Ford   alot 2010-01-01 13:30:00  Phil
3    Audi  cheap 2011-01-01 13:30:00    -1
4   Volvo  cheap 2012-01-01 13:30:00  Jane
5     Bmw   alot 2013-01-01 13:30:00    -1

The way I wanted to solve this problem, was by first locating the -1, then saving the corresponding price and time and then locating the first corresponding price in time and replace the owner value. I wanted to use the Pandas Loc method for this in the following way (I included the code to make the dataframe as well).

import pandas as pd
from datetime import datetime

cars = {'brand': ['Honda','Toyota','Ford','Audi','Volvo','Bmw'],
        'price': ['cheap','cheap','alot','cheap','cheap','alot'],
        'time': [datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2009 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2010 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2011 1:30 PM', '%m/%d/%Y %I:%M %p'),
                 datetime.strptime('1/1/2012 1:30 PM', '%m/%d/%Y %I:%M %p'),datetime.strptime('1/1/2013 1:30 PM', '%m/%d/%Y %I:%M %p')],
        'owner': ['Marc', 'Jane','Phil','-1','Jane','-1']}

df = pd.DataFrame(cars, columns = ['brand', 'price','time','owner'])

P_T = df.loc[df.owner == '-1',['price','time']

df.loc[df.owner == '-1', 'owner'] = df.loc[(df.price == P_T.price)&(df.time < P_T.time), 'owner']

As you can see in the final line, this is essentially a loc within a loc, whereas the conditions on the right hand side of the equation are both based on the P_T loc. However, here comes the problem, as I keep getting this error: ValueError: Can only compare identically-labeled Series objects I think I am doing something wrong and maybe doing things not as efficient as possible... So I would really appreciate some help with this matter.


Solution

  • I think you've over complicated this - what essentially you need to do is fill -1 values with the last corresenponding value grouped by the price column? if so a forward fill will do ffill

    import numpy as np
    s = df.replace('-1',np.nan).sort_values('time').groupby(['price'])['owner'].ffill()
    
    df['owner'] = df.index.map(s)
    
    
    
    print(df)
    
    
        brand  price                time owner
    0   Honda  cheap 2008-01-01 13:30:00  Marc
    1  Toyota  cheap 2009-01-01 13:30:00  Jane
    2    Ford   alot 2010-01-01 13:30:00  Phil
    3    Audi  cheap 2011-01-01 13:30:00  Jane
    4   Volvo  cheap 2012-01-01 13:30:00  Jane
    5     Bmw   alot 2013-01-01 13:30:00  Phil