python, pandas, linear-interpolation

Python Pandas Linear Interpolate Y over X


I'm trying to answer this Udacity question: https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48532778/m-48635592

I like Python & Pandas so I'm using Pandas (version 0.14)

I have this DataFrame:

import pandas as pd

df = pd.DataFrame(dict(size=(1400,
                             2400,
                             1800,
                             1900,
                             1300,
                             1100),
                       cost=(112000,
                             192000,
                             144000,
                             152000,
                             104000,
                             88000)))

I added a 2,100 sq ft house to my DataFrame. Notice that it has no cost; that is the question: what would you expect to pay for a house of 2,100 sq ft?

 df = df.append(pd.DataFrame({'size': (2100,)}), ignore_index=True)  # assign the result; append returns a new DataFrame

The question wants you to answer what cost/price you expect to pay, using linear interpolation.

Can Pandas interpolate? And how?

I tried this:

df.interpolate(method='linear')

But it gave me a cost of 88,000, which is just the last cost value repeated.

I tried this:

df.sort('size').interpolate(method='linear')

But it gave me a cost of 172,000, which is just halfway between the costs of 152,000 and 192,000. Closer, but not what I want. The correct answer is 168,000 (because there is a "slope" of $80/sqft).
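To make the gap between the two numbers concrete, here is a small sketch of the arithmetic in plain Python. The variable names are illustrative; the two neighbouring houses are the rows just below and just above 2,100 sq ft once the data is sorted by size.

```python
# Neighbours of the 2,100 sq ft house after sorting by size
x0, y0 = 1900, 152000   # next smaller house: size, cost
x1, y1 = 2400, 192000   # next larger house: size, cost
x = 2100                # size we want a cost for

slope = (y1 - y0) / (x1 - x0)       # dollars per square foot
by_position = (y0 + y1) / 2         # midpoint of the two costs, ignoring size
by_size = y0 + slope * (x - x0)     # weighted by distance in square feet

print(slope, by_position, by_size)  # 80.0 172000.0 168000.0
```

The midpoint matches what the sorted method='linear' call returned, which suggests that call is working from row positions rather than from the size column.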

EDIT:

I checked these SO questions


Solution

  • Pandas' method='linear' interpolation does what I call "1D" interpolation: it ignores the index and treats the values as equally spaced.

    If you want to interpolate a "dependent" variable over an "independent" variable, make the "independent" variable the Index of a Series and use method='index' (or method='values'; they're the same).

    In other words:

    # Assumes df already contains the 2100 row with a missing (NaN) cost
    (pd.Series(index=df['size'].values, data=df['cost'].values)  # Make size the independent variable (df.size is the element count, not the column)
        # order() is deprecated and sorted by values; sort_index() sorts by the index
        .sort_index()  # Sorts by the index, which is size in sq ft; interpolation depends on order (see OP)
        .interpolate(method='index')[2100])  # Interpolate using method 'index'
    

    This returns the correct answer, 168,000.

    This was not clear to me from the example in the Pandas documentation, where the Series' data and index are the same list of values.
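For readers on a current pandas release, the same approach might look like the sketch below. It assumes pandas 2.x, where DataFrame.append and Series.order no longer exist, so pd.concat and sort_index stand in for them. The names cost_by_size, by_position and by_index are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1400, 2400, 1800, 1900, 1300, 1100],
    "cost": [112000, 192000, 144000, 152000, 104000, 88000],
})

# Add the 2,100 sq ft house; its cost is missing, so it becomes NaN
df = pd.concat([df, pd.DataFrame({"size": [2100]})], ignore_index=True)

# Make size the index (independent variable) and sort by it
cost_by_size = df.set_index("size")["cost"].sort_index()

# method='linear' treats rows as equally spaced and ignores the index values
by_position = cost_by_size.interpolate(method="linear").loc[2100]

# method='index' uses the numeric index values (sq ft) as the x-axis
by_index = cost_by_size.interpolate(method="index").loc[2100]

print(by_position)  # 172000.0, the midpoint of the neighbouring costs
print(by_index)     # 168000.0, the answer the question expects
```

Keeping size as the index is what lets method='index' see the square footage; with a default integer index the two methods would give the same result.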