Tags: python, pandas, lambda, apply

Why is Series.apply() returning a dataframe instead of a series?


I'm trying to write a k-means algorithm from scratch. Suppose I have the following dataframe...

df = 
    a   b   c
0   1   4   [1, 2]
1   2   5   [1, 2]
2   3   6   [1, 2]

... where c holds the coordinates of a centroid. I want to calculate, row-wise, the Euclidean distance between the point (a, b) and the centroid in c (here (1, 2)), and then replace column c with that point-to-centroid distance for each row.
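A frame like the one above can be built with, for example (the question doesn't show the construction, so this is just one way to get equivalent data):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6],
                   'c': [[1, 2]] * 3})   # every row uses the same centroid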

I have the following code:

df['c'].apply(lambda x: ((x[0]-df['a'])**2 + (x[1]-df['b'])**2)**0.5)

I expect it to return a 1-dimensional vector (Series) of length len(df):

0    2.000000
1    3.162278
2    4.472136
dtype: float64

But it returns a dataframe instead:

    0   1           2
0   2.0 3.162278    4.472136
1   2.0 3.162278    4.472136
2   2.0 3.162278    4.472136

What is the cause of this behavior? How do I accomplish what I'm trying to do?


Solution

  • This happens because of the way you use df['a'] and df['b'] inside the lambda. They don't refer to the values in the same row as each element of df['c']; they refer to the entire column Series. Each call of the lambda therefore returns a whole Series of 3 distances, and apply stacks those Series row by row, which is why the result is a DataFrame with 3 columns.

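    You can see this by evaluating by hand what a single call of the lambda produces. For the first centroid, x is [1, 2], and the expression already yields a full Series because df['a'] and df['b'] are whole columns:

    ((1 - df['a'])**2 + (2 - df['b'])**2)**0.5
    0    2.000000
    1    3.162278
    2    4.472136
    dtype: float64
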
    Instead, apply the function to the entire row rather than to a single column, by specifying axis=1 when calling df.apply():

    df.apply(lambda x: ((x['c'][0]-x['a'])**2 + (x['c'][1]-x['b'])**2)**0.5, axis=1)
    0    2.000000
    1    3.162278
    2    4.472136
    dtype: float64
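
    If you want to avoid apply altogether, the same distance can be computed in a vectorised way. This is a sketch that assumes every entry of c is a two-element list, as in the example data:

    import numpy as np

    cent = np.array(df['c'].tolist())   # shape (len(df), 2): one centroid per row
    df['c'] = np.sqrt((cent[:, 0] - df['a'])**2 + (cent[:, 1] - df['b'])**2)

    This also replaces column c in place, which is what the question asks for.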