I'm trying to write a k-means algorithm from scratch. Suppose I have the following dataframe...
df =
   a  b       c
0  1  4  [1, 2]
1  2  5  [1, 2]
2  3  6  [1, 2]
...where c represents the coordinates of a centroid, and I want to calculate the row-wise Euclidean distance between, for example, point (a, b) and centroid (1, 2). I want to replace column c with the point-to-centroid distance for each row.
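For reference, here is a minimal sketch that reproduces the dataframe above, assuming the centroid is stored as the same [1, 2] list in every row:

import pandas as pd

# Minimal reproduction of the example dataframe:
# a and b are point coordinates, c holds the centroid [1, 2].
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [[1, 2]] * 3,
})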
I have the following code:
df['c'].apply(lambda x: ((x[0]-df['a'])**2 + (x[1]-df['b'])**2)**0.5)
I expect it to return a 1-dimensional vector (Series) of length len(df):
0    2.000000
1    3.162278
2    4.472136
dtype: float64
But it returns a dataframe instead:
     0         1         2
0  2.0  3.162278  4.472136
1  2.0  3.162278  4.472136
2  2.0  3.162278  4.472136
What is the cause of this behavior? How do I accomplish what I'm trying to do?
This is happening because of the way you use df['a'] and df['b'] in the lambda. They don't refer to the values in the same row as each element of df['c']; they refer to the entire column Series. So the lambda returns a Series of 3 distances for every row, and apply() stacks those Series into the 3 columns of a dataframe.
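You can see this by evaluating the lambda body by hand for a single element of c (a quick sketch against the example dataframe above):

x = df.loc[0, 'c']   # one centroid, x == [1, 2]
((x[0] - df['a'])**2 + (x[1] - df['b'])**2)**0.5
# 0    2.000000
# 1    3.162278
# 2    4.472136
# dtype: float64
# df['a'] and df['b'] are whole columns, so this is a length-3 Series;
# apply() produces one such Series per row of df['c'] and stacks them into a dataframe.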
You can apply a function to the entire row, rather than just a single column, by specifying axis=1 when calling df.apply():
df.apply(lambda x: ((x['c'][0]-x['a'])**2 + (x['c'][1]-x['b'])**2)**0.5, axis=1)
0    2.000000
1    3.162278
2    4.472136
dtype: float64
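Since you want to replace column c with the distance, assign the result back. On larger frames a vectorized version avoids calling Python once per row; the alternative shown below is a sketch that assumes every entry of c is a 2-element list:

# Overwrite c with the point-to-centroid distance, as described in the question.
df['c'] = df.apply(lambda x: ((x['c'][0] - x['a'])**2 + (x['c'][1] - x['b'])**2)**0.5, axis=1)

# Vectorized alternative (run instead of the line above, not after it,
# since it reads the centroids from c):
# import numpy as np
# cx, cy = np.array(df['c'].tolist()).T
# df['c'] = np.hypot(df['a'] - cx, df['b'] - cy)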