I have a dataframe with many columns containing different experimental data. I would like to create sub-dataframes/groups based on the values of one of the columns.
The column values look like this (just an example): [figure: plot of the column values]
What is the smartest way to do this? As you can see from the figure, since these are experimental data the column values show some variation; in this case I would like 5 main groups of values that are very close to each other, and I would also like to delete all the "transition" points between one value and the next.
I have tried the following:
import numpy as np
import pandas as pd

bin_range = []  # build the general bin edges
for i in np.linspace(1, 500, 150):
    bin_ = i
    for j in df["Vdot_out (hot), m3/h"].values:
        # shift the edge up if it would coincide with a data value
        if int(i) == int(j):
            bin_ = i + 1
    bin_range.append(bin_)

df['bins'] = pd.cut(x=df["Vdot_out (hot), m3/h"], bins=bin_range)
frequency = df['bins'].unique()
df['RN'] = df.groupby('bins').cumcount().add(1)
df['number of elements'] = df['bins'].map(df['bins'].value_counts())
df.drop(df[df['number of elements'] < 20].index, inplace=True)  # delete transition points
# consider the last 200 "stable" points
groups = df.groupby(df['bins'])
Now... this works sometimes, i.e. when I am lucky and the bin ranges happen to capture the 5 data groups without overlap. However, sometimes a bin range may be, say, (0-8) while I have column values equal to 3 and 5; then two groups end up merged into one bin. If I refine the range too much (i.e. increase the number of bins), I risk that values very close to each other (like 2.4 and 2.6) end up in different bins (bins 0-2.5 / 2.5-5 / 5-7.5). There should be some smarter and cleaner way to do this! Any help is much appreciated :)
From what I understand, you are looking for some kind of clustering algorithm. If you know that you are looking for exactly k clusters in n observations, you might try k-means, as seen in this gist.
# code copied from https://gist.github.com/lobrien/0f95b690644a862fb4dadb73afb7753b
from sklearn.cluster import KMeans
import numpy as np
data = np.array([101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569])
kmeans = KMeans(n_clusters=5).fit(data.reshape(-1,1))
kmeans.predict(data.reshape(-1,1))
# array([4, 4, 4, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 3,
# 3, 3, 3, 3], dtype=int32)
kmeans.cluster_centers_
#array([[ 337. ],
# [ 469.5 ],
# [ 203.75 ],
# [ 562.6 ],
# [ 104.66666667]])
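To apply this to your dataframe, you can fit the model on the column in question and group on the resulting labels. A minimal sketch, assuming the column name from your snippet and that k=5 is the right number of groups:

from sklearn.cluster import KMeans

# sklearn expects a 2D array of shape (n_samples, 1) for a single feature
values = df["Vdot_out (hot), m3/h"].to_numpy().reshape(-1, 1)
df["cluster"] = KMeans(n_clusters=5).fit_predict(values)

# one sub-dataframe per cluster
groups = {label: sub for label, sub in df.groupby("cluster")}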
Edit: about the "transition" points: from the example graph you showed, I would suggest using the mean and standard deviation of each group to determine the ranges to keep (e.g. µ ± 3σ, in case each bin's data follows a normal distribution).
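A minimal sketch of that filter, assuming the "cluster" labels from the snippet above (the column name and the 3-sigma threshold are just placeholders):

def drop_transition_points(group, n_sigma=3):
    # keep only points within mean ± n_sigma * std of their own group
    col = group["Vdot_out (hot), m3/h"]
    mu, sigma = col.mean(), col.std()
    return group[(col - mu).abs() <= n_sigma * sigma]

stable = df.groupby("cluster", group_keys=False).apply(drop_transition_points)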