python cluster-analysis data-science data-partitioning

1D Clustering with categorical variables

I have logs of user operations that I am trying to analyse. For the analysis, I would like to learn whether a user is in page/navigation mode or in quiz mode (determined by which kind of operation is more prevalent). The mode is given by the frequency of the operations, as plotted in the following figure.

I would like to find, if they are present in the data, the boundaries where the phase changes. Of course there are always some outliers (e.g. consider the quiz point at 1452 in the figure).

I tried Jenks breaks for this: red are the breaks based on the navigation points, blue are the breaks based on the quiz points. I had to fix the number of bins in advance, which I set to 3, so this does not seem very satisfactory for my problem.

I also considered KDE, but there too I would not know how to perform the split.

What approach is there to split the above data, telling me that somewhere between 2011 and 2049 (i.e., the last point of navigation and the first point of quiz) there is a change in the phase, and that somewhere between 4189 and 4199 (the last point in quiz and the first point in navigation) there is another?

I am using Python for the data analysis (and pandas, numpy, etc.).

Solution

Use KDE, but think less in terms of k-means-style "splits" and more in terms of density.

If the density of state A is higher at some position, then the user is most likely in mode A there.

So just compare the densities. Try plotting the intervals over which the same mode has the majority density.
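A minimal sketch of that idea, using only the standard library so it runs anywhere. The data below is invented for illustration: it is only loosely shaped like the question (navigation up to 2011, quiz from 2049 to 4189 with one stray quiz point at 1452, navigation again from 4199). The function names, the Gaussian kernel, the bandwidth of 100 and the grid step of 1 are all assumptions, not something taken from the question. The kernel sums are deliberately left unnormalised, so a mode with more events nearby counts for more; `scipy.stats.gaussian_kde` or a similar library estimator could replace `kernel_sum` if normalised densities are preferred.

```python
import math

def kernel_sum(x, points, bandwidth):
    """Unnormalised Gaussian KDE at x: one Gaussian bump per observed point."""
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

def majority_intervals(nav, quiz, bandwidth, step=1.0):
    """Walk a regular grid and merge neighbouring grid points that share
    the same majority mode into (mode, start, end) intervals."""
    lo = min(min(nav), min(quiz))
    hi = max(max(nav), max(quiz))
    n_steps = int((hi - lo) / step) + 1
    intervals = []
    for i in range(n_steps):
        x = lo + i * step
        nav_density = kernel_sum(x, nav, bandwidth)
        quiz_density = kernel_sum(x, quiz, bandwidth)
        mode = "nav" if nav_density >= quiz_density else "quiz"
        if intervals and intervals[-1][0] == mode:
            intervals[-1][2] = x              # extend the current interval
        else:
            intervals.append([mode, x, x])    # a new phase starts here
    return [tuple(iv) for iv in intervals]

# Hypothetical event positions (made up, not the asker's data).
nav = list(range(1011, 2012, 20)) + list(range(4199, 5000, 20))
quiz = [1452] + list(range(2049, 4190, 20))   # 1452 is the lone outlier

intervals = majority_intervals(nav, quiz, bandwidth=100.0)
for mode, start, end in intervals:
    print(f"{mode:5s} {start:7.0f} .. {end:7.0f}")
```

The bandwidth is the one knob that matters: a single outlier contributes one bump, which cannot outweigh a dense run of the other mode as long as the bandwidth spans several neighbouring events. With a very small bandwidth the outlier would get its own short interval, so it is worth trying a few values and looking at the resulting intervals.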