
How to convert timestamp and datetime values to numbers before applying DBSCAN


I am preparing my dataset for DBSCAN clustering. Before doing so, I need to convert all my features to numbers so that I can use StandardScaler(). My problem is that I am struggling with the timestamp and datetime columns. I dropped the day and timestamp columns and kept only the Time column, in seconds, which appears to be an integer. However, I still get an error like:

X = StandardScaler().fit_transform(X)
TypeError: float() argument must be a string or a number, not 'Timestamp'

Thanks a lot in advance

 duration             float64
 power                float64
 duration_2           float64
 duration_2_energy    float64
 time2                int64
 dtype: object

Solution

  • Don't standard-scale everything. It is more often a bad idea than a good one, because you destroy information.

    Instead, read the article on generalized DBSCAN by the DBSCAN authors. It shows how to use more complex data correctly.

    Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998).
    Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications.
    Data Mining and Knowledge Discovery. Berlin: Springer-Verlag. 2 (2): 169–194. doi:10.1023/A:1009745219419.

    Here, you will probably want to use multiple epsilon thresholds: for example, one threshold on the time of day and an additional threshold on the numeric attributes.
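One way to approximate the multiple-threshold idea with scikit-learn is sketched below. It is not the GDBSCAN implementation from the paper. It simply encodes a two-threshold neighborhood test as a 0/1 "distance" matrix and passes it to DBSCAN with metric="precomputed". The sample data, the column names (timestamp, power), and the threshold values are assumptions made up for illustration. The dense pairwise matrices also mean this only suits small datasets.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 08:00:00", "2020-01-02 08:05:00", "2020-01-03 08:10:00",
        "2020-01-01 20:00:00", "2020-01-02 20:03:00", "2020-01-03 23:58:00",
    ]),
    "power": [1.0, 1.1, 0.9, 5.0, 5.2, 1.0],
})

# Convert the Timestamp column to a plain number: seconds since midnight.
ts = df["timestamp"]
seconds = (ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second).to_numpy()
power = df["power"].to_numpy()

# Pairwise time-of-day difference, wrapped around midnight so that
# 23:58 and 00:02 count as 4 minutes apart.
day = 24 * 3600
dt = np.abs(seconds[:, None] - seconds[None, :])
dt = np.minimum(dt, day - dt)

# Pairwise difference on the numeric attribute, in its original units.
dp = np.abs(power[:, None] - power[None, :])

# Two separate thresholds (assumed values): both must hold for two
# points to be neighbors.
eps_time = 15 * 60   # 15 minutes
eps_power = 0.5
neighbors = (dt <= eps_time) & (dp <= eps_power)

# Encode the predicate as a distance: 0 for neighbors, 1 otherwise.
# With eps=0.5, DBSCAN treats exactly the 0 entries as neighbors.
dist = np.where(neighbors, 0.0, 1.0)
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)
```

Because each attribute keeps its own threshold in its own units, no scaling step is needed, and the time column never reaches the clustering code as a Timestamp object.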