python apache-spark pyspark pyspark-pandas

pyspark.pandas: Converting float64 column to TimedeltaIndex


I want to convert a numeric column that represents a timedelta in seconds to a ps.TimedeltaIndex (so that I can resample the dataset later).

import pyspark.pandas as ps

df = ps.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})
df.set_index(ps.to_timedelta(df.time, "s").to_numpy())

KeyError: '2000000000 nanoseconds'

I don't understand why this doesn't work.


Solution

  • The answer from @koedlt put me on the right track, but it was still missing the conversion to a TimedeltaIndex:

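    import pyspark.pandas as ps

    df = ps.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})
    # Convert the float seconds to timedeltas, then use that column as the index
    df["time"] = ps.to_timedelta(df.time, unit="s")
    df.set_index("time", inplace=True)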

    However, I also realised that the resample I mentioned actually requires a DatetimeIndex, so I should have asked for that. In that case, use ps.to_datetime(df.time, unit="s") instead of ps.to_timedelta.
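
A minimal sketch of that DatetimeIndex variant. It is written with plain pandas rather than pyspark.pandas so that it runs without a Spark session. The answer above says ps.to_datetime takes the same unit="s" argument. Whether ps.DataFrame.resample accepts the same "2s" rule is an assumption to check against the pyspark.pandas documentation. The 2-second bin width and the mean aggregation are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})

# Interpret the float seconds as offsets from the Unix epoch (1970-01-01)
df["time"] = pd.to_datetime(df["time"], unit="s")

# Use the datetime column as the index; this yields a DatetimeIndex
df = df.set_index("time")

# Illustrative resample: average x over 2-second bins
resampled = df.resample("2s").mean()
print(resampled)
```

The timestamps land on 1970-01-01 because the seconds are counted from the epoch. If only the relative spacing matters for resampling, that arbitrary origin is harmless.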