
"startPos and length must be the same type" error when trying to add a column in a dataframe


I have a .csv file that I'm trying to modify with Azure Databricks using Python.

The file has two columns: timeOfMeasurement, with the format 2021-01-04T07:07:45.098+0000, and eventProcessedUtcTime, with the format 2021-01-04T07:07:45.6768856Z. I want to add new columns with these strings converted to datetime.

The following formatting works when I use single variables:

import datetime

# EventProcessedUtcTime
a = '2021-01-04T07:07:45.6768856Z'
x = datetime.datetime.strptime((a[:26]).strip(), '%Y-%m-%dT%H:%M:%S.%f')
print(x)
print(type(x))

# timeOfMeasurement
b = '2021-01-04T07:07:45.098+0000'
y = datetime.datetime.strptime((b[:23]).strip(), '%Y-%m-%dT%H:%M:%S.%f')
#y = (b[:23]).strip()
print(y)
print(type(y))

and creates the following output:

2021-01-04 07:07:45.676885
<class 'datetime.datetime'>
2021-01-04 07:07:45.098000
<class 'datetime.datetime'>

But when I try to implement this for the whole dataframe

  import datetime
  df = spark.read.format('csv').options(header='true', inferSchema='true').load('/mnt/data/values.csv')
  df_adjusted = df.withColumn("timeOfMeasurementDatetime", datetime.datetime.strptime((df.timeOfMeasurement[:23]).strip(), '%Y-%m-%dT%H:%M:%S.%f'))

I get an error saying:

TypeError: startPos and length must be the same type. Got <class 'NoneType'> and <class 'int'>, respectively.

I have no idea what the problem is because I'm completely new to the whole subject and would be really happy if someone could help me.


Solution

  • You need to use PySpark functions for that instead of Python's built-in functions (the error itself comes from the slice df.timeOfMeasurement[:23]: slicing a PySpark Column translates into Column.substr(None, 23), whose two arguments must be of the same type). You have the following possibilities:

    • if your strings are in the ISO 8601 time format, then you can just do .cast('timestamp') on the corresponding columns:
    >>> df = spark.createDataFrame([('2021-01-04T07:07:45.098+0000',  
                                     '2021-01-04T07:07:45.6768856Z')], 
                                   ['ts1','ts2'])
    >>> df.show(truncate=False)
    +----------------------------+----------------------------+
    |ts1                         |ts2                         |
    +----------------------------+----------------------------+
    |2021-01-04T07:07:45.098+0000|2021-01-04T07:07:45.6768856Z|
    +----------------------------+----------------------------+
    >>> df.select(df.ts1.cast('timestamp'), df.ts2.cast('timestamp')).show(truncate=False)
    +-----------------------+--------------------------+
    |ts1                    |ts2                       |
    +-----------------------+--------------------------+
    |2021-01-04 08:07:45.098|2021-01-04 08:07:45.676885|
    +-----------------------+--------------------------+
    
    • if your strings have some custom format, you can use built-in functions like to_timestamp and provide a format string, as sketched below.
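
      A minimal sketch, assuming the same ts1 column as above and Spark 3.x datetime
      patterns (the exact pattern letters, e.g. the trailing Z for the +0000 offset,
      may need adjusting for your Spark version). As with the cast above, the parsed
      timestamps are displayed in the session time zone, which is why 07:07 UTC shows
      up as 08:07:
    >>> from pyspark.sql import functions as F
    >>> # 'ts1_parsed' is a hypothetical output column name; parse
    >>> # '2021-01-04T07:07:45.098+0000' with an explicit pattern
    >>> # instead of relying on the default ISO parsing of cast('timestamp')
    >>> df.withColumn('ts1_parsed', F.to_timestamp(df.ts1, "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))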