
"startPos and length must be the same type" error when trying to add a column in a dataframe


I have a .csv file that I'm trying to modify with Azure Databricks using Python.

The file has two columns: timeOfMeasurement, with the format 2021-01-04T07:07:45.098+0000, and eventProcessedUtcTime, with the format 2021-01-04T07:07:45.6768856Z. I want to add new columns with these strings converted to datetime.

The following formatting works when I use single variables:

import datetime

# EventProcessedUtcTime
a = '2021-01-04T07:07:45.6768856Z'
x = datetime.datetime.strptime((a[:26]).strip(), '%Y-%m-%dT%H:%M:%S.%f')
print(x)
print(type(x))

# timeOfMeasurement
b = '2021-01-04T07:07:45.098+0000'
y = datetime.datetime.strptime((b[:23]).strip(), '%Y-%m-%dT%H:%M:%S.%f')
#y = (b[:23]).strip()
print(y)
print(type(y))

and creates the following output:

2021-01-04 07:07:45.676885
<class 'datetime.datetime'>
2021-01-04 07:07:45.098000
<class 'datetime.datetime'>

But when I try to implement this for the whole dataframe

  import datetime
  df = spark.read.format('csv').options(header='true', inferSchema='true').load('/mnt/data/values.csv')
  df_adjusted = df.withColumn("timeOfMeasurementDatetime", datetime.datetime.strptime((df.timeOfMeasurement[:23]).strip(), '%Y-%m-%dT%H:%M:%S.%f'))

I get an error saying:

TypeError: startPos and length must be the same type. Got <class 'NoneType'> and <class 'int'>, respectively.

I have no idea what the problem is because I'm completely new to the whole subject and would be really happy if someone could help me.


Solution

  • You need to use PySpark functions for that instead of Python's built-in functions (the error itself comes from the slice df.timeOfMeasurement[:23]: slicing a PySpark Column translates into Column.substr(None, 23), whose two arguments must be of the same type). You have the following possibilities:

    • if your strings are in the ISO 8601 time format, then you can just do .cast('timestamp') on the corresponding columns:
    >>> df = spark.createDataFrame([('2021-01-04T07:07:45.098+0000',  
                                     '2021-01-04T07:07:45.6768856Z')], 
                                   ['ts1','ts2'])
    >>> df.show(truncate=False)
    +----------------------------+----------------------------+
    |ts1                         |ts2                         |
    +----------------------------+----------------------------+
    |2021-01-04T07:07:45.098+0000|2021-01-04T07:07:45.6768856Z|
    +----------------------------+----------------------------+
    >>> df.select(df.ts1.cast('timestamp'), df.ts2.cast('timestamp')).show(truncate=False)
    +-----------------------+--------------------------+
    |ts1                    |ts2                       |
    +-----------------------+--------------------------+
    |2021-01-04 08:07:45.098|2021-01-04 08:07:45.676885|
    +-----------------------+--------------------------+
    
    • if your strings have some custom format, you can use built-in functions like to_timestamp and provide a format string, as sketched below.
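
      A minimal sketch, assuming the same ts1 column as above and Spark 3.x datetime
      patterns (the exact pattern letters, e.g. the trailing Z for the +0000 offset,
      may need adjusting for your Spark version). As with the cast above, the parsed
      timestamps are displayed in the session time zone, which is why 07:07 UTC shows
      up as 08:07:
    >>> from pyspark.sql import functions as F
    >>> # 'ts1_parsed' is a hypothetical output column name; parse
    >>> # '2021-01-04T07:07:45.098+0000' with an explicit pattern
    >>> # instead of relying on the default ISO parsing of cast('timestamp')
    >>> df.withColumn('ts1_parsed', F.to_timestamp(df.ts1, "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))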