I have a .csv file that I'm trying to modify with Azure Databricks using Python.
The file has two columns: timeOfMeasurement with the format 2021-01-04T07:07:45.098+0000, and eventProcessedUtcTime with the format 2021-01-04T07:07:45.6768856Z. I want to add new columns with these strings converted to datetime.
The following formatting works when I apply it to single string variables:
import datetime

# EventProcessedUtcTime: keep the first 26 characters so %f gets at most 6 fractional digits
a = '2021-01-04T07:07:45.6768856Z'
x = datetime.datetime.strptime(a[:26].strip(), '%Y-%m-%dT%H:%M:%S.%f')
print(x)
print(type(x))

# timeOfMeasurement: keep the first 23 characters to drop the +0000 offset
b = '2021-01-04T07:07:45.098+0000'
y = datetime.datetime.strptime(b[:23].strip(), '%Y-%m-%dT%H:%M:%S.%f')
print(y)
print(type(y))
and creates the following output:
2021-01-04 07:07:45.676885
<class 'datetime.datetime'>
2021-01-04 07:07:45.098000
<class 'datetime.datetime'>
But when I try to implement this for the whole dataframe:
import datetime
df = spark.read.format('csv').options(header='true', inferSchema='true').load('/mnt/data/values.csv')
df_adjusted = df.withColumn("timeOfMeasurementDatetime", datetime.datetime.strptime((df.timeOfMeasurement[:23]).strip(), '%Y-%m-%dT%H:%M:%S.%f'))
I get an error saying:
TypeError: startPos and length must be the same type. Got <class 'NoneType'> and <class 'int'>, respectively.
I have no idea what the problem is because I'm completely new to the whole subject and would be really happy if someone could help me.
You need to use PySpark functions for that instead of Python's built-in functions. You have the following possibilities:
.cast('timestamp') on the corresponding columns:

>>> df = spark.createDataFrame([('2021-01-04T07:07:45.098+0000',
                                  '2021-01-04T07:07:45.6768856Z')],
                               ['ts1','ts2'])
>>> df.show(truncate=False)
+----------------------------+----------------------------+
|ts1 |ts2 |
+----------------------------+----------------------------+
|2021-01-04T07:07:45.098+0000|2021-01-04T07:07:45.6768856Z|
+----------------------------+----------------------------+
>>> df.select(df.ts1.cast('timestamp'), df.ts2.cast('timestamp')).show(truncate=False)
+-----------------------+--------------------------+
|ts1 |ts2 |
+-----------------------+--------------------------+
|2021-01-04 08:07:45.098|2021-01-04 08:07:45.676885|
+-----------------------+--------------------------+
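Note that show() prints timestamps in the Spark session time zone, which is why the example above displays 08:07 rather than 07:07; the underlying instants are unchanged. Applied to the dataframe from the question, the same cast can be added as new columns with withColumn. A minimal sketch, assuming the path and column names from the question (the name eventProcessedUtcTimeDatetime is only illustrative):

from pyspark.sql import functions as F

# Load the CSV as in the question, then cast the string columns to timestamp columns
df = spark.read.format('csv').options(header='true', inferSchema='true').load('/mnt/data/values.csv')
df_adjusted = (df
    .withColumn('timeOfMeasurementDatetime', F.col('timeOfMeasurement').cast('timestamp'))
    .withColumn('eventProcessedUtcTimeDatetime', F.col('eventProcessedUtcTime').cast('timestamp')))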