In Databricks, we have a notebook that used to work. Recently it has started failing intermittently after we added more data. The error is:
[CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: str.
However, it may succeed if we run it again, so we suspect it is caused by a memory issue.
The error occurs on the lines below:
activityraw = API_Request(cookie, urlPath='/activity', payload=params, method='get')
activityraw = spark.createDataFrame(activityraw)
Could you advise whether, besides upgrading our server, there is anything I can do to reduce memory usage? Can we create the DataFrame with another library that consumes less memory?
I agree with @Steven. When the schema is specified as a pyspark.sql.types.DataType or as a datatype string, it must accurately match the actual data format.
I have tried the below as an example:
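from pyspark.sql import types as T
from pyspark.sql.functions import col, to_timestamp

# Define the schema explicitly instead of relying on inference
schema = T.StructType([
    T.StructField("user_id", T.StringType(), True),
    T.StructField("activity", T.StringType(), True),
    T.StructField("timestamp", T.StringType(), True),  # Use StringType for datetime to parse later
])
# api_data is the list of records returned by the API
activity_df = spark.createDataFrame(api_data, schema=schema)
activity_df = activity_df.withColumn("timestamp", to_timestamp(col("timestamp")))
activity_df.printSchema()
activity_df.show()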
The code above defines the schema explicitly, creates the DataFrame with that schema, and then converts 'timestamp' from StringType to TimestampType. The output is:
root
|-- user_id: string (nullable = true)
|-- activity: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
+-------+--------+-------------------+
|user_id|activity| timestamp|
+-------+--------+-------------------+
| 123| login|2023-11-06T12:34:56|
| 124| logout|2023-11-06T13:00:00|
| 125| login|2023-11-06T14:20:30|
+-------+--------+-------------------+
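Because the failure is intermittent, one possibility (an assumption, since the return type of API_Request is not shown) is that the API occasionally returns a string, such as an error message or raw JSON text, instead of a list of records. Passing a string or a list of strings to spark.createDataFrame is a typical way to hit CANNOT_INFER_SCHEMA_FOR_TYPE for str. A minimal sketch of a check to run before creating the DataFrame, where to_records is a hypothetical helper name:

```python
import json


def to_records(raw):
    """Return a list of dict records, or raise a descriptive error.

    Hypothetical helper: assumes the API is expected to return a list of
    dicts, possibly serialized as JSON text.
    """
    # If the API handed back text, try to parse it as JSON first.
    if isinstance(raw, (str, bytes)):
        try:
            raw = json.loads(raw)
        except ValueError as exc:
            # Plain-text error pages or messages end up here.
            raise ValueError(f"API returned non-JSON text: {raw[:200]!r}") from exc
    # Anything other than a list of dicts cannot be mapped onto the schema.
    if not isinstance(raw, list) or not all(isinstance(r, dict) for r in raw):
        raise TypeError(f"Expected a list of dicts, got {type(raw).__name__}")
    return raw


# Usage in the notebook (requires an active Spark session):
# activityraw = to_records(API_Request(cookie, urlPath='/activity', payload=params, method='get'))
# activity_df = spark.createDataFrame(activityraw, schema=schema)
```

If this check raises on the failing runs, the problem is the API response rather than memory, and retrying or logging the response would be the next step.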
You also mentioned that you want to reduce memory usage.
You can try the following:
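from pyspark.sql import SparkSession

# Note: in a Databricks notebook the session already exists, so getOrCreate() returns it;
# driver and executor memory generally have to be set in the cluster's Spark config to take effect.
spark = (
    SparkSession.builder
    .appName("appformemory")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)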
Results:
Arrow Optimization Enabled: true
Driver Memory: 4g
Executor Memory: 4g
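If the response really is large, another option is to build the DataFrame in smaller batches rather than all at once. This is only a sketch under assumptions: batched is a hypothetical helper, the batch size is arbitrary, and it helps most when the API can be read page by page so the full result never sits in driver memory.

```python
from itertools import islice


def batched(records, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Usage in the notebook (requires an active Spark session; the table name is hypothetical):
# for batch in batched(activityraw, 10_000):
#     spark.createDataFrame(batch, schema=schema).write.mode("append").saveAsTable("activity_raw")
```

Each batch is written and released before the next one is converted, which keeps the amount of data held by the driver per createDataFrame call bounded by the batch size.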