pyspark, azure-databricks

[CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: `str`. Sometimes


In Databricks, we have a working notebook. Recently it has started failing intermittently after we added more data. The error is

[CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: str.

But it may succeed if we run it again, so we suspect it is caused by a memory issue.

The error occurs on the following lines:

activityraw = API_Request(cookie, urlPath='/activity',payload=params,method='get')
activityraw = spark.createDataFrame(activityraw)

May I have your advice: besides upgrading our server, is there anything we can do to reduce memory usage? Can we create the DataFrame using another library that consumes less memory?


Solution

  • I agree with @Steven. When the schema is specified as a pyspark.sql.types.DataType or as a datatype string, it must accurately match the actual data format.

    I have tried the following as an example:

    from pyspark.sql import types as T
    from pyspark.sql.functions import to_timestamp, col

    # Sample records in the shape returned by the API (matching the output below)
    api_data = [("123", "login", "2023-11-06T12:34:56"),
                ("124", "logout", "2023-11-06T13:00:00"),
                ("125", "login", "2023-11-06T14:20:30")]

    schema = T.StructType([
        T.StructField("user_id", T.StringType(), True),
        T.StructField("activity", T.StringType(), True),
        T.StructField("timestamp", T.StringType(), True)  # keep as StringType; parsed below
    ])
    activity_df = spark.createDataFrame(api_data, schema=schema)
    activity_df = activity_df.withColumn("timestamp", to_timestamp(col("timestamp")))
    activity_df.printSchema()
    activity_df.show()
    

    In the above code, the schema is defined explicitly and the DataFrame is created with that schema; the 'timestamp' column is then converted from StringType to TimestampType. The output is:

    
    root
     |-- user_id: string (nullable = true)
     |-- activity: string (nullable = true)
     |-- timestamp: timestamp (nullable = true)
    
    +-------+--------+-------------------+
    |user_id|activity|          timestamp|
    +-------+--------+-------------------+
    |    123|   login|2023-11-06T12:34:56|
    |    124|  logout|2023-11-06T13:00:00|
    |    125|   login|2023-11-06T14:20:30|
    +-------+--------+-------------------+
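
    For context, the [CANNOT_INFER_SCHEMA_FOR_TYPE] error for `str` typically appears when spark.createDataFrame receives plain strings (for example, an un-parsed JSON payload or a list of strings) rather than rows, which would also explain why it only fails for some API responses. Below is a minimal sketch of that failure mode and the fix, assuming the API sometimes returns its body as a raw JSON string:

    import json

    # If the API hands back the body as a raw JSON string...
    raw = '[{"user_id": "123", "activity": "login", "timestamp": "2023-11-06T12:34:56"}]'

    # ...passing it straight to createDataFrame makes Spark try to infer a schema
    # from str values and fail; parse it into a list of dicts (rows) first.
    records = json.loads(raw)
    activity_df = spark.createDataFrame(records, schema=schema)  # schema as defined above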
    
    
    

    As for your question about reducing memory usage, you can try the following:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("appformemory")
        # Arrow cuts serialization overhead when moving data between Python and the JVM
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .config("spark.driver.memory", "4g")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )
    

    Results:

    Arrow Optimization Enabled: true
    Driver Memory: 4g
    Executor Memory: 4g
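
    Those values can be read back from the active session to confirm the settings took effect, for example with spark.conf.get (a small sketch):

    # Print the settings from the running SparkSession
    print("Arrow Optimization Enabled:", spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))
    print("Driver Memory:", spark.conf.get("spark.driver.memory"))
    print("Executor Memory:", spark.conf.get("spark.executor.memory"))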