pyspark, azure-databricks

[CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: `str`. Sometimes


In Databricks, we have a working notebook. Recently it has started failing intermittently after we added more data. The error is

[CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: str.

But it may succeed if we run it again, so we suspect it is caused by a memory issue.

The error occurs on the following lines:

activityraw = API_Request(cookie, urlPath='/activity',payload=params,method='get')
activityraw = spark.createDataFrame(activityraw)

May I have your advice: besides upgrading our server, is there anything we can do to reduce memory usage? Can we create the DataFrame using another library that consumes less memory?


Solution

  • I agree with @Steven. When the schema is specified as a pyspark.sql.types.DataType or as a datatype string, it must accurately match the actual data format.

    I have tried the following as an example:

    from pyspark.sql import types as T
    from pyspark.sql.functions import to_timestamp, col

    # Sample records in the shape returned by the API (matching the output below)
    api_data = [("123", "login", "2023-11-06T12:34:56"),
                ("124", "logout", "2023-11-06T13:00:00"),
                ("125", "login", "2023-11-06T14:20:30")]

    schema = T.StructType([
        T.StructField("user_id", T.StringType(), True),
        T.StructField("activity", T.StringType(), True),
        T.StructField("timestamp", T.StringType(), True)  # keep as StringType; parsed below
    ])
    activity_df = spark.createDataFrame(api_data, schema=schema)
    activity_df = activity_df.withColumn("timestamp", to_timestamp(col("timestamp")))
    activity_df.printSchema()
    activity_df.show()
    

    In the above code, the schema is defined explicitly and the DataFrame is created with that schema; the 'timestamp' column is then converted from StringType to TimestampType. The output is:

    
    root
     |-- user_id: string (nullable = true)
     |-- activity: string (nullable = true)
     |-- timestamp: timestamp (nullable = true)
    
    +-------+--------+-------------------+
    |user_id|activity|          timestamp|
    +-------+--------+-------------------+
    |    123|   login|2023-11-06T12:34:56|
    |    124|  logout|2023-11-06T13:00:00|
    |    125|   login|2023-11-06T14:20:30|
    +-------+--------+-------------------+
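
    For context, the [CANNOT_INFER_SCHEMA_FOR_TYPE] error for `str` typically appears when spark.createDataFrame receives plain strings (for example, an un-parsed JSON payload or a list of strings) rather than rows, which would also explain why it only fails for some API responses. Below is a minimal sketch of that failure mode and the fix, assuming the API sometimes returns its body as a raw JSON string:

    import json

    # If the API hands back the body as a raw JSON string...
    raw = '[{"user_id": "123", "activity": "login", "timestamp": "2023-11-06T12:34:56"}]'

    # ...passing it straight to createDataFrame makes Spark try to infer a schema
    # from str values and fail; parse it into a list of dicts (rows) first.
    records = json.loads(raw)
    activity_df = spark.createDataFrame(records, schema=schema)  # schema as defined above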
    
    
    

    As for your question about reducing memory usage, you can try the following:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("appformemory")
        # Arrow cuts serialization overhead when moving data between Python and the JVM
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .config("spark.driver.memory", "4g")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )
    

    Results:

    Arrow Optimization Enabled: true
    Driver Memory: 4g
    Executor Memory: 4g
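
    Those values can be read back from the active session to confirm the settings took effect, for example with spark.conf.get (a small sketch):

    # Print the settings from the running SparkSession
    print("Arrow Optimization Enabled:", spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))
    print("Driver Memory:", spark.conf.get("spark.driver.memory"))
    print("Executor Memory:", spark.conf.get("spark.executor.memory"))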