Tags: python, apache-spark, azure-databricks, spark-notebook

Apache Spark unable to recognize columns in UTF-16 csv file


Question: Why am I getting the following error on the last line of the code below, and how can the issue be resolved?

AttributeError: 'DataFrame' object has no attribute 'OrderID'

CSV File encoding: UTF-16 LE BOM

Number of columns: 150

Rows: 5000

Language etc.: Python, Apache Spark, Azure-Databricks

MySampleDataFile.txt:

FirstName~LastName~OrderID~City~.....
Kim~Doe~1234~New York~...............
Bob~Mason~456~Seattle~...............
..................
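
For anyone trying to reproduce this locally, here is a minimal sketch that writes a small UTF-16 LE file with a BOM, matching the encoding described above (the path and the three sample rows are illustrative, not from the original dataset):

import codecs

rows = [
    "FirstName~LastName~OrderID~City",
    "Kim~Doe~1234~New York",
    "Bob~Mason~456~Seattle",
]
# Write the byte-order mark explicitly, then the body encoded as UTF-16 LE
# (the "utf-16-le" codec itself does not emit a BOM, so no second BOM appears).
with open("/tmp/MySampleDataFile.txt", "wb") as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write("\n".join(rows).encode("utf-16-le"))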

Code sample:

from pyspark.sql.types import DoubleType

df = (
    spark.read.option("encoding", "UTF-16LE")
    .option("multiLine", "true")
    .csv(
        "abfss://[email protected]/myFolder/MySampleDataFile.txt",
        sep="~",
        escape="\"",
        header="true",
        inferSchema="false",
    )
)

display(df.limit(4))
df1 = df.withColumn("OrderID", df.OrderID.cast(DoubleType()))

Output of display(df.limit(4)): it successfully displays the content of df in tabular format with column headers, similar to the example below:

-----------------------------------------
|FirstName|LastName|OrderID|City    |...|
-----------------------------------------
|Kim      |Doe     |1234   |New York|...|
|Bob      |Mason   |456    |Seattle |...|
|...      |...     |...    |...     |...|
-----------------------------------------
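
Before attempting the cast, it can help to check how Spark actually parsed the header, since a stray BOM or an unsplit header line shows up directly in the column names. A quick diagnostic sketch, using the df read above:

print(df.columns)   # expect ['FirstName', 'LastName', 'OrderID', ...]
df.printSchema()    # every column is a string here, since inferSchema="false"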

Solution

  • AttributeError: 'DataFrame' object has no attribute 'OrderID'

    How can the issue be resolved?

    You can change the data type in the following way, using bracket notation:

    df1 = df.withColumn("OrderID", df["OrderID"].cast(DoubleType()))
    

    Or, alternatively, use pyspark.sql.functions.col, which returns a Column for the given column name:

    from pyspark.sql.functions import col  
    df1 = df.withColumn("OrderID", col("OrderID").cast(DoubleType()))
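
    As a side note on why bracket notation and col() are generally safer than attribute access: they also resolve column names that attribute access cannot reach, such as names containing spaces or names that collide with DataFrame methods. A small illustrative sketch (the toy column names below are hypothetical, not from the question's file):

    from pyspark.sql.functions import col

    df2 = spark.createDataFrame([(1, 2.0)], ["Order ID", "count"])
    df2.select(col("Order ID")).show()   # works despite the space in the name
    df2.select(df2["count"]).show()      # works even though "count" shadows DataFrame.count
    # df2.select(df2.count)              # would fail: resolves to the count() method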