
How to read a CSV without a header and assign column names while reading in PySpark?


100000,20160214,93374987
100000,20160214,1925301
100000,20160216,1896542
100000,20160216,84167419
100000,20160216,77273616
100000,20160507,1303015

I want to read a CSV file that has no column names in the first row. How can I read it and assign my own column names at the same time? For now, I just rename the default columns after reading, like this:

df = spark.read.csv("user_click_seq.csv",header=False)
df = df.withColumnRenamed("_c0", "member_srl")
df = df.withColumnRenamed("_c1", "click_day")
df = df.withColumnRenamed("_c2", "productid")

Is there a better way?


Solution

  • You can read the CSV file into a DataFrame with a predefined schema, which names and types the columns in one step. A schema is defined with the StructType and StructField objects. Assuming all of your columns hold integer data:

    from pyspark.sql.types import StructType, StructField, IntegerType
    
    schema = StructType([
        StructField("member_srl", IntegerType(), True),
        StructField("click_day", IntegerType(), True),
        StructField("productid", IntegerType(), True)])
    
    df = spark.read.csv("user_click_seq.csv", header=False, schema=schema)
    

    This should work.
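
As a possible alternative to the StructType schema above, here is a minimal sketch of two shorter routes: passing the schema as a DDL-formatted string, which `spark.read.csv` also accepts, and renaming all columns in one `toDF` call. The helper names (`build_ddl_schema`, `read_with_ddl_schema`, `read_and_rename`) and the `COLUMNS` list are hypothetical and only for illustration; `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical helpers (not from the original answer) for naming the columns
# of a headerless CSV at read time.

COLUMNS = ["member_srl", "click_day", "productid"]


def build_ddl_schema(columns, sql_type="INT"):
    """Return a DDL schema string such as 'a INT, b INT'."""
    return ", ".join(f"{name} {sql_type}" for name in columns)


def read_with_ddl_schema(spark, path, columns=COLUMNS):
    """Read a headerless CSV, naming and typing the columns via a DDL string."""
    return spark.read.csv(path, header=False, schema=build_ddl_schema(columns))


def read_and_rename(spark, path, columns=COLUMNS):
    """Read a headerless CSV, then rename _c0, _c1, ... in one toDF call.

    Without a schema or inferSchema=True, every column is read as a string.
    """
    return spark.read.csv(path, header=False).toDF(*columns)
```

For example, `df = read_with_ddl_schema(spark, "user_click_seq.csv")` should give the same column names and integer types as the StructType version, while `read_and_rename` only replaces the chain of `withColumnRenamed` calls and leaves the values as strings.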