Tags: dataframe, csv, pyspark

Spark - How to read a CSV file correctly?


I have a CSV file that looks like this:

Title,Description,Authors

Dr. Seuss: American Icon,"Philip Nel takes a fascinating look into the key aspects of Seuss's career - his poetry, politics, art, marketing, and place in the popular imagination."" ""Nel argues convincingly that Dr. Seuss is one of the most influential poets in America. His nonsense verse, like that of Lewis Carroll and Edward Lear, has changed language itself, giving us new words like ""nerd."" And Seuss's famously loopy artistic style - what Nel terms an ""energetic cartoon surrealism"" - has been equally important, inspiring artists like filmmaker Tim Burton and illustrator Lane Smith. --from back cover",['Philip Nel']

I read this file into a Spark DataFrame:

df_fake= spark.read.option("header","true").csv("C:\\Users\\KhanhDV8\\Desktop\\fake.csv")
df_fake.show()

and I want this DataFrame:

Title Description Authors
Dr. Seuss: American Icon "Philip Nel takes a fascinating look into the key aspects of Seuss's career - his poetry, politics, art, marketing, and place in the popular imagination."" ""Nel argues convincingly that Dr. Seuss is one of the most influential poets in America. His nonsense verse, like that of Lewis Carroll and Edward Lear, has changed language itself, giving us new words like ""nerd."" And Seuss's famously loopy artistic style - what Nel terms an ""energetic cartoon surrealism"" - has been equally important, inspiring artists like filmmaker Tim Burton and illustrator Lane Smith. --from back cover" ['Philip Nel']

But the result is:

Title Description Authors
Dr. Seuss: American Icon "Philip Nel takes a fascinating look into the key aspects of Seuss's career - his poetry, politics, art, marketing, and place in the popular imagination."" ""Nel argues convincingly that Dr. Seuss is one of the most influential poets in America. His nonsense verse like that of Lewis Carroll and Edward Lear

Is there any way to handle this case?

I have no idea how to handle this. The data comes from a large CSV file (3 million records). Most records with a short or null "Description" are read correctly; the others are parsed into the wrong columns.


Solution

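  • Spark splits the row in the wrong place because the Description field contains double quotes that are escaped by doubling them (""), while Spark's CSV reader uses a backslash as its default escape character. Add .option("quote", "\"").option("escape", "\"") to the reader so that "" inside a quoted field is read as a literal quote.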
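
For reference, here is a minimal sketch of how the options named in the answer could be applied. It assumes an existing SparkSession called spark, as in the question; the names CSV_OPTIONS and read_quoted_csv are hypothetical helpers introduced only for illustration.

```python
# Options for a CSV file that escapes quotes by doubling them ("").
CSV_OPTIONS = {
    "header": "true",  # the first line holds the column names
    "quote": '"',      # fields containing commas are wrapped in double quotes
    "escape": '"',     # a quote inside a quoted field is written as ""
}


def read_quoted_csv(spark, path):
    """Read `path` with an existing SparkSession, treating "" as an escaped quote.

    `spark` is assumed to be a pyspark.sql.SparkSession created elsewhere.
    """
    return spark.read.options(**CSV_OPTIONS).csv(path)
```

With the session from the question, read_quoted_csv(spark, "C:\\Users\\KhanhDV8\\Desktop\\fake.csv").show(truncate=False) should then keep the whole Description in one column, although this has not been checked against the full 3-million-record file.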