I've got a .csv file with | delimiter and the following command fits me, exept for some records which contain delimiter inside a name of an artist
val df_artist = spark.read.format("com.databricks.spark.csv").option("header", "false").option("delimiter", "|").option("mode", "DROPMALFORMED").load("./artists.csv");
ordinary:
ARVYNKF1272BA83168|TRAGGZG128F92F9B5E|Black Devil Disco Club
irregular:
ARQJXPL1187B9B4D6B|TRWUJAE128F92E3E7D|M|A|R|R|S
ARZCIDG1187B992CE3||Kid Creole | The Coconuts
is it possible to make first two columns and everything what will be left put to third column, ignoring delimiters?
I use Spark 2.1.1, if it is important.
I would suggest to use sparkContext
textFile
to read the csv
file and split
the line with |
delimiter and choose first two strings as first two columns and the rest as the third column. The working code for this is
val data = sc.textFile("./artists.csv")
data.map(line => line.split("\\|"))
.map(array => (array(0), array(1), array.drop(2)))
.toDF("rowId", "ticketId", "movies")
.show(false)
Given, the input file contains data as
ARVYNKF1272BA83168|TRAGGZG128F92F9B5E|Black Devil Disco Club
ARQJXPL1187B9B4D6B|TRWUJAE128F92E3E7D|M|A|R|R|S
ARZCIDG1187B992CE3||Kid Creole | The Coconuts
Output of above code would be
+------------------+------------------+----------------------------+
|rowId |ticketId |movies |
+------------------+------------------+----------------------------+
|ARVYNKF1272BA83168|TRAGGZG128F92F9B5E|[Black Devil Disco Club] |
|ARQJXPL1187B9B4D6B|TRWUJAE128F92E3E7D|[M, A, R, R, S] |
|ARZCIDG1187B992CE3| |[Kid Creole , The Coconuts]|
+------------------+------------------+----------------------------+