I have a spark dataframe TABLE1 with one column with 100000 rows each contains a string of the identical length
AA105LONDEN 03162017045262017 16953563ABCDEF
and I would like to separate each row into multiple columns based on the lines separated below including the white space as separate columns.
AA|105|LONDEN| |0316|2017|04526|2017| |1695|3563|ABCDEF
Currently I am trying to use the DBI package as with code that looks like the following:
library(DBI)
newdataframe <- dbGetQuery(sc,"INSERT INTO TABLE2
(COLUMN1, COLUMN2, COLUMN3, COLUMN4)
SELECT SUBSTR(TRIM(COLUMN_NAME),1,3),
SUBSTR(TRIM(COLUMN_NAME),4,8),
SUBSTR(TRIM(COLUMN_NAME),9,12),
SUBSTR(TRIM(COLUMN_NAME),12,15)
FROM TABLE1
WHERE COLUMN_NAME IS NOT NULL")
I have a spark dataframe TABLE1 with one column with 100000 rows each contains a string of the identical length
AA105LONDEN 03162017045262017 16953563ABCDEF
and I would like to separate each row into multiple columns based on the lines separated below including the white space as separate columns.
AA|105|LONDEN| |0316|2017|04526|2017| |1695|3563|ABCDEF
Currently I am trying to use the DBI package as with code that looks like the following:
library(DBI)
newdataframe <- dbGetQuery(sc,"INSERT INTO TABLE2
(COLUMN1, COLUMN2, COLUMN3, COLUMN4)
SELECT SUBSTR(TRIM(COLUMN_NAME),1,3),
SUBSTR(TRIM(COLUMN_NAME),4,8),
SUBSTR(TRIM(COLUMN_NAME),9,12),
SUBSTR(TRIM(COLUMN_NAME),12,15)
FROM TABLE1
WHERE COLUMN_NAME IS NOT NULL")
This however does not seem to work. In addition even if it did it just returns a R dataframe and I need it to work in a spark dataframe. I am looking for an alternative suggestion other then SQL or for the proper syntax that will work with spark and return a parsed Spark dataframe that I can then do ML on. Any ideas.
Thanks in advance.
Us this to create Dataframe. It will give correct result.
sqlContext.sql("select substr(COLUMN_NAME,1,2),
substr(COLUMN_NAME,3,3),
substr(COLUMN_NAME,6,6),
substr(COLUMN_NAME,12,4),
substr(COLUMN_NAME,16,4),
substr(COLUMN_NAME,20,4),
substr(COLUMN_NAME,24,5) ,
substr(COLUMN_NAME,29,4) ,
substr(COLUMN_NAME,33,1) ,
substr(COLUMN_NAME,34,4) ,
substr(COLUMN_NAME,38,4) ,
substr(COLUMN_NAME,42,6)
TABLE1
WHERE COLUMN_NAME IS NOT NULL")