r, apache-spark, sparklyr, r-dbi

SparklyR/Spark SQL: split string into multiple columns based on byte/character count


I have a Spark DataFrame TABLE1 with a single column of 100,000 rows, where each row contains a string of identical length:

 AA105LONDEN    03162017045262017 16953563ABCDEF

and I would like to split each row into multiple columns at the boundaries marked below, keeping the runs of whitespace as their own columns:

 AA|105|LONDEN|    |0316|2017|04526|2017| |1695|3563|ABCDEF

Currently I am trying to use the DBI package, with code that looks like the following:

 library(DBI)

 newdataframe <- dbGetQuery(sc, "INSERT INTO TABLE2
   (COLUMN1, COLUMN2, COLUMN3, COLUMN4)
   SELECT SUBSTR(TRIM(COLUMN_NAME), 1, 3),
          SUBSTR(TRIM(COLUMN_NAME), 4, 8),
          SUBSTR(TRIM(COLUMN_NAME), 9, 12),
          SUBSTR(TRIM(COLUMN_NAME), 12, 15)
   FROM TABLE1
   WHERE COLUMN_NAME IS NOT NULL")

This, however, does not seem to work. Even if it did, dbGetQuery() returns an R data frame, whereas I need the result to remain a Spark DataFrame. I am looking either for an alternative suggestion other than SQL, or for the proper syntax that will work with Spark and return a parsed Spark DataFrame that I can then run ML on. Any ideas?

Thanks in advance.


Solution

  • Use this to create the DataFrame. It will give the correct result.

        sqlContext.sql("select substr(COLUMN_NAME,1,2),
            substr(COLUMN_NAME,3,3),
            substr(COLUMN_NAME,6,6),
            substr(COLUMN_NAME,12,4),
            substr(COLUMN_NAME,16,4),
            substr(COLUMN_NAME,20,4),
            substr(COLUMN_NAME,24,5),
            substr(COLUMN_NAME,29,4),
            substr(COLUMN_NAME,33,1),
            substr(COLUMN_NAME,34,4),
            substr(COLUMN_NAME,38,4),
            substr(COLUMN_NAME,42,6)
            FROM TABLE1
            WHERE COLUMN_NAME IS NOT NULL")
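
For completeness, since the question asks for a result that stays in Spark when working from R, below is a minimal sparklyr sketch of the same parsing. It assumes an existing connection `sc`, that TABLE1 is registered as a Spark table, a reasonably recent sparklyr, and placeholder column names col01–col12. Both variants return a tbl_spark (a reference to a Spark DataFrame) rather than collecting rows into R, so the parsed table can be handed straight to sparklyr's ML functions.

    library(sparklyr)
    library(dplyr)

    # Variant 1: run the corrected SQL through sparklyr.
    # sdf_sql() returns a tbl_spark instead of an R data frame.
    parsed <- sdf_sql(sc, "
      SELECT substr(COLUMN_NAME, 1, 2)  AS col01,
             substr(COLUMN_NAME, 3, 3)  AS col02,
             substr(COLUMN_NAME, 6, 6)  AS col03,
             substr(COLUMN_NAME, 12, 4) AS col04,
             substr(COLUMN_NAME, 16, 4) AS col05,
             substr(COLUMN_NAME, 20, 4) AS col06,
             substr(COLUMN_NAME, 24, 5) AS col07,
             substr(COLUMN_NAME, 29, 4) AS col08,
             substr(COLUMN_NAME, 33, 1) AS col09,
             substr(COLUMN_NAME, 34, 4) AS col10,
             substr(COLUMN_NAME, 38, 4) AS col11,
             substr(COLUMN_NAME, 42, 6) AS col12
      FROM TABLE1
      WHERE COLUMN_NAME IS NOT NULL")

    # Variant 2 (no SQL): a dplyr pipeline on the Spark table.
    # R's substr(x, start, stop) takes a stop *position*; dbplyr translates it
    # to Spark's SUBSTR(x, start, length) when the pipeline is executed.
    parsed2 <- tbl(sc, "TABLE1") %>%
      filter(!is.na(COLUMN_NAME)) %>%
      transmute(
        col01 = substr(COLUMN_NAME, 1, 2),
        col02 = substr(COLUMN_NAME, 3, 5),
        col03 = substr(COLUMN_NAME, 6, 11),
        col04 = substr(COLUMN_NAME, 12, 15),
        col05 = substr(COLUMN_NAME, 16, 19),
        col06 = substr(COLUMN_NAME, 20, 23),
        col07 = substr(COLUMN_NAME, 24, 28),
        col08 = substr(COLUMN_NAME, 29, 32),
        col09 = substr(COLUMN_NAME, 33, 33),
        col10 = substr(COLUMN_NAME, 34, 37),
        col11 = substr(COLUMN_NAME, 38, 41),
        col12 = substr(COLUMN_NAME, 42, 47)
      )

    # Register the result under a name if it needs to be queried again by SQL:
    # sdf_register(parsed, "TABLE2")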