Tags: r, apache-spark-sql, sql-like, sparkr

How to select in R all columns with a common tag?


I have a SparkDataFrame with hundreds of columns, and I would like to select only those whose names contain the word "time".

Instead of doing something like this, where I have to write out all the column names manually:

df_selection <- SparkR::select(df, "TRT_time", "Flight_time", "SS_time", ...)

I would like to do something like:

df_selection <- SparkR::select(df, like('^time.*?'))

Is there a nice way of doing this in SparkR? An equivalent of the SQL LIKE operator? I'm a total beginner in R, sorry if this is quite a simple question, but I couldn't get my brain around it.

Thank you!


Solution

  • You can access the column names with names(), even on a SparkDataFrame, and then use grepl() to find the ones that match a pattern.

    cols <- names(df)
    
    # keep only the columns whose names start with "time"
    head(SparkR::select(df, cols[grepl('^time', cols)]))
    

    Which returns:

                time_hour
    1 2013-01-01 06:00:00
    2 2013-01-01 06:00:00
    3 2013-01-01 06:00:00
    4 2013-01-01 06:00:00
    5 2013-01-01 07:00:00
    6 2013-01-01 06:00:00
    
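
    Note that '^time' only matches column names that start with "time". For names like TRT_time or Flight_time from the question, which end in "time", drop the anchor and match anywhere in the name. A minimal sketch along those lines, assuming df has such columns:

    # value = TRUE makes grep() return the matching names themselves
    # instead of a logical vector
    time_cols <- grep('time', cols, value = TRUE)
    
    head(SparkR::select(df, time_cols))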

    Update after comment:

    The simplest way is to add it with an OR, i.e. '|', in the expression you use to subset cols.

    head(SparkR::select(df, cols[grepl('^time', cols) | cols == "airline_iata"]))
    

    Note: You could also create a character vector like this: c(cols[grepl('^time', cols)], 'airline_iata'). But Spark throws an error if a column name is not found, i.e. if 'airline_iata' is not in names(df).
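
    If you are not sure the extra column exists, a simple guard is to intersect the requested names with names(df) before selecting; a minimal sketch:

    wanted <- c(cols[grepl('^time', cols)], 'airline_iata')
    
    # keep only the names that actually exist in df, so select() cannot fail
    head(SparkR::select(df, intersect(wanted, cols)))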