I have a SparkDataFrame
with hundreds of columns and I would like to select only those with the word "time" in their names.
Instead of having to write all the column names manually, like this:
df_selection <- SparkR::select(df, "TRT_time", "Flight_time", "SS_time", ...)
I would like to do something like:
df_selection <- SparkR::select(df, like('^time.*?'))
Is there a nice way of doing this in SparkR? An equivalent of the SQL like operator? I'm a total beginner in R, sorry if this is quite a simple question, but I couldn't get my brain around it.
Thank you!
You can access the column names with names(), even for a SparkDataFrame. You can then use grepl() to find matches to a pattern.
cols <- names(df)
head(select(df, cols[grepl('^time.*?', names(df))]))
Which returns:
time_hour
1 2013-01-01 06:00:00
2 2013-01-01 06:00:00
3 2013-01-01 06:00:00
4 2013-01-01 06:00:00
5 2013-01-01 07:00:00
6 2013-01-01 06:00:00
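Note that the pattern '^time.*?' only matches names that start with "time", so columns such as "TRT_time" from the question would be missed. Dropping the ^ anchor matches "time" anywhere in the name; a minimal sketch, assuming the same df and cols as above:

# match "time" anywhere in the column name, not only at the start
time_cols <- cols[grepl("time", cols, ignore.case = TRUE)]
head(SparkR::select(df, time_cols))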
Update after comment:
The best thing would be to include it with an OR, i.e. |, where you subset cols.
head(SparkR::select(df, cols[grepl('^time.*?', names(df)) | cols == "airline_iata"]))
Note: You could also create a character vector like c(cols[grepl('^time.*?', names(df))], 'airline_iata'), but Spark throws an error if a column name is not found, i.e. if 'airline_iata' is not in names(df).
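If you want to combine a pattern with extra column names without risking that error, one option (a sketch, under the assumption that silently dropping missing names is acceptable) is to keep only the names that actually exist in df using intersect():

wanted <- c(cols[grepl('^time.*?', cols)], 'airline_iata')
existing <- intersect(wanted, names(df))  # drops any name not present in df
head(SparkR::select(df, existing))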