r, dataframe, subset, cbind

Subset a dataframe by a character string in R


I am trying to subset a dataframe by a string condition on one of its columns. This should be a simple task, and I see it has been asked many times before, but I am completely stuck.

A sample of the dataframe is below:

structure(list(Analyte = c("Fe", "SiO2", "Al2O3", "TiO2", "Zr"
), Category = c("Certified", "Certified", "Certified", "Certified", 
"Informational"), AssignedValue = c("57.2", "6.7497718955", "2.8925", 
"0.146635643433333", "0.00393333333333333"), Uncertainty = c("0.0587455625228403", 
"0.0164487575063948", "0.0114603084512766", "0.00242243266797717", 
NA), CILower = c("57.0448631590115", "6.67853259277291", "2.82556340328344", 
"0.141155720022072", "0.00289242352888054"), CIUpper = c("57.4618035076551", 
"6.83390972656042", "2.93457675661656", "0.152115566844594", 
"0.00497424313778613"), labCV.all = c("0.515527815366847", "1.64892092489221", 
"2.51730947074656", "5.4795936391998", "22.5584788355489"), totalSamples = c("65", 
"65", "65", "65", "36"), NoLabs = c("10", "10", "10", "10", "5"
), Sy = c("0.291421208884417", "0.108601127975891", "0.0761950799826298", 
"0.00766040470920629", NA), Uchar = c("0.0921554778554455", "0.0343426920867249", 
"0.0240949999243813", "0.00242243266797717", "0.0003749073959734"
)), row.names = c(1L, 2L, 3L, 4L, 24L), class = "data.frame")

I have tried the following

df2 <- df[df$Category == "Certified"]

However, the new dataframe df2 is the same as the old one.

I think it has something to do with the fact that the dataframe is derived from a list of dataframes that were combined with cbind, so the structure is not quite right?

When I check the data type with typeof(df), I get list.

I have tried many different ways to convert to a dataframe but it has made no difference.


Solution

  • You need a comma:

    df[df$Category == "Certified",]
    

    The comma after the condition tells R that the condition selects rows; the empty slot after the comma keeps all columns.

    Without the comma, the dataframe is indexed like a list, so the logical vector selects columns instead of rows. In the sample, the length-5 vector is recycled across the 11 columns, which drops the CILower and Sy columns and keeps every row.

    In other words, omitting the comma makes the expression equivalent to column-wise subsetting:

    df[, df$Category == "Certified"]
    

    You can check this by comparing the two results:

    > all(df[df$Category == "Certified"] == df[, df$Category == "Certified"], na.rm = TRUE)
    [1] TRUE

    The result is TRUE. Note the na.rm = TRUE: the data contains NA values, and without it all() would return NA instead of TRUE.
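
To make the difference concrete, here is a minimal sketch on a small stand-in dataframe. The three-row data and the names certified and cols_only are made up for illustration and are not the asker's real data. It also shows that typeof() reports "list" for any dataframe, so that result on its own does not suggest a broken structure; class() or is.data.frame() is the more useful check.

```r
# Hypothetical stand-in data: three rows, character columns as in the question
df <- data.frame(
  Analyte       = c("Fe", "SiO2", "Zr"),
  Category      = c("Certified", "Certified", "Informational"),
  AssignedValue = c("57.2", "6.75", "0.0039"),
  stringsAsFactors = FALSE
)

# With the comma: the logical vector selects rows, all columns are kept
certified <- df[df$Category == "Certified", ]
nrow(certified)    # 2

# Without the comma: the same logical vector selects columns, all rows are kept
cols_only <- df[df$Category == "Certified"]
names(cols_only)   # "Analyte" "Category"
nrow(cols_only)    # 3

# A dataframe is stored as a list of columns, so typeof() always says "list"
typeof(df)         # "list"
class(df)          # "data.frame"
is.data.frame(df)  # TRUE

# subset() is a base R alternative that needs no comma
certified_alt <- subset(df, Category == "Certified")
```

Both certified and certified_alt hold the same two rows here. One difference worth knowing: if Category contained NA, the bracket form would return an all-NA row for it, while subset() would drop that row.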