Deleting/Subsetting Rows in R by setdiff/intersect


I am trying to delete rows out of my data set that contain certain vegetation types. I want to delete the rows from my unsurveyed data that have vegetation types not found in my surveyed data. I found a way to do this but am looking for a one-line method. I am currently doing this:

> setdiff(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION)

Which returns seven vegetation types that I then delete doing this:

> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Acer rubrum- Nyssa sylvatica saturated forest alliance",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Acer rubrum/Quercus coccinea-Acer rubrum-Vaccinium corybosum-Vaccinium palladium",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Building",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Parking Lot",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Prunus serotina",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Typha (angustifolia, latifolia) - (Schoenoplectus spp.) Eastern Herbaceous Vegetation",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Water",]

I tried a few different options, including subsetting, which I figured would be my best bet, but with little success so far. I am also looking to do something similar with intersect, and I assume the answer would be much the same.

EDIT: In addition to using the code that @Cath supplied I edited it to get the opposite as well.

> unsurveyed_2 <- unsurveyed_2[unsurveyed_2$VEGETATION %in% setdiff(unsurveyed_2$VEGETATION, surveyed_1$VEGETATION), ]

Solution

  • The obvious would be to do:

    ID <- unsurveyed_1$VEGETATION %in% unique(surveyed_1$VEGETATION)
    unsurveyed_1 <- unsurveyed_1[ID,]
    

    You use the logical vector ID as a row index to select the rows you want to keep. ID is TRUE for every row where unsurveyed_1$VEGETATION also occurs in surveyed_1$VEGETATION, and FALSE otherwise. Taking the unique values of surveyed_1$VEGETATION only helps performance, and only if you have a whole lot of data and not too many different vegetation types.
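    To see what ID looks like, here is a minimal sketch on toy data. The column PLOT and all the values are made up for illustration; only the object names and the VEGETATION column come from the question.

```r
# Toy stand-ins for the question's data frames (all values are invented).
surveyed_1 <- data.frame(
  VEGETATION = c("Oak", "Pine", "Oak"),
  stringsAsFactors = FALSE
)
unsurveyed_1 <- data.frame(
  PLOT = 1:5,  # hypothetical extra column, just to have something to look at
  VEGETATION = c("Oak", "Water", "Pine", "Building", "Oak"),
  stringsAsFactors = FALSE
)

# TRUE where the row's vegetation type also occurs in the surveyed data.
ID <- unsurveyed_1$VEGETATION %in% unique(surveyed_1$VEGETATION)

# Keep only the rows flagged TRUE.
kept <- unsurveyed_1[ID, ]

print(ID)
print(kept)
```

    With this toy data, the "Water" and "Building" rows are the ones that get dropped.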

    So there is no need to use setdiff() at all, and even less need to write out every single result on its own line. Try to think in terms of temporary objects when working in R; it will make your programming life a whole lot easier.

    EDIT: This is exactly what @Cath did in his/her comment in a single line.


    If you insist on using setdiff(), this would still be considerably less typing:

    thediff <- setdiff(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION)
    ID <- unsurveyed_1$VEGETATION %in% thediff
    unsurveyed_1 <- unsurveyed_1[!ID,]
    

    Note that you have to invert the ID vector using the NOT (!) operator to drop all the rows where the unsurveyed vegetation matches a value in thediff.
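    As a quick check that the two routes agree, here is a small sketch on made-up data (the PLOT column and the vegetation values are assumptions for illustration). It also includes an intersect() variant, since the question asked about that; it is one way to write it, not the only one.

```r
# Toy stand-ins for the question's data frames (all values are invented).
surveyed_1 <- data.frame(
  VEGETATION = c("Oak", "Pine"),
  stringsAsFactors = FALSE
)
unsurveyed_1 <- data.frame(
  PLOT = 1:4,  # hypothetical extra column
  VEGETATION = c("Oak", "Water", "Pine", "Building"),
  stringsAsFactors = FALSE
)

# Route 1: keep rows whose type occurs in the surveyed data.
via_in <- unsurveyed_1[unsurveyed_1$VEGETATION %in% surveyed_1$VEGETATION, ]

# Route 2: find the types missing from the surveyed data, then drop those rows.
thediff <- setdiff(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION)
via_setdiff <- unsurveyed_1[!(unsurveyed_1$VEGETATION %in% thediff), ]

# Route 3: find the types common to both, then keep those rows.
common <- intersect(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION)
via_intersect <- unsurveyed_1[unsurveyed_1$VEGETATION %in% common, ]

print(identical(via_in, via_setdiff))
print(identical(via_in, via_intersect))
```

    All three select the same rows here; the first route just skips the intermediate set.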

    On a side note: the internal code of setdiff() and %in% is almost exactly the same, as both are built on match(). The difference is that setdiff() returns the actual (unique) values of the first vector that are not found in the second, whereas %in% returns a logical vector, one element per value of the first vector, that is FALSE wherever the value wasn't found in the second vector.
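    The difference in return values is easy to see on two small vectors. This is only an illustrative sketch; x and y are made-up values, not the question's data.

```r
x <- c("Oak", "Water", "Oak", "Building")  # invented example values
y <- c("Oak", "Pine")

# setdiff() returns the distinct values of x that do not occur in y.
missing_values <- setdiff(x, y)

# %in% returns one logical per element of x.
found <- x %in% y

# The setdiff() result can be rebuilt from the logical vector.
rebuilt <- unique(x[!found])

print(missing_values)
print(found)
print(identical(missing_values, rebuilt))
```

    Because %in% keeps one entry per element, it can be used directly as a row index, which is why it fits the row-filtering task without the detour through setdiff().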