r, filter, character, data-cleaning

ID names should all have the same number of characters. How do I filter for rows without the correct number of characters and then delete them?


I have a dataset where the ID names are all supposed to have 16 characters. How do I filter out all of the rows whose ID does not have exactly 16 characters so I can delete them from my dataset? I am working in RStudio.

I've tried both of these in an attempt to get R to retrieve the rows that do not have exactly 16 characters, but neither worked. I'm new to R, so I'm still figuring it out.

length(all_trips$ride_id != 16)
length(nchar(all_trips$ride_id !=16))

Solution

  • You are close, and you are on the right track with nchar().

    I assume you have a data frame all_trips with a character column ride_id.

    Your first attempt:

    length(all_trips$ride_id != 16)
    

    translates as "compare each value of ride_id with 16, giving one TRUE or FALSE per row, then find the length of that logical vector". This returns a single number, the number of rows in all_trips - not what we want.

    Your second attempt:

    length(nchar(all_trips$ride_id !=16))
    

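    translates as "compare each value of ride_id with 16, then count the characters in each resulting TRUE or FALSE, then find the length of the vector containing those counts". Because the != 16 sits inside nchar(), the characters are counted on the comparison result rather than on ride_id, and length() again just returns the number of rows. Again - not what we want.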
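
    To see why both attempts only report the row count, here is a small sketch on a made-up three-row data frame (the ids are invented for illustration and are not from your data):

    ```r
    # Hypothetical toy data; the real all_trips will have more rows and columns.
    all_trips <- data.frame(
      ride_id = c("A1B2C3D4E5F6A7B8", "SHORT", "A1B2C3D4E5F6A7B8C9"),
      stringsAsFactors = FALSE
    )

    # The comparison runs first and yields one TRUE/FALSE per row.
    cmp <- all_trips$ride_id != 16

    # length() counts those elements, so it equals the number of rows.
    first_attempt <- length(cmp)

    # nchar() on a logical vector counts the characters of "TRUE" or "FALSE".
    chars_of_logicals <- nchar(cmp)
    second_attempt <- length(chars_of_logicals)

    # What was intended: the number of characters in each id.
    id_lengths <- nchar(all_trips$ride_id)
    ```

    Printing id_lengths shows one character count per row, which is the vector you want to compare against 16.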

    What you want to do is:

    "retain only the subset of all_trips where ride_id contains 16 characters"

    Which you can do like this:

    all_trips_filtered <- all_trips[nchar(all_trips$ride_id) == 16, ]
    

    Alternatively, use subset(), which lets you refer to the column by name alone:

    all_trips_filtered <- subset(all_trips, nchar(ride_id) == 16)
    

    See ?Extract or ?subset for more help.
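
    As a rough sketch of how the two approaches behave, here they are on a small made-up data frame (the ids and the missing value are invented for illustration). One thing worth knowing is that a missing ride_id makes nchar() return NA; the bracket form keeps that as an all-NA row, while subset() drops it:

    ```r
    # Hypothetical toy data, assuming ride_id is a character column.
    all_trips <- data.frame(
      ride_id = c("A1B2C3D4E5F6A7B8", "SHORT", NA, "0123456789ABCDEF"),
      stringsAsFactors = FALSE
    )

    # Character count per id; the missing id gives NA rather than a number.
    n_chars <- nchar(all_trips$ride_id)

    # Bracket indexing: an NA in the condition produces a row of NAs.
    with_bracket <- all_trips[n_chars == 16, , drop = FALSE]

    # subset() treats NA in the condition as FALSE and drops that row.
    with_subset <- subset(all_trips, nchar(ride_id) == 16)

    # Bracket form that also drops missing ids explicitly.
    keep <- !is.na(n_chars) & n_chars == 16
    all_trips_filtered <- all_trips[keep, , drop = FALSE]

    # How many rows were removed by the filter.
    n_removed <- nrow(all_trips) - nrow(all_trips_filtered)
    ```

    If ride_id has no missing values, the one-line versions above give the same result as all_trips_filtered here.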