Search code examples
rtextunicodeasciidata-cleaning

Cleaning a column with break spaces that obtain last, first name so I can filter it from my data frame


I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try and filter them I get most of the names except for a few, even though I can clearly see their names in the original excel file. I think it has to do what some sort of special characters or spacing in the name column. I am confused on how I can fix this.

I have tried using excels clean() function to apply that to the given column. I have tried working an Alteryx flow to clean the data. All of these steps haven't helped any. I am starting to wonder if this is an r issue.

surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")

surveyData %>% filter(`Completed By` == "Spencer, Amy")

in r the first line had this redbox with white dot in between the comma and the first name. I got this red box with white dot by copy the name from the data frame and copying it into notepad and then pasting it in r. This actually works and returns what I want. Now the second case is a standard space which doesn't return what I want. So how can I fix this issue by not having to copy a name from the data frame and copy to notepad then copying the results from notepad to r, which has the redbox with a white dot in between the comma(,) and first name.

Expected results is that I get the rows that are attached to what ever name I filter by.


Solution

  • I was able to find the answer, it turns out the space is actually a break space with unicode of (U+00A0) compared to the normal space unicode (U+0020). The break space is not apart of the American Standard Code for Information Interchange(ACSII). Thus r filter() couldn't grab some names because they had break spaces. I fixed this by subbing the Unicode of the break space with the Unicode for a normal space and applying that to my given column. Example below:

     space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) #subbing break space unicode with space unicode for the given column I am interested in
    
    surveyData$`Completed By Clean` = space_fix 
    

    Once, I applied this I could easily filter any name!

    Thanks everyone!