Tags: r, string, text, replace, kaggle

Removing single quotes in R


I'm making some word clouds for a project on Kaggle, but this line of code isn't working. I am trying to remove all the apostrophes from a column containing text. In my corpus, "'s" and "'re" are two of my most frequent "words". While the data is still in a data frame, I have been using this line of code: df$col <- gsub("\'","", df$col).

Below is some sample data. In my Kaggle project, the text data comes in a column of a data frame. Am I missing something? I've also tried str_replace_all and sub.

EDIT: dput(head(df))

structure(list(X1 = c(0, 1, 2, 3, 4, 5), Character = c("Michael", 
"Jim", "Michael", "Jim", "Michael", "Michael"), Line = c("All right Jim. Your quarterlies look very good. How are things at the library?", 
"Oh, I told you. I couldn’t close it. So…", "So you’ve come to the master for guidance? Is this what you’re saying, grasshopper?", 
"Actually, you called me in here, but yeah.", "All right. Well, let me show you how it’s done.", 
"[on the phone] Yes, I’d like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. [quick cut scene] All right. Done deal. Thank you very much, sir. You’re a gentleman and a scholar. Oh, I’m sorry. OK. I’m sorry. My mistake. [hangs up] That was a woman I was talking to, so… She had a very low voice. Probably a smoker, so… [Clears throat] So that’s the way it’s done."
), Season = c(1, 1, 1, 1, 1, 1), Episode_Number = c(1, 1, 1, 
1, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))

EDIT 2: Previously I stated that df$col <- gsub("\'","", df$col) worked in RStudio. That was only true on toy data. I used it on the dput output above and it didn't work, so I'm back to square one.


Solution

  • Your input contains "fancy" (curly) quotes such as ’, not the straight ASCII apostrophe ', so a pattern that matches only ' finds nothing to replace. This should get rid of all fancy single and double quotes and all non-fancy single quotes:

    gsub("['‘’”“]", "", df$Line)
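
To see the difference concretely, here is a minimal sketch using two of the lines from the dput output. The variable names (lines, unchanged, cleaned) are made up for illustration, and the curly quotes are written as \u escapes on the assumption that this avoids source-file encoding problems; the pattern is otherwise the same character class as above.

```r
# Two sample lines from the question; \u2019 is the right curly quote,
# \u2026 is the ellipsis character.
lines <- c("Oh, I told you. I couldn\u2019t close it. So\u2026",
           "All right. Well, let me show you how it\u2019s done.")

# The curly apostrophe is code point 8217 (U+2019), not the ASCII 39 (U+0027).
utf8ToInt("\u2019")
utf8ToInt("'")

# The original attempt only matches the ASCII apostrophe, so nothing changes.
unchanged <- gsub("'", "", lines)
identical(unchanged, lines)

# Character class covering ' plus U+2018, U+2019, U+201C and U+201D.
cleaned <- gsub("['\u2018\u2019\u201C\u201D]", "", lines)
cleaned
```

If you are unsure which characters are in your text, calling utf8ToInt() on a suspicious character is a quick way to check whether it is really the ASCII apostrophe.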