python pandas dataframe csv punctuation

How to remove punctuation from one column of a dataframe?


I'm trying to remove punctuation from the column "text" using this code:

import string

import pandas as pd

texttweet = pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")

i = 0
punct = "\n\r"+string.punctuation

for tweet in texttweet['text']:
    texttweet['text'][i] = tweet.translate(str.maketrans('', '', punct))
    i += 1

texttweet

I get the results I need, but pandas also shows this message:

A value is trying to be set on a copy of a slice from a DataFrame

So is it OK to keep my code regardless of the message or should I change something?


Solution

  • The best way to do this is to operate on the whole column at once:

    import string

    import pandas as pd

    texttweet = pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")
    punct = "\n\r"+string.punctuation
    texttweet['text'] = texttweet['text'].str.translate(str.maketrans('','',punct))
    texttweet
    

    For an explanation of the warning you were getting, see the pandas documentation on returning a view versus a copy: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

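    Basically, texttweet['text'] is a "slice" of the dataframe, and you are taking that slice and trying to assign something to it at position i. That slice may be a copy rather than a view, so pandas cannot guarantee that the assignment reaches the original dataframe, and it warns you about that.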

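    To avoid the warning (it is a warning, not an error) you can assign through texttweet.loc[i, 'text'] instead. This is different because the row and column are selected in a single indexing step, so the assignment is applied directly to the original dataframe, not to a slice of it.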
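
As a rough illustration of the two approaches described above, here is a minimal sketch that runs without the CSV file. The small `sample` dataframe and its contents are made up for the example (they stand in for the "text" column), and the names `table`, `vectorized` and `looped` are only illustrative.

```python
import string

import pandas as pd

# Hypothetical sample data standing in for the "text" column of the CSV.
sample = pd.DataFrame(
    {"text": ["Hello, world!", "Got my 1st dose :)\n#vaccine", "No punctuation here"]}
)

# Translation table that deletes line breaks and all ASCII punctuation.
table = str.maketrans("", "", "\n\r" + string.punctuation)

# Whole-column version: a single assignment replaces the column.
vectorized = sample.copy()
vectorized["text"] = vectorized["text"].str.translate(table)

# Row-by-row version: .loc picks the row label and the column in one
# indexing step, so the write targets the dataframe itself.
looped = sample.copy()
for label, tweet in sample["text"].items():
    looped.loc[label, "text"] = tweet.translate(table)

print(vectorized["text"].tolist())
print(looped["text"].tolist())
```

Both versions should give the same cleaned column. Note that .loc works with index labels; with the default index that read_csv creates (0, 1, 2, ...), those labels coincide with the counter i from the question. The whole-column version is usually the better choice, since it avoids the Python-level loop entirely.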