How to remove @user, hashtags, and links from tweet text and put it into a DataFrame in Python


I'm a beginner at Python and I'm trying to gather data from Twitter using the API. I want to collect the username, the date, and the cleaned tweet text (without @username mentions, hashtags, and links) and then put them into a DataFrame.

I found a way to do this using ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split()), but when I use it in my code, it raises NameError: name 'tweet' is not defined

Here is my code:

tweets = tw.Cursor(api.search, q=keyword, lang="id", since=date).items()

raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())

data_tweet = [[tweet.user.screen_name, tweet.created_at, raw_tweet] for tweet in tweets]

dataFrame = pd.DataFrame(data=data_tweet, columns=['user', "date", "tweet"])

I know the problem is in data_tweet, but I don't know how to fix it. Please help me.

Thank you.


Solution

  • The problem is actually in the second line:

    raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
    

    Here, you are using tweet.text, but at this point you have not defined tweet, only tweets. The name tweet first appears in your third line, as the loop variable of the list comprehension:

    for tweet in tweets
    

    I'm assuming you want tweet to be each item you get while iterating through tweets. If so, the cleaning step has to run inside that same loop, and each row has to be appended to a list (assigning data_tweet inside the loop would keep only the last tweet):

    data_tweet = []  # one [user, date, cleaned text] row per tweet
    for tweet in tweets:
        raw_tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet.text).split())
        data_tweet.append([tweet.user.screen_name, tweet.created_at, raw_tweet])
    dataFrame = pd.DataFrame(data=data_tweet, columns=['user', 'date', 'tweet'])
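One thing to watch with the pattern from the question: it replaces the # character but keeps the hashtag word, and it stops a mention at the first underscore. Below is a minimal sketch of a helper that drops whole mentions, hashtags, and links instead. The name clean_tweet, the adjusted pattern, and the SimpleNamespace stand-ins for the API's tweet objects are all assumptions made for illustration, not part of the original code.

```python
import re
from types import SimpleNamespace

# Assumed pattern (not the one from the question): whole @mentions,
# whole #hashtags, links, then any remaining non-alphanumeric character.
PATTERN = re.compile(r"(@\w+)|(#\w+)|(\w+://\S+)|([^0-9A-Za-z \t])")


def clean_tweet(text):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())


# Hypothetical stand-ins for the objects yielded by the API cursor.
tweets = [
    SimpleNamespace(
        user=SimpleNamespace(screen_name="alice"),
        created_at="2021-01-01",
        text="@bob_1 check https://t.co/abc #python rocks!",
    ),
    SimpleNamespace(
        user=SimpleNamespace(screen_name="carol"),
        created_at="2021-01-02",
        text="Hello, world! #greeting",
    ),
]

# The cleaning call sits inside the comprehension, where `tweet` is defined.
rows = [
    [tweet.user.screen_name, tweet.created_at, clean_tweet(tweet.text)]
    for tweet in tweets
]
```

The resulting rows can then be passed to pd.DataFrame(data=rows, columns=['user', 'date', 'tweet']), as in the question.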