
How to remove hashtags, user mentions, and URLs from a tweet? The Twitter4J library (sentiment analysis) does not work properly with these noise words


Example tweet: Hello great morning today #summermorning @evilpriest @holysinner https://goo.le/asxmo/dataload.......

Expected result: Hello great morning today summermorning

Is there a method or utility for this in Twitter4J itself, or do we need to write our own? Please advise.


Solution

  • Use a regular expression to strip the # characters before passing the sentence through the sentiment analysis pipeline:

    String withoutHashTweet = originalTweet.replaceAll("[#]", "");
    

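    So "Hello great morning today #summermorning @evilpriest @holysinner" becomes "Hello great morning today summermorning @evilpriest @holysinner".

    Similarly, replace the # in the pattern with @ to strip the @ sign. Note that this removes only the sign itself; the user name after it stays in the text, so removing a whole mention or a URL needs a pattern that also matches the characters that follow.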
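
To get from the example tweet to the expected result in the question, the same replaceAll idea can be extended to match whole mentions and URLs rather than single characters. The following is a minimal sketch, not a Twitter4J feature: the class name TweetCleaner, the method clean, and the three patterns are illustrative assumptions, and the example.com address stands in for the truncated URL in the question. It assumes a mention is an @ followed by letters, digits, or underscores, and a URL starts with http:// or https:// and runs to the next whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Twitter4J.
final class TweetCleaner {
    // Assumption: a URL starts with http:// or https:// and ends at the next whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // Assumption: a mention is '@' followed by letters, digits, or underscores.
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    // Only the '#' sign is removed; the hashtag text is kept.
    private static final Pattern HASH = Pattern.compile("#");
    // Runs of whitespace left behind by the removals.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String clean(String tweet) {
        // URLs go first so that a '#' or '@' inside a URL is not handled separately.
        String text = URL.matcher(tweet).replaceAll("");
        text = MENTION.matcher(text).replaceAll("");
        text = HASH.matcher(text).replaceAll("");
        // Collapse the gaps left by removed tokens and trim the ends.
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String tweet = "Hello great morning today #summermorning "
                + "@evilpriest @holysinner https://example.com/abc";
        System.out.println(clean(tweet));
        // prints: Hello great morning today summermorning
    }
}
```

The mention pattern is deliberately simple and would also cut into an email address such as name@host.com, so tweets that may contain email addresses would need a stricter pattern.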