Tags: python, pandas, dataframe, python-re, findall

Find all website links, group and count from column of dataframe - Python


I have a dataframe with the following columns: Date, Time, Tweet, Client, Client Simplified. The Tweet column sometimes contains a website link. I am trying to define a function that extracts which link appears in each tweet and how many times it appears.

I don't want the answer for the whole function. Right now I am struggling with findall, before I put all of this into a function:

import pandas as pd
import re

csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")

URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)

The error I'm getting is:

TypeError                                 Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
      7 # csv_doc.head()
      8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
     10 
     11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])

/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
    229     and the remainder of the string is returned as the final element
    230     of the list."""
--> 231     return _compile(pattern, flags).split(string, maxsplit)
    232 
    233 def findall(pattern, string, flags=0):

TypeError: expected string or bytes-like object

Could you please let me know what is wrong? Thanks.


Solution

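    1. Add r in front of the pattern string. This makes it a raw string literal, so Python passes backslashes such as \( to the regex engine unchanged instead of interpreting them as string escapes.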

    2. The functions in the re module work on a single string, not on a list or a Series of strings, which is why you get the TypeError. You can use a simple list comprehension like this:

    [re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',x) for x in csv_doc.Tweet]
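The two points above can be tried on a small example. This is only a sketch under stated assumptions: the `tweets` Series is made-up data standing in for `csv_doc.Tweet`, and `URL_PATTERN` is a simplified pattern chosen for readability, not the regex from the question. It also shows `Series.str.findall`, the pandas element-wise counterpart of `re.findall`, as one possible alternative to the list comprehension, followed by one way to count the links.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the Tweet column of the CSV.
tweets = pd.Series([
    "Read this https://example.com/a and https://example.com/b",
    "No link here",
    "Again https://example.com/a",
])

# Simplified raw-string pattern for illustration (an assumption,
# not the pattern used in the question).
URL_PATTERN = r"https?://\S+"

# Passing the whole Series to re.findall raises the TypeError from the question,
# because re expects a single str or bytes object.
try:
    re.findall(URL_PATTERN, tweets)
    raised = False
except TypeError:
    raised = True

# Option 1: call re.findall on one string at a time.
links_per_tweet = [re.findall(URL_PATTERN, text) for text in tweets]

# Option 2: let pandas apply the pattern to every element of the Series.
links_series = tweets.str.findall(URL_PATTERN)

# Flatten the per-tweet lists, drop tweets without links, and count each link.
link_counts = links_series.explode().dropna().value_counts()
print(link_counts)
```

If the real column contains missing values, the list comprehension would fail on them because they are not strings, whereas `Series.str.findall` leaves them as missing; which behaviour you want depends on your data.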