python, regex, text, nlp

Separating and extracting part of URL strings using regex?


I have a DataFrame df with a column named url. Each URL string contains a unique six-character alphanumeric ID. I've been trying to extract that part of each string, the article_id, from all URLs and add it to df as a new column.

For example, xwpd7w is the article_id for https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo

How do I extract the article_id from every URL in df, based on its position right after /article/? Any method is fine, regex or not.

I have so far done the following:

df.url.str.split()

Example output: [https://www.vice.com/en_au/article/j539yy/smo...

df['cutcurls'] = df.url.str.join(sep=' ')

Example output: h t t p s : / / w w w . v i c e . c o m / e n

Any ideas?


Solution

  • Use the "str.extract" method with a capture group placed right after /article/.

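    import pandas as pd

    # Two sample URLs; the second one carries a longer ID to show the pattern is length-agnostic
    df = pd.DataFrame({"url": ["https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo", "https://www.www.www//en_us/article/idId2019/buzzwords"]})

    # Capture everything between "/article/" and the next "/"; expand=False returns a Series
    df["article_id"] = df.url.str.extract(r"/article/([^/]+)", expand=False)

        Out:
            url article_id
            0  https://www.vice.com/en_us/article/xwpd7w/how-...     xwpd7w
            1  https://www.www.www//en_us/article/idId2019/bu...   idId2019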
    

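    ([^/]+): a capture group matching one or more consecutive characters that are not '/', i.e. everything between /article/ and the next slash. URLs without /article/ yield NaN.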
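
Since the question allows any method, regex or not, here is a rough sketch of two variations, not a definitive implementation. The first uses only str.split; the second tightens the regex to exactly six alphanumeric characters, matching the ID format described in the question. The second URL and the column names article_id_split and article_id_strict are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "url": [
        "https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo",
        # Hypothetical URL with no /article/ segment, to show the missing-value case
        "https://www.vice.com/en_us/topic/no-article-here",
    ]
})

# Non-regex route: split once on "/article/", take the part after it,
# then keep only the text before the next "/".
after_marker = df["url"].str.split("/article/", n=1).str[1]
df["article_id_split"] = after_marker.str.split("/", n=1).str[0]

# Stricter regex route: exactly six alphanumeric characters,
# followed by "/" or the end of the string.
df["article_id_strict"] = df["url"].str.extract(
    r"/article/([A-Za-z0-9]{6})(?:/|$)", expand=False
)

print(df[["article_id_split", "article_id_strict"]])
```

Both columns are missing (NaN) for rows with no /article/ segment. The strict pattern will also reject IDs that are not exactly six characters long, so it only fits if that assumption about the data really holds.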