Tags: python, regex, python-3.x, data-cleaning

How to fix a regular expression for scraped URL data in Python?


I am trying to clean my URL data using regular expressions. I have already cleaned most of it, but there is one last problem that I don't know how to solve.

The data was scraped from a news hub, and each URL consists of a theme part and a source part.

I need to drop the source part of each URL and keep only the theme part (the first path segment), so that I can put it into a NumPy array for further analysis.

My scraped urls look like this:

/video/36225009-report-cnbc-russian-sanctions-ukraine/

/health/36139780-cancer-rates-factors-of-stomach/

/business/36187789-in-EU-IMF-reports-about-world-economic-environment/

/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1

/head/36214416-GB-brexit-may-stops-process-by/

/cis/36189830-kiev-arrested-property-in-crymea/

/incidents/36173928-traffic-collapse-by-trucks-incident/

..............................................................

I have tried the following code to solve this problem, but it doesn't work: it returns the whole string instead of just the theme part.

import numpy as np
import pandas as pd
import re

regex = r"^/(\b(\w*)\b)"

pattern_two = regex
prog_two = re.compile( pattern_two )

with open('urls.txt', 'r') as f:

    for line in f:
        line = line.strip()

        if prog_two.match( line ):
            print( line )

I have also checked regular expressions such as regex = r"^/(\b(\w*)\b)" and regex = r"^/[a-z]{0,9}./" on regex101.com, but they don't work properly either. I don't have strong regex skills, so maybe I am doing something wrong?

The final result that I expect is the following:

video
health
business
video
head
cis
incidents  
...........

Thank you very much for helping!


Solution

  • You might be able to just use split() here:

    with open('urls.txt', 'r') as f:
        for line in f:
            line = line.strip()   # this might be optional
            if line.startswith('/'):
                print(line.split("/")[1])
    

    In general, when plain string methods can do the job, prefer them over invoking the regex engine.
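
If a regex is still preferred, the likely reason the original attempt prints the whole string is that it prints `line` rather than the captured group of the match. The following is a minimal sketch of that fix, not a definitive implementation: the names `THEME_RE` and `extract_themes` are hypothetical, and the sample URLs from the question are inlined in place of reading `urls.txt`.

```python
import re

# Sample lines copied from the question; in practice they would be read from urls.txt.
urls = [
    "/video/36225009-report-cnbc-russian-sanctions-ukraine/",
    "/health/36139780-cancer-rates-factors-of-stomach/",
    "/business/36187789-in-EU-IMF-reports-about-world-economic-environment/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
    "/head/36214416-GB-brexit-may-stops-process-by/",
    "/cis/36189830-kiev-arrested-property-in-crymea/",
    "/incidents/36173928-traffic-collapse-by-trucks-incident/",
]

# A leading slash followed by one or more word characters, captured as group 1.
# (Hypothetical name; assumes theme segments contain only letters, digits, or "_".)
THEME_RE = re.compile(r"^/(\w+)")


def extract_themes(lines):
    """Return the first path segment of every line that starts with '/'."""
    themes = []
    for line in lines:
        match = THEME_RE.match(line.strip())
        if match:
            # group(1) is only the captured segment; printing `line` would show the whole URL.
            themes.append(match.group(1))
    return themes


themes = extract_themes(urls)
print(themes)
# ['video', 'health', 'business', 'video', 'head', 'cis', 'incidents']
```

The resulting list can be passed to `np.array(themes)` if a NumPy array is needed for the later analysis. Note that `\w+` stops at the first hyphen, so if a theme segment could ever contain one, a pattern such as `r"^/([^/]+)"` would be closer to what `split("/")[1]` returns.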