Python Regex to extract relative href links


I have an HTML file with many relative href links like this:

<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>

The file also contains many other http and ftp links.
I need an output text file like this:

14/02/08: station1_140208.txt  
14/02/09: station1_140209.txt  
14/02/10: station1_140210.txt  
14/02/11: station1_140211.txt  
14/02/12: station1_140212.txt  

I tried to write my own, but it is taking me too long to get used to Python regex.
I can open the source file, apply a regex (which I haven't been able to figure out yet), and write the result back to disk.

I need your help on the regex side.


Solution

  • pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
    

    test:

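    import re

    # Sample input: three relative links separated by <br/> tags
    s = """
    <a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
    <br/>
    <a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
    <br/>
    <a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
    <br/>
    """
    # Raw string so that \s and \S reach the regex engine unchanged.
    # Group 1 is the file name, group 2 is the link text.
    pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
    print(re.findall(pattern, s))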
    

    output:

    [('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
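
To get from those matches to the `14/02/08: station1_140208.txt` lines asked for in the question, one option is to capture the six digits in the file name as well. The following is only a sketch: it assumes every file name ends in `_YYMMDD.txt`, and the names `LINK_RE`, `format_links`, `write_links` and the file paths are made up for illustration.

```python
import re

# Matches only the relative links under data/self/dated/ and captures the
# file name plus the six-digit YYMMDD stamp embedded in it (assumed layout).
LINK_RE = re.compile(
    r'href="data/self/dated/([^"/]*_(\d{2})(\d{2})(\d{2})\.txt)"'
)


def format_links(html):
    """Return lines such as '14/02/08: station1_140208.txt', one per link."""
    return [
        f"{yy}/{mm}/{dd}: {name}"
        for name, yy, mm, dd in LINK_RE.findall(html)
    ]


def write_links(src_path, dst_path):
    """Read the HTML at src_path and write the formatted lines to dst_path.

    Both paths are hypothetical; substitute the real file names.
    """
    with open(src_path, encoding="utf-8") as src:
        lines = format_links(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(lines) + "\n")


sample = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>
<a href="http://example.com/other.txt">unrelated absolute link</a><br/>
<a href="data/self/dated/station1_140209.txt">Sunday, February 09, 2014</a><br/>
"""

result = format_links(sample)
print("\n".join(result))
```

Because the pattern is anchored on `href="data/self/dated/`, the http and ftp links are skipped without any extra filtering. For the real file, something like `write_links("index.html", "links.txt")` would do the read and write steps (both file names are placeholders).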