I have an HTML file with tons of relative href links like this:
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>
There are also tons of other http and ftp links in the file, which should be ignored.
I need an output txt file like this:
14/02/08: station1_140208.txt
14/02/09: station1_140209.txt
14/02/10: station1_140210.txt
14/02/11: station1_140211.txt
14/02/12: station1_140212.txt
I tried to write my own, but it is taking me too long to get used to Python regex.
I can open the source file, apply a regex (which I haven't figured out yet), and write the result back to disk.
I need your help on the regex side.
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
test:
import re
s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
# Raw string, so \s reaches the regex engine instead of being treated as a string escape.
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
print(re.findall(pattern, s))
output:
[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
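To get from those matches to the "yy/mm/dd: filename" lines asked for in the question, one possible approach is to capture the six digits of the file name directly. This is only a sketch: it assumes every file name ends in _yymmdd.txt (names with a different digit count, such as station1_1402010.txt in the test above, are skipped), and format_links, LINK_RE and the sample string are names made up for illustration.

```python
import re

# Raw string; captures the whole file name plus its yy, mm and dd digit pairs.
# Only links that start with the relative path data/self/dated/ can match,
# so absolute http and ftp links are ignored.
LINK_RE = re.compile(r'href="data/self/dated/([^"]*_(\d{2})(\d{2})(\d{2})\.txt)"')

def format_links(html):
    """Return one 'yy/mm/dd: filename' line per matching relative link."""
    lines = []
    for name, yy, mm, dd in LINK_RE.findall(html):
        lines.append(f"{yy}/{mm}/{dd}: {name}")
    return lines

# Hypothetical sample input, modelled on the links in the question.
sample = '''
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>
<a href="http://example.com/other.txt">unrelated</a><br/>
<a href="data/self/dated/station1_140209.txt">Sunday, February 09, 2014</a><br/>
'''

print("\n".join(format_links(sample)))
```

For the real file, read its contents with open(...).read(), pass them to format_links, and write the joined lines to the output txt file.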