I used a regex search to filter some results out of a text file (searching for ".js"), which gave me roughly 16 results, some of which are duplicates. I want to remove the duplicates from that output and either print it to the console or redirect it into a file. I have tried sets and dict.fromkeys with no success! Here is what I have at the moment. Thank you in advance:
#!/usr/bin/python
import re
import sys

pattern = re.compile(r"[^/]*\.js")
for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        x = str(match)
        print(x)
Using sets to eliminate duplicates:
#!/usr/bin/python
import re

pattern = re.compile(r"[^/]*\.js")
matches = set()  # every match printed so far
with open('access_log.txt') as f:
    for line in f:
        for match in re.findall(pattern, line):
            # x = str(match)  # or just use match
            if match not in matches:
                print(match)
                matches.add(match)
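Since you mentioned dict.fromkeys, here is a minimal sketch of that variant. The helper name unique_matches and the sample lines are made up for illustration (they stand in for your log file), and the ordering relies on dicts keeping insertion order, which is guaranteed from Python 3.7 on.

```python
import re

pattern = re.compile(r"[^/]*\.js")

def unique_matches(lines, pattern):
    """Return each distinct match once, in the order it was first seen."""
    found = (m for line in lines for m in pattern.findall(line))
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(found))

# Hypothetical sample lines standing in for access_log.txt
sample = [
    "GET /static/app.js HTTP/1.1",
    "GET /static/jquery.js HTTP/1.1",
    "GET /static/app.js HTTP/1.1",
]

for name in unique_matches(sample, pattern):
    print(name)
```

To work on the real file, you could pass the open file object instead of sample and write the returned list to a second file opened in "w" mode rather than printing it.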
But I question your regex. You are calling findall on each line, which suggests that a line might contain multiple "hits", such as:
file1.js file2.js file3.js
But in your regex, [^/]*\.js, the [^/]* part is greedy, so for that line it would return only one match, namely the complete line. If you made the match non-greedy, i.e. [^/]*?, then you would get 3 matches:
'file1.js'
' file2.js'
' file3.js'
But that highlights another potential problem. Do you really want the leading spaces in the second and third matches? Perhaps in a case such as /abc/ def.js you would want to keep the leading blank that follows /abc/.
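The greedy versus non-greedy behaviour described above can be checked quickly in an interpreter. This is only a small sketch using the sample strings from this answer, not your actual log lines:

```python
import re

line = "file1.js file2.js file3.js"

# Greedy: [^/]* swallows as much as it can, so the whole line is one match
greedy = re.findall(r"[^/]*\.js", line)
# Non-greedy: [^/]*? stops at the first '.js' it can reach
lazy = re.findall(r"[^/]*?\.js", line)
# The blank after '/abc/' becomes part of the match
after_slash = re.findall(r"[^/]*?\.js", "/abc/ def.js")

print(greedy)       # ['file1.js file2.js file3.js']
print(lazy)         # ['file1.js', ' file2.js', ' file3.js']
print(after_slash)  # [' def.js']
```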
So I would suggest:
#!/usr/bin/python
import re

pattern = re.compile(r"""
    (?:             # first alternative:
        (?<=/)      # positive lookbehind assertion: preceded by '/'
        [^/]*?      # matches non-greedily 0 or more non-'/'
    |               # second alternative:
        (?<!/)      # negative lookbehind assertion: not preceded by '/'
        [^/\s]*?    # matches non-greedily 0 or more characters that are neither '/' nor whitespace
    )
    \.js            # matches '.js'
""", re.VERBOSE)    # verbose mode

matches = set()  # every match printed so far
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            if match not in matches:
                print(match)
                matches.add(match)
If the filename cannot have any whitespace, then just use:
pattern = re.compile(r"[^\s/]*?\.js")
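To see where the two patterns differ, here is a rough comparison on a few made-up inputs. The names with_blanks and no_blanks are just labels for this sketch, and the path /abc/my file.js is a hypothetical example of a filename containing a space:

```python
import re

# Lookbehind version: keeps blanks only when the name directly follows a '/'
with_blanks = re.compile(r"""
    (?:
        (?<=/) [^/]*?      # after a '/': anything except '/'
    |
        (?<!/) [^/\s]*?    # elsewhere: anything except '/' and whitespace
    )
    \.js
""", re.VERBOSE)

# Simple version: the filename never contains whitespace
no_blanks = re.compile(r"[^\s/]*?\.js")

for text in ("file1.js file2.js file3.js", "/abc/ def.js", "/abc/my file.js"):
    print(repr(text), with_blanks.findall(text), no_blanks.findall(text))
```

Both patterns agree on the first input. They differ on the other two, where the simple pattern drops everything up to and including the blank.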