python, sed, python-re

Python: sort and remove duplicates in a list, and use re.sub


I am totally new to Python. I am trying to write the equivalent of this bash command: cat domains.txt | sort -u | sed 's/^*.//g' > domains2.txt. The file domains.txt contains a list of domains, with and without the wildcard prefix *., like:

*.example.com
example2.org

The file has more than 300k lines.

I wrote this code:

infile = "domains.txt"
outfile = "2"
outfile2 = "3"
with open(infile) as fin, open(outfile, "w+") as fout:
    for line in fin:
       line = line.replace('*.', "")
       fout.write(line)
with open('2', 'r') as r, open(outfile2, "w") as fout2 :
    for line in sorted(r):
        print(line, end='',file=fout2)

It strips the *. as planned and sorts the list, but it doesn't remove duplicate lines.

I was advised to use re.sub instead of replace to make the pattern stricter (as in sed, where I match only at the beginning of the line), but when I tried this:

import re

infile = "domains.txt"
outfile = "2"
outfile2 = "3"
with open(infile) as fin, open(outfile, "w+") as fout:
    for line in fin:
       newline = re.sub('^*.', '', line)
       fout.write(newline)
with open('2', 'r') as r, open(outfile2, "w") as fout2 :
    for line in sorted(r):
        print(line, end='',file=fout2)

it fails with errors that I don't understand.


Solution

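  • In regular expressions, * and . are special characters: * means "repeat the previous item" and . matches any character. In '^*.' the * has nothing to repeat, which is most likely the error you are seeing. To match these characters literally, escape them with a backslash, and use a raw string so Python leaves the backslashes alone.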

    import re
    
    s = "*.example.com"
    re.sub(r'^\*\.', '', s)
    
    > 'example.com'
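
The question also asks about duplicates, which sorted() alone does not remove. One way to cover that is to collect the cleaned lines in a set before sorting. This is a minimal sketch, not the only way to do it: the names clean_domains, rewrite_file and WILDCARD_PREFIX are made up for illustration. It also makes a few assumptions that differ from the bash pipeline: it removes duplicates after stripping the prefix (so *.example.com and example.com collapse into one entry), it skips blank lines, and it sorts by code point rather than by locale.

```python
import re

# Matches a literal "*." only at the very start of the string.
WILDCARD_PREFIX = re.compile(r'^\*\.')


def clean_domains(lines):
    """Strip a leading '*.', drop duplicates, and return a sorted list."""
    unique = set()
    for line in lines:
        domain = WILDCARD_PREFIX.sub('', line.strip())
        if domain:  # assumption: blank lines are not wanted in the output
            unique.add(domain)
    return sorted(unique)


def rewrite_file(infile, outfile):
    """Read domains from infile and write the cleaned, sorted list to outfile."""
    with open(infile) as fin:
        domains = clean_domains(fin)
    with open(outfile, 'w') as fout:
        fout.writelines(d + '\n' for d in domains)
```

Calling rewrite_file("domains.txt", "domains2.txt") would then replace both loops from the question. A set of 300k+ short strings should fit in memory on an ordinary machine, so the intermediate file is not needed.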