linux bash awk sed domain-name

Remove lines with long TLD (Top Level Domain)


Is there a more efficient way to remove lines with an invalid (too long) TLD (Top Level Domain)? I'm not proficient with sed / awk. I want to remove every line in a file whose text after the last period is longer than 24 characters.

What I wrote works, but is extremely slow on long lists. It takes each individual line, counts the number of characters after the period, saves the lines with more than 24 characters to a list, then removes them from the source.

Sample Input:

test.sub.xn--vermgensberatung-pwb
test.sub.xn--vermgensberatung-pwba

Expected Output:

test.sub.xn--vermgensberatung-pwb

My current code:

Source='/tmp/source'

while read -r Line || [[ -n "$Line" ]]; do
    count="$(echo "$Line" | awk -F. '{ print $NF }' | awk '{ print length }')"  # Count length after the last period
    if [[ "$count" -gt '24' ]]; then echo "$Line" >> /tmp/filter; fi            # Save long TLD lines
done < "$Source"

#Remove results from source
cat /tmp/filter | sort > /tmp/filter.clean
comm -23 "$Source" /tmp/filter.clean > /tmp/clean

Solution

  • I think you are over-complicating the script.

    $ cat file
    www.cnn.com
    this.is.notrightbutstillpass
    this.will.fail.since.01234567890123456789012345
    

    I'm not sure of the actual TLD restrictions, but the limit in the code is easy to change:

    $ awk -F. 'length($NF)<=24' file
    www.cnn.com
    this.is.notrightbutstillpass
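
To apply the same filter to the file from the question and replace it in place, something like the following sketch might work. It assumes the limit is "at most 24 characters after the last period", as in the sample input; `filter_tld`, `max_len`, and the temporary files are hypothetical names used only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep only lines whose last dot-separated field is at most max_len
# characters long. filter_tld and max_len are hypothetical names.
filter_tld() {
    local max_len="$1"
    shift
    # Remaining arguments are input files; with none, awk reads stdin.
    awk -F. -v max="$max_len" 'length($NF) <= max' "$@"
}

# Demo on the sample input from the question, using a temporary file
# in place of /tmp/source.
src="$(mktemp)"
printf '%s\n' \
    'test.sub.xn--vermgensberatung-pwb' \
    'test.sub.xn--vermgensberatung-pwba' > "$src"

# Write the filtered result to a second temporary file, then replace the source.
tmp="$(mktemp)"
filter_tld 24 "$src" > "$tmp" && mv "$tmp" "$src"

cat "$src"
```

This makes a single pass over the file with one awk process, instead of starting two awk processes per line, and it keeps the original line order because no sort or comm step is needed.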