Is there a more efficient way to remove lines with an invalid or too-long TLD (top-level domain)? I'm not proficient with sed / awk. I want to remove lines from a file that have more than 24 characters after the last period.
What I wrote works, but it is extremely slow on long lists. It reads each line, counts the number of characters after the last period, saves lines with more than 24 of them to a list, then removes those lines from the source.
Sample Input:
test.sub.xn--vermgensberatung-pwb
test.sub.xn--vermgensberatung-pwba
Expected Output:
test.sub.xn--vermgensberatung-pwb
My current code:
Source='/tmp/source'
while read -r Line || [[ -n "$Line" ]]; do
count="$(echo "$Line" | awk -F. '{ print $NF }' | awk '{ print length }')" # count the length after the last period
if [[ "$count" -gt '24' ]]; then echo "$Line" >> /tmp/filter; fi # save lines with a long TLD
done < "$Source"
#Remove results from source (note: comm expects both inputs to be sorted)
cat /tmp/filter | sort > /tmp/filter.clean
comm -23 "$Source" /tmp/filter.clean > /tmp/clean
I think you're over-complicating the script:
$ cat file
www.cnn.com
this.is.notrightbutstillpass
this.will.fail.since.01234567890123456789012345
I'm not sure of the actual TLD length restrictions, but you can change the threshold easily:
$ awk -F. 'length($NF)<=24' file
www.cnn.com
this.is.notrightbutstillpass
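Applied to your own setup, a minimal sketch (assuming $Source is still /tmp/source and /tmp/clean is where you want the kept lines, as in your script):

$ awk -F. 'length($NF)<=24' "$Source" > /tmp/clean

That single pass replaces the read loop, the temporary filter file, and the comm step. An equivalent grep version, assuming every line contains at least one period, would be:

$ grep -Ev '\.[^.]{25,}$' "$Source" > /tmp/clean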