I need to remove redundant www.-prefixed lines from an ever-growing, HUGE list of domains: a www.X line is redundant when X itself is also in the list. Here's a sample:
# Type 1
domain1.tld
# Type 2
domain2.tld
www.domain2.tld
# Type 3
www.domain3.tld
sub.domain3.tld
foo.domain3.tld
www.sub.domain3.tld
# Expected
domain1.tld
domain2.tld
www.domain3.tld
sub.domain3.tld
foo.domain3.tld
The only thing that worked takes forever, since the list already contains more than 2 million lines:
cp 1.txt 2.txt
while read -r line; do
    # delete any line that is exactly "www." followed by $line
    # (anchored; dots inside $line are still treated as regex dots)
    sed -i "/^www\.${line}\$/d" 2.txt
done < 1.txt
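For scale: the loop above rewrites 2.txt once per input line, so it makes roughly N passes over an N-line file; with 2 million lines that is quadratic work. As a point of comparison (a sketch using GNU grep, not one of the answers below), the same exact-line filter can be done in a single fixed-string pass:

```shell
# Prefix every line with "www." to build the deletion set, then drop
# exact (-x) fixed-string (-F) matches of that set from the original file.
sed 's/^/www./' 1.txt | grep -Fxv -f - 1.txt > 2.txt
```

This loads the 2-million-entry pattern set into memory, but one pass over the file beats millions of sed passes.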
I'm using GNU utils and already fooled around with sed, awk, comm to no avail.
How can this be done?
#! /bin/bash
awk -F. '{
    if ($1 != "www") {
        arr[$0] = 1                        # remember every line not starting with "www"
    } else if (arr[substr($0, 5)] == 1) {
        next                               # stripped of "www.", the line is already known: skip it
    }
    print
}' file
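A condensed run of the same logic on the question's sample (comment lines removed) produces the expected output; note that it relies on each bare domain appearing before its www. twin, which happens to hold in the sample:

```shell
printf '%s\n' domain1.tld domain2.tld www.domain2.tld \
    www.domain3.tld sub.domain3.tld foo.domain3.tld www.sub.domain3.tld |
awk -F. '{
    if ($1 != "www") { arr[$0] = 1 }
    else if (arr[substr($0, 5)] == 1) { next }
    print
}'
# domain1.tld
# domain2.tld
# www.domain3.tld
# sub.domain3.tld
# foo.domain3.tld
```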
Check this out, although I am not sure how it will hold up against 2 million records.
UPDATE:
Explanation: The awk expression uses . as the field separator, so if the line is www.sub.domain3.tld, then $1=www, $2=sub, and so on.
It flags every line which doesn't start with www by making it an index into the array arr. Suppose the line is sub.domain3.tld: it creates the index arr["sub.domain3.tld"] and stores 1 in it. Now for every line starting with www., it strips the www. prefix and checks whether the remaining line is stored in the array; if it is, the line is not printed. (Note this only works when the bare domain appears before its www. counterpart in the input; the versions below remove that requirement.)
UPDATE:
This version produces the same result regardless of the order in which the input lines appear, although the output comes out in arbitrary order:
#! /bin/bash
awk -F. '{
    if ($1 != "www") {
        domains["www." $0] = 0   # pre-mark this line'\''s www. twin as redundant
        domains[$0] = 1          # keep the bare domain
    } else if (domains[$0] == "") {
        domains[$0] = 1          # www. line with no bare twin seen (yet): keep it
    }
}
END {
    for (domain in domains) {
        if (domains[domain]) { print domain }
    }
}' file
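For example, with the www. line first (an ordering where the first script would fail), a condensed form of this program still drops the duplicate:

```shell
printf '%s\n' www.domain2.tld domain2.tld domain1.tld |
awk -F. '{
    if ($1 != "www") { domains["www." $0] = 0; domains[$0] = 1 }
    else if (domains[$0] == "") { domains[$0] = 1 }
}
END { for (d in domains) if (domains[d]) print d }'
# prints domain1.tld and domain2.tld, in arbitrary order
```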
This version produces the result in the original input sequence, again independent of the order in which the input is supplied:
#! /bin/bash
awk -F. '{
    if ($1 != "www") {
        redundant_domains["www." $0] = 1   # this line'\''s www. twin is redundant
    }
    domains[NR] = $0                       # remember every line in input order
}
END {
    for (i = 1; i <= NR; ++i) {            # i <= NR, or the last line is dropped
        if (!redundant_domains[domains[i]]) { print domains[i] }
    }
}' file
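Run on input where the www. twin comes first, a condensed form keeps input order while still dropping the duplicate (note the i <= NR loop bound, so the final line is not lost):

```shell
printf '%s\n' www.domain2.tld domain1.tld domain2.tld |
awk -F. '{
    if ($1 != "www") { redundant["www." $0] = 1 }
    domains[NR] = $0
}
END { for (i = 1; i <= NR; ++i) if (!redundant[domains[i]]) print domains[i] }'
# domain1.tld
# domain2.tld
```

Since every line is held in memory until END, expect the awk process to hold all 2 million lines at once; that is still far cheaper than the per-line sed rewrites.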