
How to remove redundant www subdomain in a shell script (sed/awk/etc)?


I need to remove entries with a redundant "www." prefix from an ever-growing, HUGE list of domains. A "www." entry is redundant when the same name without "www." is also in the list. Here's a sample:

# Type 1
domain1.tld
# Type 2
domain2.tld
www.domain2.tld
# Type 3
www.domain3.tld
sub.domain3.tld
foo.domain3.tld
www.sub.domain3.tld

# Expected
domain1.tld
domain2.tld
www.domain3.tld
sub.domain3.tld
foo.domain3.tld

The only thing that worked took forever since the list already contains more than 2 million lines.

cp 1.txt 2.txt
while read line; do
  sed "/www.$line/d" -i 2.txt
done < 1.txt

I'm using GNU utils and already fooled around with sed, awk, comm to no avail.

How can this be done?


Solution

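    #! /bin/bash
    
    # Assumes each bare name appears before its www. variant in the input.
    awk -F. '{
        if ($1 != "www") {
            # Remember every name that does not start with www.
            arr[$0] = 1
        }
        else if (substr($0, 5) in arr) {
            # The name without the leading "www." was already seen: skip it.
            next
        }
        print
    }' file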
    

    Try this. It reads the file only once, although I am not sure how it performs with 2 million records.

    UPDATE:

    Explanation: The awk expression uses . as the field separator, so for the line www.sub.domain3.tld, $1 is www and $2 is sub.

    It records every line that does not start with www as an index in the array arr. For example, the line sub.domain3.tld becomes the index arr[sub.domain3.tld], with 1 stored in it. Then, for every line starting with www., it strips the www. and checks whether the rest of the line is already in the array; if it is, the line is not printed. This only works when the bare name appears in the input before its www. variant.
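To see why input order matters for this single-pass logic, here is a small sketch. The function name `drop_www_seen_so_far` is made up for illustration, and it reads from standard input rather than from `file`. It feeds the same two lines through in both orders.

```shell
#!/bin/sh
# Hypothetical wrapper around the single-pass logic described above.
drop_www_seen_so_far() {
    awk -F. '
        $1 != "www" { seen[$0] = 1 }                      # remember bare names
        $1 == "www" && (substr($0, 5) in seen) { next }   # redundant: skip
        { print }
    '
}

# Bare name first: the www. line is recognised as redundant and dropped.
printf '%s\n' domain2.tld www.domain2.tld | drop_www_seen_so_far

# www. line first: domain2.tld has not been seen yet, so both lines survive.
printf '%s\n' www.domain2.tld domain2.tld | drop_www_seen_so_far
```

The first call prints only domain2.tld; the second prints both lines unchanged, which is the case the later versions address.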

    UPDATE:

    This version produces the correct result regardless of the order of the input, although the output order is arbitrary, because awk's for (key in array) loop does not iterate in insertion order:

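    #! /bin/bash
    
    awk -F. '{
        if ($1 != "www") {
            domains["www."$0]=0    # the www. variant of this name is redundant
            domains[$0]=1          # the bare name is always kept
        }
        else {
            # Keep a www. name unless its bare name has already marked it redundant.
            if (domains[$0] == ""){ domains[$0]=1 }
        }
    }
    END {
        for (domain in domains) {
            if (domains[domain]) { print domain }
        }
    }' file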
    

    This version should keep the original line order, again regardless of the order of the input:

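    #! /bin/bash
    
    awk -F. '{
        if ($1 != "www") {
            # The www. variant of a bare name is redundant.
            redundant_domains["www."$0]=1
        }
        domains[NR]=$0    # keep every line, in input order
    }
    END {
        # Visit all NR stored lines, including the last one.
        for (i=1 ; i <= NR ; ++i) {
            if (!redundant_domains[domains[i]]) { print domains[i] }
        }
    }' file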
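As an alternative that is not part of the answer above, the same order-preserving result can probably be had by reading the file twice, so that only the bare names are held in memory instead of every line. This is a minimal sketch under that assumption; the function name `strip_redundant_www` and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: prints the file named in $1 without redundant www. lines.
# The file is passed to awk twice, so it must be a regular file, not a pipe.
strip_redundant_www() {
    awk '
        # First pass (NR == FNR): collect every name that lacks a www. prefix.
        NR == FNR { if ($0 !~ /^www\./) bare[$0]; next }
        # Second pass: drop a www. line only if its bare name was collected.
        /^www\./ && (substr($0, 5) in bare) { next }
        { print }
    ' "$1" "$1"
}

# Usage (list.txt and cleaned.txt are placeholder names):
#   strip_redundant_www list.txt > cleaned.txt
```

The trade-off is a second read of the input in exchange for a smaller array; whether that is faster on a list of 2 million lines would need to be measured.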