Search code examples
sortinglongest-substring

Find the most repeated sequence of chars in a list of domains


I have a very long list of domains (for example: blockme.com, do.remove-me.com.. etc) millions of them. I would like to use this list to create a list of regular expressions to cover as much as possible of these domains.

I normally use a terminal command to find the most repeated character sequences:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head  -10

which gives me:

 125054 com
  11715 net
   7823 us
   7502 co
   6736 2
   6628 west
   6495 sequoia
   6464 org
   5901 uk
   5815 sex

The problem is that this command does not work as intended. For example the word sex is actually repeated 90k times in the file, but it is only detected 5815 times. And com exists 127k times, not 125k.

I did some research and I found out that this problem is called the Longest repeated substring problem. I found many implementations especially here on stackoverflow, but none of them could replicate the output of the terminal command.

Can someone help me write it please in either python or c++?


Solution

  • I wrote a brute force answer which takes a couple of minutes but achieves this solution.

    You will have to run tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10 on the resulting file:

    import sys
    sys.path.append('../')
    from utils import *
    
    minimum_split = 3
    
    def return_splits(string, count):
        cut_len = len(string) - count
        results = []
        for i in range(count+1):
            results.append(string[i:cut_len+i])
        return results
    
    if __name__ == '__main__':
        input_file_name = str(sys.argv[1])
        output_file_name = input_file_name + "_out"
    
        file_to_clean = readLinesFromTextFile(input_file_name, remove_zeros=False, remove_comments=True)
    
        splitted = []
        for line in file_to_clean:
            data = line.split()
            for line2 in data:
                splitted.append(line2.strip())
    
        splitted_final = []
        for line in splitted:
            if (len(line) < 3):
                continue
    
            if (len(line) == 3):
                splitted_final.append(line)
                continue
            else:
                for t in range (len(line) - (minimum_split-1)):
                    if (t == 0):
                        splitted_final.append(line)
                    else:
                        res1 = return_splits(line, t)
                        for tt in res1:
                            splitted_final.append(tt)
    
        with open(output_file_name, "w") as txt_file:
            for line in tqdm(splitted_final):
                txt_file.write(str(line) + "\n")