There is one file per day, containing 2,000,000 to 4,000,000 lines, each of which is a unique 15-digit number, like this:
850025000010145
401115000010152
400025000010166
770025555010152
512498004158752
Since the beginning of the current year, there is one such file for each day. I have to compare every line of today's file against all of the previous files from the beginning of the year, and return only those numbers that have never appeared before in any of the checked files.
Which language and algorithm should I use? How should I implement it?
You should be able to do this without writing any code beyond a simple script (e.g. bash, Windows batch, PowerShell, etc.). There are standard tools that make quick work of this type of thing.
First, you have some number of files that contain from 2 million to 4 million numbers each. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:
cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort
(The -u switch removes duplicates.)
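As a quick illustration of what sort -u does, with made-up three-digit values standing in for the real 15-digit numbers:
printf '222\n111\n222\n' | sort -u    # prints 111 then 222: sorted, duplicate removed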
The problem with that approach is that you end up sorting a very large file. Figure 4 million lines at 16 bytes each (15 characters plus a newline), and almost 100 days of files, and you're working with close to 7 gigabytes. A whole year's worth of data would be around 23 gigabytes. That takes a long time.
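The back-of-the-envelope arithmetic behind those figures, assuming 16 bytes per line and the 4-million-line worst case:
echo $(( 4000000 * 16 ))          # 64,000,000 bytes: about 64 MB per daily file
echo $(( 4000000 * 16 * 100 ))    # about 6.4 GB for 100 days of files
echo $(( 4000000 * 16 * 365 ))    # about 23 GB for a full year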
So instead, sort each individual file, then merge them:
sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort
The -m switch merges the already-sorted files.
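If the daily files follow a predictable naming pattern, the per-file sorting is easy to loop over. A minimal sketch, assuming the files are named file_20140101, file_20140102, and so on (the names are hypothetical):
mkdir -p sorted                    # keep the sorted copies out of the way of the glob
for f in file_2014*; do
    sort -u "$f" > "sorted/$f"     # sort each day's file individually
done
sort -m -u sorted/file_2014* > combined_sort   # merge the pre-sorted files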
Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:
sort -u today >today_sort
Now, you can compare the two files and output only the lines unique to today's file:
comm -2 -3 today_sort combined_sort
-2 says to suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
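A quick sanity check with toy data (made-up short values in place of the real 15-digit numbers):
printf '111\n222\n333\n' > today_sort        # today's sorted numbers
printf '111\n333\n444\n' > combined_sort     # everything seen before today
comm -2 -3 today_sort combined_sort          # prints only 222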
Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use the combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:
comm -2 -3 today_sort combined_sort > new_values
Then:
sort -m combined_sort new_values > combined_sort_new
You'd probably want to name the files with the date, so you'd have combined_sort_20140401, combined_sort_20140402, etc.
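One way to build those dated names in a script; this is a sketch that assumes GNU date, since it uses the -d option:
today=$(date +%Y%m%d)                     # e.g. 20140402
yesterday=$(date -d yesterday +%Y%m%d)    # e.g. 20140401 (GNU date only)
old_combined_sort=combined_sort_$yesterday
new_combined_sort=combined_sort_$today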
So if you started at the beginning of the year and wanted to do this every day, your script would look something like:
sort -u $todays_file > todays_sorted_file
comm -2 -3 todays_sorted_file $old_combined_sort > todays_uniques
sort -m $old_combined_sort todays_uniques > $new_combined_sort
$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":
daily todays_file.txt all_values_20140101 all_values_20140102
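Put together, the whole "daily" script might look like this minimal sketch (the variable names are illustrative, and error handling is omitted):
#!/bin/sh
# daily: report never-before-seen numbers and update the combined file
todays_file=$1            # today's raw file, e.g. todays_file.txt
old_combined_sort=$2      # the running combined file, e.g. all_values_20140101
new_combined_sort=$3      # the updated combined file, e.g. all_values_20140102

sort -u "$todays_file" > todays_sorted_file
comm -2 -3 todays_sorted_file "$old_combined_sort" > todays_uniques
sort -m "$old_combined_sort" todays_uniques > "$new_combined_sort"
cat todays_uniques        # the numbers never seen before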