Tags: string, split, out-of-memory, text-files, large-data

Prepare a correlation matrix of huge text files


I have two directories of text files: one holds User Descriptions and the other holds User Messages, with the fields DATE<\t>NAME<\t>DESCRIPTION and DATE<\t>NAME<\t>MESSAGE respectively. My main objective is to build a correlation matrix between the profile words and the message words across these two sets of files.

One example would be:

                              *message words*
                          cat     dog    mouse ....
*profile words*    cat    100     20      50
                   dog     2      30      22  ...
                   ...
                   ...

Here, the value 100 in the cat/cat cell means that the word "cat" appeared 100 times across all messages written by any user who has "cat" in their profile description.
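To make the intended counting concrete, here is a naive in-memory Java sketch of that logic (the directory names `descriptions` and `messages` are placeholders; it only illustrates what is being counted and, as noted below, does not scale to the actual data sizes):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class NaiveWordMatrix {

        public static void main(String[] args) throws IOException {
            // Pass 1: user -> set of words appearing in that user's description.
            Map<String, Set<String>> userProfileWords = new HashMap<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("descriptions"))) {
                for (Path file : files) {
                    try (BufferedReader in = Files.newBufferedReader(file)) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            String[] f = line.split("\t", 3);              // DATE, NAME, DESCRIPTION
                            if (f.length < 3) continue;
                            Set<String> words = userProfileWords.computeIfAbsent(f[1], k -> new HashSet<>());
                            for (String w : f[2].toLowerCase().split("\\W+"))
                                if (!w.isEmpty()) words.add(w);
                        }
                    }
                }
            }

            // Pass 2: for every message word, bump the count for each of the
            // author's profile words: matrix[profileWord][messageWord]++.
            Map<String, Map<String, Long>> matrix = new HashMap<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("messages"))) {
                for (Path file : files) {
                    try (BufferedReader in = Files.newBufferedReader(file)) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            String[] f = line.split("\t", 3);              // DATE, NAME, MESSAGE
                            if (f.length < 3) continue;
                            Set<String> profile = userProfileWords.getOrDefault(f[1], Collections.emptySet());
                            for (String msgWord : f[2].toLowerCase().split("\\W+")) {
                                if (msgWord.isEmpty()) continue;
                                for (String profWord : profile)
                                    matrix.computeIfAbsent(profWord, k -> new HashMap<>())
                                          .merge(msgWord, 1L, Long::sum);
                            }
                        }
                    }
                }
            }

            // e.g. how often "cat" appears in messages from users with "cat" in their description
            System.out.println(matrix.getOrDefault("cat", Collections.emptyMap())
                                     .getOrDefault("cat", 0L));
        }
    }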

I tried to solve this problem in Java, but even a much simpler version of the program failed to run because of the size of the text files. Here is the problem I posted a few days ago; the awk tool did solve that previous problem.

My question is: is there an efficient way to solve this type of problem? I have no language restrictions, and I also have some knowledge of bash utilities such as diff, cat, etc.

Just for information, the User Messages directory holds about 1.7 GB of text spread over multiple files, and the User Description directory is around 400 MB, also split across multiple files. The most memory I can give to Java is -Xmx1800m.

Also, if this is not a valid question, please let me know. I will remove the post.

Thank you!


Solution

  • Take a look at the Lucene library. It originated in Java but has also been ported to C# and C++ (at least).

    What you are doing is called "indexing": you create a document (for example, one associated with a single file), and each document can contain optional fields, such as the directory the file came from. After that it is very easy to count occurrences of a particular word, or even of its word forms (like cat vs. cats); a minimal sketch follows below.
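
    As a minimal sketch of that idea (assuming a reasonably recent Lucene release, roughly 5.x or later, and hypothetical field names `name` and `message`), indexing a message line and then asking for a term count might look like this:

        import java.nio.file.Paths;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class LuceneSketch {
            public static void main(String[] args) throws Exception {
                Directory dir = FSDirectory.open(Paths.get("message-index"));

                // Index one document per message line; the analyzer tokenizes the text,
                // and the index lives on disk rather than in the JVM heap.
                try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                    Document doc = new Document();
                    doc.add(new StringField("name", "someUser", Field.Store.YES));                  // exact-match field
                    doc.add(new TextField("message", "the cat sat on the mat", Field.Store.NO));    // tokenized field
                    writer.addDocument(doc);
                }

                // Later: ask the index how often a term occurs across all indexed messages.
                try (IndexReader reader = DirectoryReader.open(dir)) {
                    long catCount = reader.totalTermFreq(new Term("message", "cat"));
                    System.out.println("\"cat\" occurs " + catCount + " times in indexed messages");
                }
            }
        }

    To build the matrix itself you could, for example, also index each author's profile words as an extra field and count message terms only over documents matching a given profile word. The point is that Lucene keeps the index on disk, so the 1.7 GB of messages never has to fit into the -Xmx1800m heap.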