I have two directories of text files: one holds the User Descriptions and the other holds the User Messages, with fields DATE<\t>NAME<\t>DESCRIPTION and DATE<\t>NAME<\t>MESSAGE respectively. My main objective is to get a correlation matrix between the profile words and the message words across these two sets of files.
One example would be:

                     *message words*
                     cat   dog   mouse  ...
    *profile words*
    cat              100    20      50
    dog                2    30      22
    ...
    ...
Here, the number 100 at the intersection of the profile word "cat" and the message word "cat" means: the word "cat" appeared 100 times in all messages written by any user with "cat" in their profile description.
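To make the computation concrete, here is a minimal two-pass sketch in Java. The single file names descriptions.txt and messages.txt are placeholders (in reality each side is a directory of many files), and the tokenization (lowercasing, splitting on non-word characters) is only one possible choice:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class ProfileMessageCounts {
        public static void main(String[] args) throws IOException {
            // pass 1: NAME -> set of words in that user's profile description
            Map<String, Set<String>> profiles = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("descriptions.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t", 3);   // DATE, NAME, DESCRIPTION
                    if (f.length < 3) continue;
                    for (String w : f[2].toLowerCase().split("\\W+")) {
                        if (!w.isEmpty())
                            profiles.computeIfAbsent(f[1], k -> new HashSet<>()).add(w);
                    }
                }
            }
            // pass 2: stream the messages; for each message word, bump the
            // counter of every word in the author's profile description
            Map<String, Map<String, Long>> counts = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("messages.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t", 3);   // DATE, NAME, MESSAGE
                    if (f.length < 3) continue;
                    Set<String> profileWords =
                            profiles.getOrDefault(f[1], Collections.emptySet());
                    for (String m : f[2].toLowerCase().split("\\W+")) {
                        if (m.isEmpty()) continue;
                        for (String p : profileWords)
                            counts.computeIfAbsent(p, k -> new HashMap<>())
                                  .merge(m, 1L, Long::sum);
                    }
                }
            }
            // emit the matrix as profileWord <TAB> messageWord <TAB> count
            counts.forEach((p, row) -> row.forEach((m, n) ->
                    System.out.println(p + "\t" + m + "\t" + n)));
        }
    }

The messages are streamed line by line, so only the profile map and the count matrix need to fit in memory; with data this large, even that may not fit in -Xmx1800m, which is exactly my worry.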
I tried to solve this problem in Java, but even a much simpler version of the program did not run because of the size of the text files. Here's the problem I posted a few days ago; the awk tool did solve that previous problem.
My question is: is there any efficient way to solve this type of problem? I have no language restrictions. Also, I have some knowledge of bash utilities like diff, cat, etc.
Just for information: my User Messages directory holds 1.7 GB spread over multiple text files, and the User Descriptions directory is around 400 MB, also in multiple files. The most memory I can give to Java is -Xmx1800m.
Also, if this is not a valid question, please let me know. I will remove the post.
Thank you!
Try taking a look at the Lucene library. It originated in Java, but it has also been ported to C# and C++ (at least).

What you want to do is called "indexing": you create a document (which can, for example, be associated with a single file). Each document can contain optional fields, such as the directory where the file appears. After that it is very easy to count occurrences of particular words, or even of word forms (like "cat" vs. "cats").
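For illustration, here is a minimal sketch of that idea with Lucene's Java API. The index path, the field name "message", and the hard-coded sample line are my assumptions; in practice you would add one document per message line (or per file):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            // index: one document per message, with its text in a "message" field
            try (FSDirectory dir = FSDirectory.open(Paths.get("msg-index"));
                 IndexWriter writer =
                         new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("message", "my cat chased a mouse", Field.Store.NO));
                writer.addDocument(doc);
            }
            // query: total occurrences of a term across all indexed messages
            try (IndexReader reader =
                         DirectoryReader.open(FSDirectory.open(Paths.get("msg-index")))) {
                long freq = reader.totalTermFreq(new Term("message", "cat"));
                System.out.println("\"cat\" occurs " + freq + " time(s)");
            }
        }
    }

Because Lucene keeps the index on disk, the counting no longer depends on fitting the whole corpus into the Java heap.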