I have one .txt file with 78k lines of British words and another .txt file with the 5k most common British words. I want to filter the most common words out of the big list so that I end up with a new list of the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, since this does not work.
I have tried the following:
//To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
//But this procedure apparently gives me two empty files.
If I run just the grep without cut first, I get words that I know are in both files.
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
//No luck either
The two text files, in case anyone wants to try for themselves: https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft line endings while 5k-most-common-sorted.txt has Unix line endings and (b) you are trying to do a whole-line compare (grep -x). With a trailing carriage return on every line of the larger file, no line can match a pattern exactly. So, first we need to convert to a common line ending:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
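If dos2unix is not installed, the line endings can be checked and fixed with standard tools. The following is only a minimal sketch on made-up sample data (words-crlf.txt and words-fixed.txt are hypothetical names, not your files). Note that tr removes every carriage return, not only those at the end of a line, which should be harmless for a plain word list.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample: three words with Windows (CRLF) line endings.
printf 'apple\r\nbanana\r\ncherry\r\n' > words-crlf.txt

# Count lines that contain a carriage return; 0 would mean Unix line endings.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" words-crlf.txt)
echo "lines with CR before: $crlf_lines"

# Strip all carriage returns, writing to a NEW file (never the input file).
tr -d '\r' < words-crlf.txt > words-fixed.txt

# grep -c exits non-zero when the count is 0, hence the "|| true".
fixed_lines=$(grep -c "$cr" words-fixed.txt || true)
echo "lines with CR after: $fixed_lines"
```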
Now, we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than as regular expressions. This also speeds things up.
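To see what the flags do together, here is a small sketch on made-up data (big.txt and common.txt are hypothetical stand-ins for the two word lists). The entry "a.c" is there only to illustrate -F: as a regular expression it would also match "abc", but as a fixed string it does not.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical stand-ins for the large list and the common-word list.
printf 'a.c\nabc\nApple\napples\nbanana\n' > big.txt
printf 'apple\na.c\n' > common.txt

# Keep only lines of big.txt that are not a whole-line, case-insensitive,
# fixed-string match for any line of common.txt.
grep -xivFf common.txt big.txt > less-common.txt
cat less-common.txt
```

Here "Apple" is dropped because of -i, "apples" survives because of -x, and "abc" survives because of -F.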
I note that there are several words in the 5k-most-common-sorted.txt file that are not in brit-a-z-sorted.txt. For example, "British" is in the common file but not the larger file. Also the common file has "aluminum" while the larger file has only "aluminium".
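One way to list such words is to run the same grep with the two files swapped. This is a sketch on made-up sample data (big.txt, common.txt and only-in-common.txt are hypothetical names), not the output from your actual files.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical samples mirroring the mismatches mentioned above.
printf 'aluminium\nbritain\ncolour\n' > big.txt
printf 'aluminum\nBritish\ncolour\n' > common.txt

# Swap the roles: patterns come from the big list, and we print the
# common-list lines that match none of them.
grep -xivFf big.txt common.txt > only-in-common.txt
cat only-in-common.txt
```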
What do the grep options mean? For those who are curious:
-f means read the patterns from a file.
-F means treat them as fixed strings, not regular expressions.
-i means ignore case.
-x means do whole-line matches.
-v means invert the match. In other words, print those lines that do not match any of the patterns.
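As for the two empty files in your first attempt: a redirection such as "> 78kfile.txt" makes the shell truncate that file when the pipeline starts, which typically happens before cut has read anything from it. Reading from and writing to the same file in one pipeline is therefore unsafe. A minimal sketch of the usual workaround, on made-up data (words.txt and words.tmp are hypothetical names):

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample: a word followed by extra text on each line.
printf 'apple 1\nbanana 2\n' > words.txt

# Unsafe (left commented out): the output redirection empties words.txt,
# typically before cut gets to read it.
# cut -d" " -f1 words.txt > words.txt

# Safer: write to a temporary file, then replace the original on success.
cut -d" " -f1 words.txt > words.tmp && mv words.tmp words.txt
cat words.txt
```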