I have two text files. "file1.txt" contains a large number of character strings, each 77 characters long. Each string is on its own line, and the lines are grouped into blocks separated by an empty line. Blocks contain different numbers of lines. Each line looks similar to the example below:
ATGATTTCCTGGGGGCTGAGCTGTTTGTCCGTGCTGGGGGCTGCCGGCACCACTCTCCTGTGTGCGGGT
TGCTGCTCGGCCTGGCCCAACAACTCTGGACCCTCCGCTGGACTCTGAGCCGGGATTGGGCCTCCACCT
GCCCCTACCCAAGGGCTCCATGGGCTGGCCATTCTTCGGTGAAACGCTGCACTGGTTGGTACAGGGCTC
CGTTTCCACAGTTCCCGCCGCGAGCGCTACGGGACAGTGTTTAAGACGCACCTTCTGGGCAGGCCAGTG
TCCGGGTGAGCGGCGCTGAGAACGTGCGCACCATCCTGCTGGGCGAGCACCGCCTGGTGCGTAGCCAGT
GCCACAGAGTGCGCATATTCTACTAGGGTCACACACACTACTTGGCGCGGTTGGTGAGCCCCATCGGCA
CGGCGTAAGGTCCTGGCGCGCGTGTTCAGCCGCTCCTCTCTGGAGCAATTCGTGCCACGGTTGCAGGGG
CGCTGCGGCGAGAGGTGCGCTCCTGGTGCGCCGCCCAACGACCGGTGGCTGTTTACCAGGCGGCCAAAG
ACTCACCTTCCGCATGGCCGCGCGCATCCTGCTGGGTCTGCAGCTGGACGAAGCGCGATGCACCGAGCT
GCCCATACCTTTGAACAGCTGGTGGAGAACCTCTTCTCACTGCCCTTGGACGTACCGTTCAGCGGCCTG
GCAAGGGCATCCGGGCCAGGGACCAGTTGTATGAGCACCTGGATGAGGCCGTCGCTGAGAAGCTTCAGG
GAAACAGACAGCAGAGCCAGGTGATGCCCTGCTCTTGATTATTAACAGCGCTAGGGAGCTGGGCCACGA
CCCTCAGTGCAAGAGCTGAAGGAGTTGGCTGTAGAGCTCCTCTTCGCGGCCTTTTTCACCACAGCCAGC
CCAGCACATCCCTCATCCTGCTGCTTCTGCAGCACCCAGCAGCCATCACCAAAATCCAGCAGGAGCTGT
AGCGCAGGGCCTGGGGCGCGCGTGCACTTGCACACCCAGAGCCTCAGGATCGCCACCGGACTGCGGTTG
GAGCCGGACCTTAGCCTGGCCATGCTGGGCCGTTTGCGCTACGTCGACTGCGTAGTCAAGGAGGTGCTG
GCCTCCTACCGCCGGTGTCCGGGGGCTACCGCACTGCGCTGCGCACCTTTGAACTGGACGGTTACCAGA
CCCCAAAGGCTGGAGCGTGATGTATAGCATCCGAGACACGCATGAGACAGCCGCAGTGTACCGTAGCCC
CCCGAGGGCTTCGATCCGGAGCGCTTTGGCGTGGAGAGTGGAGACGCGCGGGGCTCCGGTGGCCGCTTT
ATTACATCCCGTTCGGCGGCGGCGCGCGCAGCTGCCTGGGGCAGGAGCTAGCGCAGGCGGTGCTGCAAC
GCTCGCAGTCGAGCTGGTGCGCACCGCGCGCTGGGAGCTGGCCACACCTGCCTTCCCCGTAATGCAGAC
GTGCCCATCGTGCACCCGGTGGACGGGCTGCTGCTCTTTTTCCACCCTCTTCCGACTTCGGGTGCGGGA
ATGGGTTACCCTTCTG

CTTTTCACCAGCTTGGTTTCACCTTACAGCTGCAGTGAGCCAGTTTCAGTTGGAGGAGAGGCCACATCC
CTTTGCTGTAGGCCTCTGGTTAGAAGCATGCATGGCTGGCTGCTCCTGGTCTGGGTCCAGGGGCTGATA
AGGCTGCCTTCCTCGCTACAGGAGCCACAGCAGGCACGATAGATACAAAGAGGAACATCTCTGCAGAGG
AGGTGGCTCTGTCATCTTACAGTGTCACTTCTCCTCTGACACAGCTGAAGTGACCCAAGTCGACTGGAA
CAGCAGGACCAGCTTCTGGCCATTTATAGTGTTGACCTGGGGTGGCATGTCGCTTCAGTCTTCAGTGAT
GGGTGGTCCCAGGCCCCAGCCTAGGCCTCACCTTCCAGTCTCTGACAATGAATGACACGGGAGAGTACT
CTGTACCTATCATACGTATCCTGGTGGGATTTACAAGGGGAGAATATTCCTGAAGGTCCAAGAAAGCTC
GTGGCTCAGTTCCAGACTGCCCCGCTTGGAGGAACCATGGCTGCTGTGCTGGGACTCATTTGCTTAATG
TCACAGGAGTGACTGTACTGGCTAGAAAGAAGTCTATTAGAATGCATTCTATAGAAAGTGGCCTTGGGA
AACAGAAGCGGAGCCACAGGAATGGAACCTGAGGAGTCTCTCATCCCCTGGAAGCCCTGTCCAGACACA
ACTGCCCCTGCTGGTCCCTGTGGAGAGCAGGCAGAAGATGACTATGCTGACCCACAGGAATACTTTAAT
TCCTGAGCTACAGAAGCCTAGAGAGCTTCATTGCTGTATCGAAGACTGGCTAACGACAGCTCTCTATCC
TCTCCCTATGTCTCTCTCTCTCTGTCTCTCTCTGTCTCTCTCTGTCTCTCTCTGTCTCTGTCTCTGTCT
TGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTGTGTGTGTGTGTGTGTGTG

GAAAGAAAGCTAGGACTGGCTCAAAGGTTGGATCTGAAAGTTGGGGTTGATGAATGGCCCATGTCAAAG
CTTAGGAAAGCGGAGGGGCTTTTGGAGGGTTGCACTTCGGCCGTCACCGTCTTCAGGAAACTCCCTTTG
ATTCCAAGGCTCAAAGTCAGAAAATGAGATACAAGACATCCTTGGTGATGAGGAAACGATTACGGCTTT
CCGAAACACTCTTAAAGAGTCAAGTAGCAGCTCTGGACACCATGGCCCCCAGCTCACCGCCGCCTCCAG
CCCTCGGTGTTCCCGGGCCTCCACGAGGAGCCTCCCCAGGCCTCCCCCAGCCGTCCTTTGAATGGACTC
TGCGTCTGGGGCTCCCTGGAGACATGTACGCGCGGCCGGAGCCCTTCCCGCCAGGGCCTGCGGCCCGCA
CGACGCCCTGGCAGCTGCCGCAGCCCTGCATGGCTACGGGGGCATGAACCTGACGGTGAACCTCGCTGC
CCCCACGGTCCTGGCGCTTTCTTCCGCTACATGCGCCAGCCCATCAAACAGGAGCTCATCTGCAAGTGG
TGGCGGCCGACGGCACCGCGACCCCGAGCCTCTGCTCCAAAACTTTCAGCACCATGCACGAGCTGGTCA
GCACGTCACCGTGGAGCACGTCGGCGGCCCGGAACAGGCCAACCACATTTGCTTCTGGGAGGAGTGTCC
CGCCAGGGAAAGCCCTTCAAAGCCAAATACAAACTTGTAAATCACATCCGCGTGCACACGGGCGAGAAG
CCTTCCCTTGTCCTTTCCCGGGGTGTGGGAAGGTCTTTGCTAGATCAGAAAATCTCAAAATACACAAAC
AACTCACACAGGCGAGAAGCCCTTCAGATGCGAGTTCGAGGGCTGCGAGCGGCGCTTCGCCAACAGCAG
GACCGTAAGAAGCATTCGCACGTGCACACTAGCGACAAGCCATACACGTGCAAGGTGCGGGGCTGCGAC
AGTGCTACACGCACCCCAGCTCGCTGCGTAAGCACATGAAGGTGCACGGGCGCTCGCCGCCGCCCAGCT
TGGCTACGATTCGGCTACACCGTCTGCCCTCGTGTCGCCCTCGTCGGACTGCGGCCACAAGTCCCAGGT
GCCTCCTCGGCGGCGGTGGCGGCGCGTACCGCCGACTTGAGCGAATGATGTCCACCGCGTTGCTCGCAA
GTAATCTCGCTCCGCGCAGCTGAGCGCCCCGCATCTCGCGCCTGCTACATCAAAGGGCCCGCGCACAAA
CAGTGTTTCTTCGCCACGGTGCATCTTCATGGTAAGTTAGGATTTCTATGGCAATGTGCAAGTCGCACT
AAATCCTGAAAGGCCAAGCCTGGAGCCCGTCCAGGCTTTTCATTAAGGACATAATATTTACGTCTAACA
ACCTTTTTTCTTGTGTATACAAGTATATATTTTTGTTTGACGCGGACTAAATCATTTTCATTTAATTTC
GGTAAACAAAACCCACGCGAATGGGCACTTGTACCCGATCATAATAAAAATGGATAATAATGTGAAGGA
GAAAAGAGCCGCTTGAATCGCCGCTCAGCCCCCTTTGTTTCTGCTTTCTGCGGTGATCAGAGGGCGCGT
TGGGTTTGATGGCGAGTTTCTAAAGGCGAGGAAATGGTTTGTAAGAGGGGAAAGAAAAGGAGAAAGGTC
AATCAAGCTCGGGTTGTTCAAAGAGTCGGGTTTTGGGGTTGAAAGTGTGAGTTTGACGGTGCATCAGCA
GCCGCGTTAGGCTCGCCATGGAAATACGCGCGGGGAGCGGCCGCTTCAAAGGCGGCACACTTCACTACA
ACACTCTATTAAGATACATTTGCGCTGACCTTTGCTTTCACGCCATTTAATACTGTCACTGCGCTCTCC
GTATATACTTCCTTTCTAGAACCCGACTTGCCCACGTTTAGGGGTTCACTCTGCACCCTGATGTGGGAG
CTTTGGCGCAGGGGACACTTTCAGGAAAGGGAGGAGCACAAGGACTCTGTGCATCTTGACTGCACCCCA
AGAGGCTCCAGGATCAGGAGTGAAAGATTTTAAAGCAGCCTCCGAAGCTTAACAAATGAGCATTCCAAG
TCAGTTTTGTGCAAATCGCCTTTCTGACTCTTGAGTAGGATGGAGGCTTAAATTTAATGGCGACTTGGG
GGAAGGGAGCCACCCTGGGGGAGTCTGAGGAGTTCAGACTGTGCCCTTGGGAATTTCCACTCTGGCTTT
CGTGCCACTCTTCTTCCTTTCCATCCCAAAAGTCTCTTGCGGCCCCTGAAACTTGTTTCTTTCTAAGGC
GGGTGTGTGGTACCCTTAGGCCTGGACTAGTCCTAGATGCAAACTCAAGAGCCCAAGGCCAAGGGGATG
GGGGAAGATGGCAGGAAAGTTAGAAGTCCATGTTCCCTTAATTGTCTTGTTGTTTATTTTATCCAAGTA
CCCAGTGAATAGGGGAAAAATAAACACAGTGAAAAAAAAAATCAAACAGTGGAGTCTTCTTTAGTGCCA
TCCTTGTGGTTGAATAAAAAGGATGGTCCGCTTTCTATTGAGCTGAGAAATCTTTGAAGTGGGAGTTAT
ATCTGAGACATTCCTGCTTGTCGTCCTAACAACGCTGATGAAACGTAAAAGGTTCTTTGTCAGCGATTT
TTCTCCTCTCTGTCAAACTCCCTCTGCCCCGTTAGTTTCAAACCGTTTCTAAAGAGATAAAAATCAAAC
TCTTTTAAAACAATATCCACACACTGCATCAATACATAACTTTAGGTCTAAGTCTTGCTAAGGGATAAA
AAAAGCAATGCCTAGACATCAGGGTCAGGGCCTGGTCTGGTGAAGTATGCAGAAGTTGGGGGGCCCTCG
GACAAGCTTTGGGACATGAGGAAAAGAATGCAGAGAGGGTGCAAGCAGAATACATACCCTAAGTCCATA
TTGTGTTTCTGCTTCTTTCTGCTCTGGTTTGCATTCAATCAGCCCAAGTTGGGTCACATAGATGGGTTT
CTTTGGGTACCCCTCAGGCTCCTAATATTCTTGCCCAGGATCCTTGGAACTTAAGAATGCAGCCAAGCA
TTGTTAATATCTCCTGCTCCTTCAAAGCCACCTCTGCTAAAAATAGACCCATTGTGTGTTTCTTCTCAC
AGCAGCAATCAACAAGCCCTTTCTGCCGTTAATAAGAAGGAGAATAGCTGAAGGAGAGAGATATTTTAT
AATTTCCTGTTTCCTTCAGAATCTTGGCAATTGAAGTTTAGAAGGTTTGGTCTACAACACAGTGATCGA
AATGCATGTAAATGCCCATCCTTCCCTTCATTCACGTGTGAAGTTGTTCATTTTATATTGTGCCCAGCA
AGAAACTTTCACCCAGTTCAGGTTTCCCCAAAACTCCTGTGGTGGTTTTAAAGGTGGTTTAAATAAATA
GGATGTGCTGGTCCCCCTACTCTGTGTGTGCTGAATAAATGGCTTGTAAAGAAGTTTTTCCAAGCTGTA
CCCATGCTGTTATTATAGTTGCTGCAAAATGTTCTTCCTGATATTGATTTTATTTGTTAACTGAAGGTC
CCATATGTTTGTTTATATTGCTAATTTATGAGAAAATGTAATAATTGCAATGAATGTGAATTATACAGA
AGGCAAACATTTTGTAATCATAATTCACATATACACAAAAGCCTGGCTGAAATCTTTAGACTATTTGTA
CCTCTCTACCCACACTGTTTGTGATTTATCATCTGTCTCTTTAGTGTCAGTTAAATTATGAACTAACTT
AAAATAAAAGTTGTTTGACTGAAAGTGATTGTTGAATGAACAACAAAGTTGAAAGCCATGGCTTGATCT
GTAAATATATAAATGTAAATGATATTAAATCTGTGATTCCTTTTCCCTCCAAAGGCTTTTGTGTACATG
CGCTGCATTTGGCTATTTTCTTTGGAAATAAATAATGTGATGTTTCTCTTCCTCTTTTGA
I want to do two things:
First, I want to search for each line of file1.txt in another, larger file, file2.txt (10M+ lines). I tried
grep -f file1.txt file2.txt
but it matches parts of lines rather than only whole lines. I also tried
for i in $(cat file1.txt); do grep $i file2.txt; done
but the result was not what I wanted.
Second, I want to search for each block of lines of file1.txt in file2.txt. Blocks are separated by an empty line, as shown in the example above.
Since the files are very large (tens of GB), I would like the commands to avoid rereading them repeatedly and to avoid consuming huge amounts of memory.
I'm guessing that you probably want this for your first problem (untested due to no testable example in the question):
awk '!NF{next} NR==FNR{a[$0]; next} $0 in a' file1.txt file2.txt
and this (again untested) for your second one:
awk -v RS= -v ORS='\n\n' 'NR==FNR{a[$0]; next} $0 in a' file1.txt file2.txt
The first script above assumes you want to do full-line literal string comparisons as opposed to partial line and/or regexp comparisons. The second script similarly assumes you want to do full-block literal string comparisons.
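Since the question has no testable example, here is a minimal sketch that tries both scripts on tiny made-up files. The sample contents and the temporary directory are assumptions for illustration only, not the asker's data. Both scripts hold the contents of file1.txt in an awk array and read file2.txt in a single pass.

```shell
# Hypothetical sample data: short stand-ins for the real sequences.
tmp=$(mktemp -d)
printf 'AAA\nCCC\n\nGGG\nTTT\n' > "$tmp/file1.txt"
printf 'AAA\nAAAC\n\nGGG\nTTT\n\nCCC\n' > "$tmp/file2.txt"

# Full-line matches: print each non-empty line of file2.txt that is
# identical to some line of file1.txt ("AAAC" must not match "AAA").
line_hits=$(awk '!NF{next} NR==FNR{a[$0]; next} $0 in a' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$line_hits"

# Full-block matches: RS= puts awk in paragraph mode, so each record is
# one block of lines delimited by empty lines.
block_hits=$(awk -v RS= -v ORS='\n\n' 'NR==FNR{a[$0]; next} $0 in a' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$block_hits"

rm -rf "$tmp"
```

With this sample data the first command prints AAA, GGG, TTT and CCC, one per line, and the second prints only the GGG/TTT block, because that is the only block of file2.txt that is identical to a whole block of file1.txt.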
Your grep would have produced large amounts of output because the empty lines from the first file would tell grep (or awk) to look for null strings in the second file and those would match on every line.
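The effect of the empty lines can be seen on a small made-up sample, sketched below. It also shows one possible grep-only alternative for the first problem, which is an assumption about what is wanted and not part of the awk scripts above: remove the empty lines from the pattern file, then use -F (fixed strings instead of regexps) and -x (whole-line matches). The file names under the temporary directory, including patterns.txt, are hypothetical.

```shell
# Hypothetical sample data; file1.txt contains one empty line.
tmp=$(mktemp -d)
printf 'AAA\n\nGGG\n' > "$tmp/file1.txt"
printf 'AAA\nAAAC\nGGG\n' > "$tmp/file2.txt"

# The empty line in file1.txt becomes an empty pattern, which matches
# every line of file2.txt, so all 3 lines are counted as matches.
all_count=$(grep -c -f "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$all_count"

# Keep only non-empty lines as patterns, then match them as whole-line
# fixed strings.
grep . "$tmp/file1.txt" > "$tmp/patterns.txt"
exact_hits=$(grep -Fxf "$tmp/patterns.txt" "$tmp/file2.txt")
printf '%s\n' "$exact_hits"

rm -rf "$tmp"
```

Here the first grep reports 3 matching lines, while the second prints only AAA and GGG. How well grep -F copes with a very large pattern file depends on the implementation, so it is worth trying on a subset of the real data first.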