Tags: regex, scripting, grep, bioinformatics, string-search

Is there a good way to find exact matches of an extremely long string (~500 characters) in a CSV file a couple of megabytes in size?


I'm trying to find a match for a ~500-character DNA sequence in a CSV file, a few megabytes in size, that contains many different sequences. Each sequence is preceded by some metadata that I would also like to retrieve. Each sequence and its metadata take up exactly one line. I've tried

grep -B 1 "extremelylongstringofDNATACGGCATAGAGGCCGAGACCTAGGATTAACGTTACTGACGAT" csvfile.csv

However, that returns "filename too long".

An interesting and frustrating thing I ran into: when I tried to get the line count of the CSV file using

wc -l csvfile.csv

it returned

0 csvfile.csv

And without the -l flag, it returned

0  161410 41507206 csvfile.csv

This is the result even after I added a line between the end of each sequence and the start of the following metadata of the next sequence.


Solution

  • The file had CR-only (classic Mac) line terminators. GNU tools did not detect any line endings and therefore read the whole file as one huge line. I solved the issue by running mac2unix on the file, which converts the CR terminators into the LF line endings that GNU tools expect.

    Thanks to Etan Reisner for providing the hint.
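
For anyone without mac2unix installed, here is a minimal sketch of the same idea using tr. It is not the exact command from the answer above, only an equivalent under the assumption that the file contains CR bytes purely as line terminators. The file names demo_cr.csv and demo_lf.csv and the short sequences are made up for illustration.

```shell
# Build a tiny demo file with CR-only line endings (hypothetical data,
# standing in for the real csvfile.csv).
printf 'meta1,ACGTACGT\rmeta2,TTGACCAT\rmeta3,GGGCCCAA\r' > demo_cr.csv

# wc -l counts LF bytes; a CR-only file has none, so it reports zero lines.
wc -l < demo_cr.csv

# Translate every CR into LF. Assumes CR never appears inside a field.
tr '\r' '\n' < demo_cr.csv > demo_lf.csv

# Each record now sits on its own line, so the count is three.
wc -l < demo_lf.csv

# -F treats the pattern as a fixed string instead of a regular expression,
# which is all an exact sequence match needs.
grep -F 'TTGACCAT' demo_lf.csv
```

On the real file the same tr step would read from csvfile.csv and write to a new file, after which grep can match line by line as originally intended.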