The title may sound like nonsense, but let me explain. I need to filter a text file, and the operations I need to perform are very basic. The file in question is this one: http://gdac.broadinstitute.org/runs/analyses__2014_10_17/reports/cancer/BRCA-TP/Mutation_Assessor/BRCA-TP.maf.annotated
My first task is this: find the Tumor_Sample_Barcode column in the data file. As you can see, every entry in that column has a format like this: TCGA-02-0001-01C-01D-0182-01
The two characters before the "C" are what matter here. In the example, these characters are "01". I want to keep the rows that have "01" in that position; rows with any other pair of characters there should be removed.
If the file were not 56.2 MB, I could handle it easily in MATLAB. However, when I tried to split the file into columns in MATLAB with the following line, I got an error.
[numData,textData,rawData] = xlsread('BRCA-TP.maf.annotated.csv');
Even though I set MATLAB's Java heap memory to its maximum, I still get an error saying there is not enough memory to perform this task when I run the line from the editor.
I looked for alternatives. JMP might help, but I have no experience with that software, so even a basic operation like the one I described above could be painful for me.
Is there a way to do the operation described above in MATLAB? If not, can you help me figure out how to write a JMP script that does it?
This can be done with a simple "awk" command:
awk '$16 ~ /....-..-....-01C-...-....-../' BRCA-TP.maf.annotated > BRCA-TP.maf.annotated.filtered
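The $16 tells awk to look at the 16th column, and the text between the slashes is a regular expression in which each dot matches any single character. Note that awk splits fields on whitespace by default, so if the file is tab-separated and some fields are empty or contain spaces, you may need to add -F'\t' for column 16 to line up with Tumor_Sample_Barcode.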
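As a variation on the command above, here is a minimal sketch that splits on tabs only and also keeps the header line. It assumes the file is tab-separated with the column names on line 1, which I have not verified against the real file. The sample file, its name, and its rows are made up for illustration, and the barcode sits in column 2 of the sample rather than column 16.

```shell
# Build a tiny tab-separated sample file (hypothetical data, not taken from the real file).
printf 'Hugo_Symbol\tTumor_Sample_Barcode\n' > sample.maf
printf 'GENE1\tTCGA-02-0001-01C-01D-0182-01\n' >> sample.maf
printf 'GENE2\tTCGA-02-0002-11C-01D-0182-01\n' >> sample.maf

# -F'\t'  : split fields on tabs only, so empty fields or spaces do not shift the columns
# NR == 1 : always keep the first line (assumed to be the header)
# $2 ~ /./: keep rows whose barcode has "01C" in the fourth dash-separated group
#           (use $16 instead of $2 for the real file, as in the command above)
awk -F'\t' 'NR == 1 || $2 ~ /^....-..-....-01C-...-....-..$/' sample.maf > sample.filtered.maf

# Show the result.
cat sample.filtered.maf
```

The ^ and $ anchors make the pattern match the whole field rather than any part of it. If the real file has no header line, drop the NR == 1 condition.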
"awk" is available on any Unix-like operating system, such as Mac OS X and Ubuntu, but if you're running Windows you'd have to download and install Cygwin or a similar utility.