r dataframe genome

Replacing factor levels more efficiently in a huge file


I have a file with 800000 rows and 13000 columns. The file looks like:

        ID1 ID2 ID3 ID4 ID5
SNP1    AA  AA  AB  AA  BB
SNP2    AB  AA  BB  AA  AA
SNP3    BB  BB  BB  AB  BB
SNP4    AA  AA  BB  BB  AA
SNP5    AA  AA  AA  AA  AA

I want to replace the letters with numbers (AA = 0, AB = 1 and BB = 2). What I have done is: data[data=="AA"] = 0. It seems to work fine on a small example, but it doesn't seem to do the job on the big file: it has taken hours. Is there a more efficient way to do it? Thank you very much. Paula.


Solution

  • The file is likely too large to load into R in one go, unless you use scan, which overcomplicates things IMO. This job is better handled with GNU utilities.

    If you're on Windows, install MSYS:

    http://www.mingw.org/wiki/Getting_Started

    Then use sed to replace the text (a single sed call can apply all four substitutions, so cat and the extra pipes are not needed):

    sed -e 's/\bAA\b/0/g' -e 's/\bAB\b/1/g' -e 's/\bBA\b/1/g' -e 's/\bBB\b/2/g' <filename> > <newfile>
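
Before running this on the full file, it may be worth trying the substitutions on a tiny sample. The sketch below assumes GNU sed (the \b word boundary is a GNU extension) and a tab-separated layout like the one in the question; sample.txt and sample_coded.txt are hypothetical file names used only for illustration.

```shell
# Build a tiny tab-separated sample shaped like the data in the question.
# sample.txt and sample_coded.txt are hypothetical names for this sketch.
printf '\tID1\tID2\tID3\n'  >  sample.txt
printf 'SNP1\tAA\tAB\tBB\n' >> sample.txt
printf 'SNP2\tBA\tAA\tBB\n' >> sample.txt

# One sed process applies all four substitutions in a single pass.
# \b (word boundary) is a GNU sed extension, so this assumes GNU sed.
sed -e 's/\bAA\b/0/g' -e 's/\bAB\b/1/g' -e 's/\bBA\b/1/g' -e 's/\bBB\b/2/g' \
    sample.txt > sample_coded.txt

# Show the recoded sample for a visual check.
cat sample_coded.txt
```

The header row and the SNP names should come through unchanged, since none of them contains AA, AB, BA or BB as a whole word.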
    

    Edit:

    If you must use R, you will likely need to read the file line by line: it contains roughly 10 billion entries (800,000 rows × 13,000 columns), and at about 3 characters each that is a very large dataset indeed, on the order of 30 GB of text.

    See this SO thread for reading a file line by line:

    reading a text file in R line by line

    However, I suspect this will be very slow.
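
As a rough check on the size estimate in the edit above, the arithmetic can be reproduced with shell integer arithmetic. The row and column counts come from the question; the figure of 3 bytes per entry (two letters plus one separator) is an assumption, and the real file may differ somewhat.

```shell
# Figures from the question: 800000 rows and 13000 columns.
rows=800000
cols=13000
# Assumption: 3 bytes per entry (two letters plus one separator character).
bytes_per_entry=3

entries=$((rows * cols))
total_bytes=$((entries * bytes_per_entry))

echo "entries:     $entries"
echo "total bytes: $total_bytes"
# Integer division, so the result is rounded down to whole GiB.
echo "approx GiB:  $((total_bytes / 1024 / 1024 / 1024))"
```

Under these assumptions the file is well beyond what most machines can hold in memory as an R character matrix, which is why a streaming tool or line-by-line processing is suggested.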