I have a file with 800000 rows and 13000 columns. The file looks like:
ID1 ID2 ID3 ID4 ID5
SNP1 AA AA AB AA BB
SNP2 AB AA BB AA AA
SNP3 BB BB BB AB BB
SNP4 AA AA BB BB AA
SNP5 AA AA AA AA AA
I want to replace the letters with numbers (AA = 0, AB = 1 and BB = 2). What I have done is: data[data=="AA"] = 0. It seems to work fine on a small example, but it doesn't seem to do the job on the big file: it has been running for hours. Is there a more efficient way to do it? Thank you very much. Paula.
The file is likely too large for R unless you use scan, which overcomplicates things IMO. This is a job better handled using GNU utilities.
If you're on Windows, install MSYS:
http://www.mingw.org/wiki/Getting_Started
Then use sed, as mentioned, to replace the text:
sed -e "s/\bAA\b/0/g" -e "s/\bBA\b/1/g" -e "s/\bAB\b/1/g" -e "s/\bBB\b/2/g" <filename> > <newfile>
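Before running this on the full file, it may be worth trying the substitutions on a small sample. The sketch below assumes GNU sed (the \b word boundary is a GNU extension) and uses a made-up temporary file, genotypes.txt, that mirrors the layout in the question. The file names and the tmpdir variable are only for illustration.

```shell
# Hypothetical sample file mirroring the layout shown in the question
tmpdir=$(mktemp -d)
cat > "$tmpdir/genotypes.txt" <<'EOF'
ID1 ID2 ID3 ID4 ID5
SNP1 AA AA AB AA BB
SNP2 AB AA BB AA AA
SNP3 BB BB BB AB BB
EOF

# All four substitutions are passed to a single sed process, so the file
# is streamed once; \b (GNU sed) keeps the match to whole genotype tokens
sed -e 's/\bAA\b/0/g' -e 's/\bAB\b/1/g' -e 's/\bBA\b/1/g' -e 's/\bBB\b/2/g' \
    "$tmpdir/genotypes.txt" > "$tmpdir/recoded.txt"

# Show the recoded sample
cat "$tmpdir/recoded.txt"
```

On this sample the header line is left untouched and the SNP1 row becomes "SNP1 0 0 1 0 2". Since sed processes its input line by line, memory use should stay small regardless of file size.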
Edit:
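If you must use R, you will likely need to read the file line by line: 800,000 rows × 13,000 columns is about 10 billion entries, and at roughly 3 characters each (two letters plus a separator) that is on the order of 30 GB of text, a very large dataset indeed!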
See the SO thread "reading a text file in R line by line" for how to read a file line by line.
However, I suspect this will be very slow.