What's the fastest and most memory-efficient way to import a large tab-delimited file into R? The file in question (unitigs.rtab) is around 27GB, and I need the entire file imported into R so I can eventually run a genomics tool on the whole dataset. The file consists of 2806 rows (genome names) and 5682556 columns (unitig names), with a binary value indicating the presence or absence of each unitig in each genome.
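For scale, a rough estimate of what the dense matrix alone should occupy in memory, assuming it is held as R's 4-byte integers:

# Approximate in-memory size of a dense 2806 x 5682556 integer matrix
2806 * 5682556 * 4 / 1024^3
# ~59.4 GiB, before any per-column overhead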
An example/subset of the unitig file, showing the first 10 lines and the first 5 columns:
head -n 10 "unitigs.rtab" | cut -f 1-5
Unitig | AAAAGTTCGATTTATTCAACAACGCATG | ATCATTAAGGAAGGTGCGAATAAGCGAGA | ACGAAATCTTATTTAAACAAAGCCTGCT | CGAAATCTGATTTATTCAAAGCCACGCC |
---|---|---|---|---|
Genome_1000 | 0 | 0 | 0 | 0 |
Genome_1001 | 0 | 0 | 0 | 0 |
Genome_1007 | 0 | 0 | 0 | 0 |
Genome_1022 | 0 | 0 | 0 | 0 |
Genome_1024 | 0 | 0 | 0 | 0 |
Genome_1095 | 0 | 0 | 0 | 0 |
Genome_1097 | 0 | 0 | 0 | 0 |
Genome_1116 | 0 | 0 | 0 | 0 |
Genome_1117 | 0 | 0 | 0 | 0 |
I have tried importing the file using fread, but even with 925GB of memory and 8 CPUs I run into the error below. Is there a memory-efficient way to import this large unitigs.rtab file into R?
The fread call used to import unitigs.rtab in scriptA.R:
library(data.table)
unitig_file <- fread("unitigs.rtab", verbose = TRUE)
Error:
OpenMP version (_OPENMP) 201511
omp_get_num_procs() 8
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 2147483647
omp_get_max_threads() 8
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 4 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=8, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 0
0/1 column will be read as integer
[02] Opening the file
Opening file unitigs.rtab
File opened, size = 27.27GB (29279214177 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Unitig_sequence AAAAGTTCGATTTA>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=0x9 with 100 lines of 5682556 fields using quote rule 0
Detected 5682556 columns on line 1. This line is either column names or first data row. Line starts as: <<Unitig_sequence AAAAGTTCGATTTA>>
Quote rule picked = 0
fill=false and the most number of columns found is 5682556
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (29279214176 bytes from row 1 to eof) / (2 * 1423294172 jump0size) == 10
Type codes (jump 000) : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555 Quote rule 0
Type codes (jump 010) : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555 Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 1062 sample rows
=====
Sampled 1062 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 28981067180
Line length: mean=11365124.05 sd=-nan min=11365116 max=11365134
Estimated number of rows: 28981067180 / 11365124.05 = 2551
Initial alloc = 2806 rows (2551 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555
[10] Allocate memory for the datatable
Allocating 5682556 column slots (5682556 - 0 dropped) with 2806 rows
[11] Read the data
jumps=[0..2), chunk_size=14490533590, total_size=28981067180
*** caught segfault ***
address 0x7f6115d70ebe, cause 'memory not mapped'
Traceback:
1: fread("unitigs.rtab", verbose = TRUE)
An irrecoverable exception occurred. R is aborting now ...
job3362275/slurm_script: line 12: 14265 Segmentation fault (core dumped) Rscript scriptA.R
I would suggest two solutions.

One is to split the file into slices of, say, 10000 columns with commands like cut -f 1-10000, and read the slices separately, as in the sketch below.
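A minimal sketch of that approach, assuming cut is available and using fread's cmd argument to stream each slice (the slice width is illustrative; tune it to your memory):

library(data.table)

n_cols = 5682556   # total number of columns in unitigs.rtab
slice  = 10000     # columns per slice

# Keep the genome-name column once
genomes = fread(cmd = "cut -f 1 unitigs.rtab")

for (s in seq(2, n_cols, by = slice)) {
  e  = min(s + slice - 1, n_cols)
  # cut streams only columns s..e into fread
  dt = fread(cmd = sprintf("cut -f %d-%d unitigs.rtab", s, e))
  # ... process this slice before moving on to the next ...
}

Note that every cut pass rescans the whole 27GB file, so this trades run time for memory.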
The other is to first export the file into a binary format using my filematrix package, with code like this (I fed it the sample data):
library(filematrix)
# Convert into a binary file (it becomes transposed)
fm = fm.create.from.text.file(
    textfilename = 'unitigs.rtab',
    filenamebase = 'binaryfile',
    skipRows = 1,       # skip the header row
    skipColumns = 1,    # skip the genome-name column
    sliceSize = 3,      # rows read per pass; use a larger value for the real file
    delimiter = '\t',
    type = 'integer',
    size = 1)           # one byte per value on disk
> Rows read: 3
> Rows read: 6
> Rows read: 9
> Rows read: 9 done.
# Check dimensions
dim(fm)
> [1] 4 9
# Extract first two columns (rows of the original file)
fm[, 1:2]
> [,1] [,2]
> [1,] 0 0
> [2,] 0 0
> [3,] 0 0
> [4,] 0 0
# Convert to an R matrix
mat = as.matrix(fm)
close(fm)
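For the full 2806 x 5682556 dataset, note that as.matrix(fm) would again pull the whole matrix into RAM. The filematrix can instead be reopened later and processed in chunks of columns (remember the stored matrix is transposed, so its columns are the genomes of the original file). A minimal sketch, with an illustrative chunk size:

library(filematrix)

fm   = fm.open('binaryfile')   # reopen the binary file; nothing is loaded yet
nc   = dim(fm)[2]              # number of columns (genomes, after transposition)
step = 100                     # columns per chunk; tune to available memory

for (s in seq(1, nc, by = step)) {
  e = min(s + step - 1, nc)
  chunk = fm[, s:e]            # only this chunk is read from disk
  # ... process chunk ...
}
close(fm)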