I am a novice at Lua (I use it for the Torch7 framework). I have an input feature file (a text file) that is about 1.4 GB in size. The simple io.open function throws a 'not enough memory' error when trying to open this file. While browsing through the user groups and documentation, I see that this is possibly a Lua limitation. Is there a workaround for this? Or am I doing something wrong in reading the file?
local function parse_file(path)
    -- read file
    local file = assert(io.open(path, "r"))
    local content = file:read("*all")
    file:close()
    -- split on start/end tags
    local sections = string.split(content, start_tag)
    for j = 1, #sections do
        sections[j] = string.split(sections[j], '\n')
        -- remove the end_tag
        table.remove(sections[j], #sections[j])
    end
    return sections
end
local train_data = parse_file(file_loc .. '/' .. train_file)
EDIT: The input file I am trying to read contains image features that I would like to train my model on. The file is laid out in an ordered fashion ({start-tag} ...contents... {end-tag}{start-tag} ... and so on), so it is fine if I can load these sections (start-tag to end-tag) one at a time. However, I would still want all of these sections to be loaded into memory.
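For illustration, a section-at-a-time version of the parser might look something like the sketch below (this assumes start_tag and end_tag each appear on their own line, which may not match the real format). Note that the sections still end up on the Lua heap, so this alone does not get around the memory limit:

local function parse_file_by_section(path)
    local sections = {}
    local current
    for line in io.lines(path) do
        if line == start_tag then
            current = {}                       -- begin a new section
        elseif line == end_tag then
            table.insert(sections, current)    -- close off the current section
            current = nil
        elseif current then
            table.insert(current, line)        -- a content line inside a section
        end
    end
    return sections
end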
It turns out that the simplest way around the large-file loading problem is to upgrade Torch to use Lua 5.2 or greater, as suggested by the Torch developers on the torch7 Google group:
cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh
The memory limits no longer apply from Lua 5.2 onwards! I have tested this and it works just fine!
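As an optional sanity check (not part of the original instructions), you can confirm which interpreter the rebuilt th binary runs by inspecting the standard _VERSION and jit globals:

-- run inside the th REPL after the rebuild
print(_VERSION)    -- "Lua 5.2" for the plain-Lua build (LuaJIT reports "Lua 5.1")
print(jit == nil)  -- true once LuaJIT (and its heap limit) is out of the picture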
Reference: https://groups.google.com/forum/#!topic/torch7/fi8a0RTPvDo
Another possible solution (more elegant, and similar to what @Adam suggested in his answer) is to read the file line by line and use Tensors or tds to store the data, since these allocate memory outside of the LuaJIT heap. A code sample is below, thanks to Vislab.
require 'torch'
local ffi = require 'ffi'

-- this function loads a file line by line to avoid memory issues
local function load_file_to_tensor(path)
    -- initialize a tensor for the file
    local file_tensor = torch.CharTensor()

    -- First, determine the maximum size of the tensor so it can be allocated in one sweep,
    -- where columns correspond to characters and rows correspond to lines in the text file.
    --[[ get number of rows/columns ]]
    local file = io.open(path, 'r') -- open file
    local max_line_size = 0
    local number_of_lines = 0
    for line in file:lines() do
        -- track the longest line; the +1 leaves room for the terminating zero byte
        max_line_size = math.max(max_line_size, #line + 1)
        -- increment the number of lines counter
        number_of_lines = number_of_lines + 1
    end
    file:close() -- close file

    -- Now that the dimensions are known, we just have to allocate memory for the tensor
    -- (as long as there is enough RAM for it).
    file_tensor = file_tensor:resize(number_of_lines, max_line_size):fill(0)
    local f_data = file_tensor:data()

    -- The only thing left to do is to fetch the data into the tensor.
    -- Open the file again and fill the tensor using ffi.
    file = io.open(path, 'r') -- open file
    for line in file:lines() do
        -- copy each line (plus its terminating zero byte) into the tensor, row by row
        ffi.copy(f_data, line)
        f_data = f_data + max_line_size
    end
    file:close() -- close file

    return file_tensor
end
Reading data back from this tensor is simple and quick. For example, to read the 10th line of the file (which will be in the 10th row of the tensor), you can simply do the following:
local line_string = ffi.string(file_tensor[10]:data()) -- this will convert into a string var
A word of warning: this will occupy more space in memory, and may not be optimal in cases where a few lines are much longer than the others, since every row is padded to the length of the longest line. But if memory is not an issue, this can be disregarded, because loading tensors from files into memory is blazingly fast and might save you some grey hairs in the process.
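If the per-row padding is a concern, the tds package mentioned above can hold the lines outside the LuaJIT heap without any padding. A minimal sketch (the helper name load_file_to_vec is just for illustration, and tds is assumed to be installed, e.g. via luarocks install tds):

local tds = require 'tds'

-- load each line into a tds.Vec, whose storage lives outside the LuaJIT heap
local function load_file_to_vec(path)
    local lines = tds.Vec()
    for line in io.lines(path) do
        lines:insert(line)
    end
    return lines
end

local lines = load_file_to_vec(file_loc .. '/' .. train_file)
print(lines[10]) -- the 10th line of the file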