I have a raw text file, with the size of 8.1GB. The input data is very straight forward: Lab_A (string), Lab_B (string), Distance (float)
I was trying to load the data into a table, using LOAD DATA INFILE, but the drive ran out of space.
The destination table had the following format:
Id (INT), Lab_A (VARCHAR), Lab_B (VARCHAR), Distance (FLOAT). With a primary key of Id and an index of (Lab_A + Distance).
Create statement below:
CREATE TABLE 'warwick_word_suite'.'distances' (
'id' INT NOT NULL AUTO_INCREMENT,
'label1' VARCHAR(45) NOT NULL,
'label2' VARCHAR(45) NOT NULL,
'distance' FLOAT NOT NULL,
PRIMARY KEY ('id'),
INDEX 'LABEL_INDEX' ('label1' ASC, 'distance' ASC));
The drive had 50GB and ran out of space. Given 10GB reserved for the system, I am assuming the table was requesting more than > 32GB for the table.
My question is:
Should I simply order a bigger drive for my database server?
EDIT: I tracked down the data hog to "ibdata1", stored in /var/lib/mysql. This file is taking up 30.3GB.
Double trouble.
InnoDB takes 2x-3x what the raw data takes. This is a crude approximation; there are many factors.
ibdata1
is the default place to put the table. Having tried to put the table there, that file will not shrink. This can be a problem. It would have been better to have innodb_file_per_table = ON
before trying to load the file. Then the table would have gone into a separate .ibd
file, and upon failure, that file would have vanished. As it is, you are low on disk space with no simple way to recover it. (Recovery includes dumping all the other InnoDB tables, stopping mysqld, removing ibdata1, restarting, and then reloading the other tables.
Back to the ultimate problem... How to use the data. First, can we see a sample (a few rows) of the data. There may be some clues. How many rows in the table (or lines in the file)?
This may be a case for loading into MyISAM instead of InnoDB; the size for that table will be closer to 8.1GB, plus two indexes, which may add another 5-10GB. Still unpleasantly tight.
Normalizing the lab names would probably be a big gain. Suppose you have 10K labs and 100M distances (every lab to every other lab). Half of those are redundant? Normalizing lab names would save maybe 50 bites per row -- perhaps half the space?
Or you could get more disk space.
Ponder which suggestion(s) of the above you want to tackle; the let us know what you still need help with.