
liblinear memory cost is too high


I have run liblinear to train a model file.

The Python code is here:

from liblinearutil import *  # liblinear's Python interface

y, x = svm_read_problem(vector_file)   # read labels and features
prob = problem(y, x)
param = parameter('-s 2 -c 1')         # L2-regularized L2-loss SVC (primal)
m = train(prob, param)
save_model(model_file, m)

The problem is that when vector_file is about 247 MB, running liblinear costs about 3.08 GB of memory in total. Why does it cost so much?

And in my project, vector_file will be as large as 2 GB. How can I use liblinear to train a problem of that size and still get a model file?


Solution

  • Okay, I found out what the problem is.

    When reading the problem, the Python interface of liblinear uses:

    def svm_read_problem(data_file_name):
        prob_y = []  # one label per instance
        prob_x = []  # one {index: value} dict per instance

        for line in open(data_file_name):
            line = line.split(None, 1)
            # In case an instance with all zero features
            if len(line) == 1: line += ['']
            label, features = line
            xi = {}
            for e in features.split():
                ind, val = e.split(":")
                # every feature becomes a boxed int key and a boxed float value
                xi[int(ind)] = float(val)
            prob_y += [float(label)]
            prob_x += [xi]

        return (prob_y, prob_x)
    

    In Python, an int costs 28 bytes and a float costs 24 bytes, which was far beyond my imagination. Every index:value pair is therefore stored as a boxed int and a boxed float inside a per-instance dict, so the in-memory size ends up many times the size of the text file.
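
    These sizes are easy to check (a quick sketch; the numbers in the comments are what a 64-bit CPython 3 build reports and can differ on other builds):

    import sys

    # Per-object cost of the boxed numbers created for every feature
    # (observed on a 64-bit CPython 3 build; other builds may differ).
    print(sys.getsizeof(12345))   # 28 bytes for a small int
    print(sys.getsizeof(0.5))     # 24 bytes for a float
    print(sys.getsizeof({}))      # a dict has overhead before any entry is added

    # Each "index:value" pair therefore costs tens of bytes of boxed
    # objects plus a dict slot, versus only a few characters in the
    # input file, which explains the roughly tenfold blow-up.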

    I will report these cases to the author.
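
    In the meantime, one way to sidestep the Python reader entirely is to call the compiled train binary that ships with liblinear, which parses the file in C into packed arrays instead of per-feature Python objects. A minimal sketch, assuming the binary has been built and using hypothetical file names in place of the paths from the question:

    import subprocess

    vector_file = "train.vec"    # hypothetical paths for illustration
    model_file = "train.model"

    # Train with the same options as above; the C code never builds
    # the per-feature Python objects shown in svm_read_problem.
    subprocess.run(["./train", "-s", "2", "-c", "1", vector_file, model_file],
                   check=True)

    The resulting model file can then be loaded back through the Python interface with load_model(model_file) when it is needed for prediction.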