python, pandas, scikit-learn, svmlight

How to load SVMlight format files in compressed form to pandas?


I have data in an SVMlight-style format (label feature1:value1 feature2:value2 ...), like this:

talk.politics.guns a:12 about:1 abrams:1 absolutely:1
talk.politics.mideast I:4 run:10 go:3

I tried sklearn.datasets.load_svmlight_file, but it doesn't seem to work with categorical string features and labels. I am trying to store the data in a pandas DataFrame. Any pointers would be appreciated.


Solution

  • You can do it by hand. Here is one way to convert your file into a DataFrame:

    import os
    
    import pandas as pd
    
    # open() does not expand "~", so expand it explicitly
    svmformat_file = os.path.expanduser("~/svmformat_file_sample")
    
    # Read all lines into a list
    with open(svmformat_file, mode="r") as fp:
        svmformat_list = fp.readlines()
    
    # For each line, save the label and the key:value pairs to a dict
    pandas_list = []
    for line in svmformat_list:
        line_split = line.split()  # Splits on any whitespace and drops '\n'
        if not line_split:
            continue  # Skip blank lines
    
        line_dict = {"label": line_split[0]}
        for col in line_split[1:]:
            key, value = col.split(":", 1)
            line_dict[key] = float(value)  # Store values as numbers, not strings
    
        pandas_list.append(line_dict)
    

    The resulting DataFrame for your example file:

    pd.DataFrame(pandas_list)
    

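If "compressed" in the title means sparse storage, one possible follow-up is to keep the feature columns in pandas' sparse dtype, since most word features are missing on any given row. The sketch below is only an illustration: the function name `svmlight_lines_to_sparse_frame` is hypothetical, and it assumes every token after the label is a well-formed `key:value` pair with a numeric value and that no feature is literally named `label`.

```python
import numpy as np
import pandas as pd


def svmlight_lines_to_sparse_frame(lines):
    """Parse 'label key:value ...' lines into a DataFrame with sparse feature columns.

    Hypothetical helper: assumes numeric values and no feature called 'label'.
    """
    rows = []
    for line in lines:
        tokens = line.split()  # split on whitespace, dropping the trailing newline
        if not tokens:
            continue  # skip blank lines
        row = {"label": tokens[0]}
        for token in tokens[1:]:
            # Split on the last colon so the key keeps any earlier colons
            key, _, value = token.rpartition(":")
            row[key] = float(value)
        rows.append(row)

    frame = pd.DataFrame(rows)
    # Store every feature column as sparse floats, with NaN as the fill value
    sparse_dtype = pd.SparseDtype("float", np.nan)
    feature_cols = [c for c in frame.columns if c != "label"]
    return frame.astype({c: sparse_dtype for c in feature_cols})


sample = [
    "talk.politics.guns a:12 about:1 abrams:1 absolutely:1\n",
    "talk.politics.mideast I:4 run:10 go:3\n",
]
df = svmlight_lines_to_sparse_frame(sample)
print(df.dtypes)
```

Features absent from a row stay as NaN and are not stored explicitly. The function takes any iterable of lines, so a file object opened as in the answer above can be passed in directly.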