python normalization feature-scaling

Is there a function to normalize strings and convert them to integers/floats?


I have multiple lists of features which are strings that I want to analyze. That is, e.g.:

[["0.5", "0.4", "disabled", "0.7", "disabled"], ["feature1", "feature2", "feature4", "feature1", "feature3"]]

I know how to convert strings like "0.5" to floats, but is there a way to "normalize" such lists to integer or float values (each list independently in my case)? I would like to get something like this:

[[2, 1, 0, 3, 0], [0, 1, 3, 0, 2]]

Does anyone know how to achieve this? Unfortunately, I couldn't find anything related to this problem yet.


Solution

  • Use a dictionary and a counter to give IDs to new values and remember past IDs:

    import itertools, collections
    
    def norm(lst):
        d = collections.defaultdict(itertools.count().__next__)
        return [d[s] for s in lst]
    
    lst = [["0.5", "0.4", "disabled", "0.7", "disabled"],
           ["feature1", "feature2", "feature4", "feature1", "feature3"]]
    print(list(map(norm, lst)))
    # [[0, 1, 2, 3, 2], [0, 1, 2, 0, 3]]
    

    Or by enumerating sorted unique values; note, however, that "disabled" sorts after the numeric strings, because the strings are compared lexicographically:

    def norm_sort(lst):
        d = {x: i for i, x in enumerate(sorted(set(lst)))}
        return [d[s] for s in lst]
    
    print(list(map(norm_sort, lst)))
    # [[1, 0, 3, 2, 3], [0, 1, 3, 0, 2]]
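
Neither variant reproduces the exact output asked for in the question, where "disabled" maps to 0 and the numeric strings are ordered by value. A minimal sketch of one way to get there, assuming non-numeric strings should rank before numeric ones (the names sort_key and norm_numeric are made up for illustration):

```python
def sort_key(s):
    # Assumption: non-numeric strings rank first (group 0), ordered alphabetically;
    # numeric strings follow (group 1), ordered by their float value.
    try:
        return (1, float(s), s)
    except ValueError:
        return (0, 0.0, s)

def norm_numeric(lst):
    # Enumerate the unique values in the custom order and look each item up.
    d = {x: i for i, x in enumerate(sorted(set(lst), key=sort_key))}
    return [d[s] for s in lst]

lst = [["0.5", "0.4", "disabled", "0.7", "disabled"],
       ["feature1", "feature2", "feature4", "feature1", "feature3"]]
print(list(map(norm_numeric, lst)))
# [[2, 1, 0, 3, 0], [0, 1, 3, 0, 2]]
```

Distinct spellings of the same number, such as "0.5" and "0.50", still get separate IDs here, since the dictionary is keyed on the original strings.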