python machine-learning math logic artificial-intelligence

Reduce a list of tuples without losing information for ML model


I have a streaming flow in which each record contains a list of tuple4 values. The list can have one or more elements. Here are some examples:

1) "st_li_list":[{"f0":3,"f1":4,"f2":1,"f3":12,"arity":4},{"f0":1,"f1":3,"f2":1,"f3":3,"arity":4},{"f0":15,"f1":12,"f2":1,"f3":180,"arity":4}]}'

2) "st_li_list":[{"f0":1,"f1":24,"f2":8,"f3":24,"arity":4},{"f0":50,"f1":11,"f2":1,"f3":550,"arity":4},{"f0":2,"f1":10,"f2":3,"f3":20,"arity":4},{"f0":15,"f1":10,"f2":1,"f3":150,"arity":4},{"f0":4,"f1":6,"f2":2,"f3":24,"arity":4},{"f0":1,"f1":3,"f2":1,"f3":3,"arity":4}]}'

3) "st_li_list":[{"f0":15,"f1":12,"f2":1,"f3":180,"arity":4}]}'

As you can see, list 1 has 3 elements, list 2 has 6, and list 3 has only one. I would like some standardization or encoding that lets me build a vector that acts as a sort of "summary" but always has the same size, so that I can feed it to an ML model without losing any information. The fact that list 1 has 3 elements is definitely useful information for the "summary vector", which would probably have "3" as its first element, followed by...? (Any length is fine, so even 100 elements is okay.)

I don't want to set up a specific range for each parameter, because that would impose a range that could turn out to be wrong.

Any super clever solution on how this can be achieved in Python would be super appreciated!! There are maybe some algos around that can do this?


Solution

  • Either you feed the whole sequence to a neural network, or you need to summarize each sequence as a fixed-length vector of features. How to do this depends on what the information represents. In general you could use:

    • number of values
    • minimum/maximum value
    • mean/median
    • standard deviation
    • quantiles

    But if, for example, the sequences represent an evolution, it may make sense to compute the average or overall rate of increase/decrease. If they represent some kind of coordinates between objects, it may make sense to compute average or overall distances. And so on.
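The generic statistics listed above can be sketched with NumPy as follows. This is only a minimal illustration, not a definitive recipe: the function name `summarize`, the choice of the 25% and 75% quantiles, and the ordering of the output are assumptions made here, and the field names `f0` to `f3` are taken from the records in the question. Note that such a summary is lossy by nature; it keeps the element count and per-field statistics, not the original tuples.

```python
import numpy as np

FIELDS = ("f0", "f1", "f2", "f3")  # field names as they appear in the question's records
QUANTILES = (0.25, 0.75)           # assumed choice; any fixed set of quantiles works


def summarize(records):
    """Turn a non-empty list of tuple4 dicts into a fixed-length feature vector.

    Layout: [count, min x4, max x4, mean x4, median x4, std x4, quantiles x8],
    i.e. 29 values regardless of how many records are in the list.
    """
    # Rows are list elements, columns are the four fields.
    values = np.array([[r[f] for f in FIELDS] for r in records], dtype=float)
    per_field = np.concatenate([
        values.min(axis=0),
        values.max(axis=0),
        values.mean(axis=0),
        np.median(values, axis=0),
        values.std(axis=0),  # population std; 0 for a single-element list
        np.quantile(values, QUANTILES, axis=0).ravel(),
    ])
    return np.concatenate([[len(records)], per_field])


list_1 = [
    {"f0": 3, "f1": 4, "f2": 1, "f3": 12, "arity": 4},
    {"f0": 1, "f1": 3, "f2": 1, "f3": 3, "arity": 4},
    {"f0": 15, "f1": 12, "f2": 1, "f3": 180, "arity": 4},
]
list_3 = [{"f0": 15, "f1": 12, "f2": 1, "f3": 180, "arity": 4}]

vec_1 = summarize(list_1)
vec_3 = summarize(list_3)
print(vec_1.shape, vec_3.shape)  # both (29,)
```

Both lists map to vectors of the same length, with the number of elements as the first entry. Because the raw values have very different scales (for instance `f3` reaches 550 in the examples), it is probably worth standardizing the resulting features across the training set before feeding them to a model, rather than fixing a range per parameter in advance.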