machine-learning, dataset, feature-engineering

Preserving order information in a single feature


The following is one column of a dataset that I'm trying to feature engineer:

+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1  |
+---+-----------------------------+
|2  |event3,event2                |
+---+-----------------------------+

There are 3 possible event types, and the order in which they arrived is saved as a string. I've transformed the events column like so:

+---+------+------+------+
|Id |event1|event2|event3|
+---+------+------+------+
|1  |2     |1     |1     |
+---+------+------+------+
|2  |0     |1     |1     |
+---+------+------+------+
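
For reference, a minimal sketch of how such a count table can be produced (pandas is an assumption here; the idea is just split-and-count):

    import pandas as pd

    df = pd.DataFrame({
        'Id': [1, 2],
        'events_list': ['event1,event3,event2,event1', 'event3,event2'],
    })

    # one event per row, indexed by Id, then cross-tabulate into counts
    events = df.set_index('Id')['events_list'].str.split(',').explode()
    counts = pd.crosstab(events.index, events)
    print(counts)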

This preserves the count information but loses the order information.

Q: Is there a way to encode the order as a feature?

Update: For each row of events I calculate a score for that day, and the model should predict the future score for new daily events. In any case, both the order and the count of my events affect the daily score.

Update: My dataset contains other daily information, such as session counts, and currently my model is an LSTM digesting each row by date. I want to try to improve my prediction by adding the order info to the existing model.


Solution

  • One option is to translate/transform the string directly by creating a meaningful one-to-one mapping. In this case, preserving order is doable and has meaning.

    Here is a simple demo:

    data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']
    
    def mapper(data):
        result = []
        for d in data:
            events = d.replace(' ', '').split(',')
            v = 0
            for i, e in enumerate(events):
                # for each string: get the sum of char values,
                # normalized by their orders
                # here 100 is optional, just to make the number small
                v += sum(ord(c) for c in e) / (i + 100) 
            result.append(v)
        return result
    
    new_data = mapper(data)
    print(new_data)
    

    Output:

    [23.480727373137086, 11.8609900990099, 17.70393127548049]
    

    Although the probability of clashes is very low, there is no 100% guarantee that there will be no clashes at all on a gigantic dataset.

    Check this analysis:

    # check for clashes on huge dataset
    import random as r
    import matplotlib.pyplot as plt
    
    r.seed(2020)
    
    def ratio_of_clashes(max_events):
        MAX_DATA = 1000000
        events_pool = [','.join(['event' + str(r.randint(1, max_events))
                                 for _ in range(r.randint(1, max_events))])
                       for _ in range(MAX_DATA)]
        # print(events_pool[0:10])  # print few to see
        mapped_events = mapper(events_pool)
        return abs(len(set(mapped_events)) - len(set(events_pool))) / MAX_DATA * 100
    
    
    n_samples = range(5, 100)
    ratios = []
    for i in n_samples:
        ratios.append(ratio_of_clashes(i))
    
    plt.plot(n_samples, ratios)
    plt.title('The Trend of Clashes with Change of Number of Events')
    plt.show()
    

    [Plot: the clash ratio rising with the number of events, then flattening out]

    As a result, the fewer events or data you have, the lower the clash ratio; the ratio grows with the number of events until it hits some threshold, then it flattens out. Even then, it is not bad at all (personally, I can live with it).


    Update & Final Thoughts:

    I just noticed that you are already using an LSTM, so the order matters a great deal. In this case, I strongly suggest you encode the events into integers and then create a time series, which fits an LSTM perfectly. Follow these steps:

    1. Pre-process each string and split it into events (as I did in the example).
    2. Fit a LabelEncoder on the events and transform them into integers.
    3. Scale the result into [0, 1] by fitting a MinMaxScaler.

    You will end up with something like this:

    'event1' : 1
    'event2' : 2
    'event3' : 3
      . . .
    'eventN' : N

    and 'event1,event3,event2,event3' will become [1, 3, 2, 3], which scales to [0, 1, 0.5, 1].
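
    A minimal sketch of these three steps with scikit-learn (note: a real LabelEncoder assigns 0-based labels, so 'event1' maps to 0 rather than 1, but the idea is identical):

    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    seq = 'event1,event3,event2,event3'.replace(' ', '').split(',')

    # step 2: map each event name to an integer label
    encoded = LabelEncoder().fit_transform(seq)  # [0, 2, 1, 2]

    # step 3: scale the labels into [0, 1]
    scaled = MinMaxScaler().fit_transform(encoded.reshape(-1, 1)).ravel()
    print(scaled)  # [0.  1.  0.5 1. ]

    In practice, fit the encoder (and the scaler) once on all events in the dataset, not per row, so the mapping stays consistent across rows.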

    The LSTM is then more than capable of figuring out the order by nature. And forget about the dimensionality point: an LSTM's main job is to remember, and optionally forget, steps and the order of steps!
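
    To make this concrete, here is a hypothetical sketch (assuming Keras/TensorFlow; the layer size and target scores are made up for illustration) of feeding the encoded, padded sequences into an LSTM that predicts the daily score:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # encoded + scaled daily event sequences (variable length), as above
    sequences = [[0.0, 1.0, 0.5, 1.0], [1.0, 0.5]]
    daily_scores = np.array([23.4, 11.8])  # hypothetical targets

    # pad to a common length: the LSTM expects (batch, timesteps, features);
    # note that 0.0 padding collides with the lowest scaled label, so in
    # practice reserve 0 for padding or add a Masking layer
    X = pad_sequences(sequences, padding='post', dtype='float32')[..., np.newaxis]

    model = Sequential([
        LSTM(16, input_shape=(X.shape[1], 1)),  # illustrative size
        Dense(1),                               # regression: the daily score
    ])
    model.compile(optimizer='adam', loss='mse')
    model.fit(X, daily_scores, epochs=10, verbose=0)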