python, pattern-finding

How to find patterns between numerous causes and the result in Python?


For each instance I have a set of problems and a result, like this:

df = pd.DataFrame({
    "problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
    "results": ["A", "A", "C", "C", "A"]
})


I want to find patterns in the relationship between the problems and the result.

My first thought was Association Rule Mining, but that is more for finding patterns within the problems themselves (for example). I guess machine learning could help somehow, but I'm not interested in merely predicting the result; I'm interested in the patterns that lead to that prediction.

I would be interested in patterns like

  • Problem 1 causes result A
  • The combination of problems 4 and 5 causes result C

Any thoughts on that? As I'd implement this in Python, hints about suitable packages are welcome, too.
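
Rules of exactly this shape can in principle be mined by treating the result as just another item in each transaction, so that rules with the result as consequent fall out directly. A minimal sketch of that idea, under the assumption that the mlxtend package is acceptable (neither mlxtend nor the "P"-prefixed item names appear above; they are made up for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.DataFrame({
    "problems": [[1, 2, 3], [1, 2, 4], [1, 4, 5], [3, 4, 5], [1, 5, 6]],
    "results": ["A", "A", "C", "C", "A"]
})

# Each row becomes one transaction holding its problems plus its result.
transactions = [
    [f"P{p}" for p in problems] + [result]
    for problems, result in zip(df["problems"], df["results"])
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent itemsets, then keep rules whose consequent is a result label.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
rules = rules[rules["consequents"].apply(lambda c: c <= {"A", "C"})]
print(rules[["antecedents", "consequents", "support", "confidence"]])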

Thanks a lot!


Solution

  • I was curious and did some experimenting in TensorFlow 2.0 with Keras, based on the comment of Daniel Möller in this thread:

    Update: Make the order not matter anymore:

    To make the order not matter anymore, we need to remove the order information from our dataset. To do this, we first convert x (created in the training-data section below) to a one-hot representation, then take the max() over the position axis to squash each instance's three one-hot rows back into a single vector:

    # One-hot encode each problem number; result shape: (5, 3, 7)
    x_no_order = tf.keras.utils.to_categorical(x)
    

    This gives us a one-hot vector looking like this:

    array([[[0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0.]],

           [[0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 0., 1., 0., 0.]],

           [[0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0.]],

           [[0., 0., 0., 1., 0., 0., 0.],
            [0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0.]],

           [[0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 0., 0., 0., 1.]]], dtype=float32)
    

    Taking np.max() over that axis gives us a vector that only encodes which numbers occur, without any information about their position, looking like this:

    x_no_order.max(axis=1)
    
    array([[0., 1., 1., 1., 0., 0., 0.],
           [0., 1., 1., 0., 1., 0., 0.],
           [0., 1., 0., 0., 1., 1., 0.],
           [0., 0., 0., 1., 1., 1., 0.],
           [0., 1., 0., 0., 0., 1., 1.]], dtype=float32)
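
    If you train on these order-free vectors, the input has 7 dimensions instead of 3. A minimal adaptation of the model below might look like this (my sketch, reusing x_no_order and the y_train created in the training-data section; not part of the original answer):

    # Hypothetical variant: train on the 7-dimensional order-free vectors.
    x_train_no_order = x_no_order.max(axis=1)  # shape (5, 7)

    inp = tf.keras.layers.Input(shape=(7,))
    hidden = tf.keras.layers.Dense(20, activation="relu")(inp)
    out = tf.keras.layers.Dense(3, activation="softmax")(hidden)

    model_no_order = tf.keras.Model(inputs=[inp], outputs=[out])
    model_no_order.compile(optimizer="Nadam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])
    model_no_order.fit(x_train_no_order, y_train, epochs=100, verbose=0)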
    

    First, create the dataframe and the training data.

    That's a multiclass classification task, so I use the Tokenizer for the labels (there are surely better approaches, since it is rather meant for text):

    import tensorflow as tf
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({
        "problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
        "results": ["A", "A", "C", "C", "A"]
    })
    
    x = df['problems']
    y = df['results']
    
    # The Tokenizer assigns integer indices starting at 1: "a" -> 1, "c" -> 2
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(y)
    y_train = tokenizer.texts_to_sequences(y)

    x = np.array([np.array(i, dtype=np.int32) for i in x])
    y_train = np.array(y_train, dtype=np.int32)
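
    As an aside, since the Tokenizer is rather meant for text, a simpler alternative for the labels would be scikit-learn's LabelEncoder; a sketch under that assumption (not used in the original):

    # Hypothetical alternative: encode the labels without the text Tokenizer.
    from sklearn.preprocessing import LabelEncoder

    y_train_alt = LabelEncoder().fit_transform(df["results"])  # "A" -> 0, "C" -> 1
    # With this encoding the classes start at 0, so the softmax layer below
    # would only need 2 units and no placeholder class.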
    

    Then create the model

    input_layer = tf.keras.layers.Input(shape=(3,))
    dense_layer = tf.keras.layers.Dense(6)(input_layer)
    dense_layer2 = tf.keras.layers.Dense(20)(dense_layer)
    out_layer = tf.keras.layers.Dense(3, activation="softmax")(dense_layer2)

    model = tf.keras.Model(inputs=[input_layer], outputs=[out_layer])
    model.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    

    Train the model by fitting it

    hist = model.fit(x, y_train, epochs=100)
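
    The returned History object records the per-epoch metrics, which lets you check that the model actually fit the five examples:

    # hist.history holds the metrics recorded during training
    print(hist.history["accuracy"][-1])  # final training accuracy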
    

    Then, based on Daniel's comment, you take the sequence you want to test and mask out certain values to test their influence:

    arr = np.reshape(np.array([1, 2, 3]), (1, 3))
    print(model.predict(arr))
    arr = np.reshape(np.array([0, 2, 3]), (1, 3))
    print(model.predict(arr))
    arr = np.reshape(np.array([1, 0, 3]), (1, 3))
    print(model.predict(arr))
    arr = np.reshape(np.array([1, 2, 0]), (1, 3))
    print(model.predict(arr))
    

    This will print the following result. Keep in mind that since the tokenizer indices start at one, the first value is a placeholder, so the second value stands for "A":

    [[0.00441748 0.7981055  0.19747704]]
    [[0.00103579 0.9863035  0.01266076]]
    [[0.0031549  0.9953074  0.00153765]]
    [[0.01631758 0.00633342 0.977349  ]]
    

    There we can see that in the first case, "A" is correctly predicted with 0.7981... When we change the 3 in [1,2,3] to 0, giving [1,2,0], the model suddenly predicts "C". So the influence of the 3 in position 3 is the biggest. Putting that in a function (see the sketch below), you can use all the training data you have and build statistical metrics to analyze it further.
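
    For instance, such a function could mask each position in turn and measure how far the predicted distribution moves from the baseline. A minimal sketch (position_influence is a hypothetical helper, not part of the original approach):

    # Hypothetical helper: mask each position in turn and measure how much
    # the predicted class distribution shifts (L1 distance to the baseline).
    def position_influence(model, seq):
        base = model.predict(np.reshape(np.array(seq), (1, len(seq))), verbose=0)[0]
        influences = {}
        for i in range(len(seq)):
            masked = list(seq)
            masked[i] = 0  # mask out the problem at position i
            pred = model.predict(np.reshape(np.array(masked), (1, len(seq))),
                                 verbose=0)[0]
            influences[seq[i]] = float(np.abs(base - pred).sum())
        return influences

    print(position_influence(model, [1, 2, 3]))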

    This is just a very simple approach, but keep in mind that there is a big research field around this called sensitivity analysis. You might want to take a deeper look at that topic if you are interested.