python-3.x, logging, word2vec, word-embedding, anomaly-detection

How do I feed an array of Tokenized Sentences to Word2vec to get embeddings?


Hi all: I cannot figure out the code required to get embeddings from a word2vec model.

Here is how my df is structured (it is an Android-based log):

logDateTime | lineNum | processID | threadID | priority | app | message | eventTemplate | eventID
ts          | int     | int       | int      | str      | str | str     | str           | str

Essentially, I created a unique subset of events out of log messages and assigned a template with an associated id:

def eventCreation(df):
    # collapse variable parts of each message (numbers, booleans) into '*' to form a template
    df['eventTemplate'] = df['message'].str.replace(r'\d+', '*', regex=True)
    df['eventTemplate'] = df['eventTemplate'].str.replace('true', '*', regex=False)
    df['eventTemplate'] = df['eventTemplate'].str.replace('false', '*', regex=False)
    # assign a sequential ID (E1, E2, ...) to each unique template
    df['eventID'] = df.groupby(df.eventTemplate.tolist(), sort=False).ngroup() + 1
    df['eventID'] = 'E' + df['eventID'].astype(str)

def seqGen(arr, k):
    # yield overlapping sliding windows of length k over arr
    for i in range(len(arr)-k+1):
        yield arr[i:i+k]

import os
import pandas as pd
from nltk.tokenize import word_tokenize

#define the variables here
cwd = os.getcwd()
#create a dataframe of the logs concatenated
# (process_files, getFiles, cleanDf and featureEng are helper functions defined elsewhere)
df = pd.DataFrame.from_records(process_files(cwd, getFiles))
# call functions to establish df
cleanDf(df)
featureEng(df)
eventCreation(df)
# tokenize each event template into a list of word tokens
df['eventToken'] = df.eventTemplate.apply(lambda x: word_tokenize(x))
seq = []
eventArray = df[["eventToken"]].to_numpy()
for sequence in seqGen(eventArray, 9):
    seq.append(eventArray)

So, 'seq' ends up looking like this:

[array([['[*,com.blah.blach.blahMainblach] '],
        ['[*,*,*,com.blah.blah/.permission.ui.blah,finish-imm] '],
        ['[*,*,*,*,startingNewTask] '],
        ...,
        ['mfc, isSoftKeyboardVisible in WMS : * '],
        ['mfc, isSoftKeyboardVisible in WMS : * '],
        ['Calling a method in the system process without a qualified user: android.app.ContextImpl.startService:* android.content.ContextWrapper.startService:* android.content.ContextWrapper.startService:* com.blahblah.usbmountreceiver.USBMountReceiver.onReceive:* android.app.ActivityThread.handleReceiver:* ']],
       dtype=object),

The sequences are arrays containing lists of tokenized log messages. The plan was, after training the model, to get the embedding of a log event by multiplying its one-hot vector by the weight matrix... there is more to do, but I am stuck at getting the embeddings.
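For context, multiplying a one-hot vector by the embedding (weight) matrix is just a row lookup; here is a minimal NumPy sketch of that relationship, where vocab_size, embed_dim and W are toy stand-ins and not taken from the code above:

import numpy as np

vocab_size, embed_dim = 5, 3                # toy sizes for illustration
W = np.random.rand(vocab_size, embed_dim)   # stand-in for a trained weight matrix

event_index = 2                             # position of some event in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[event_index] = 1.0

# the product simply selects row `event_index` of W
embedding = one_hot @ W
assert np.allclose(embedding, W[event_index])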

I am a newbie trying to develop a solution for anomaly detection.


Solution

  • If you're using the Gensim library in Python for its Word2Vec implementation, it wants its corpus as a re-iterable sequence where each item is itself a list of string tokens.

    A plain Python list whose items are each lists of string tokens would work.

    Your seq is close, but:

    1. It doesn't need to be (& thus probably shouldn't be) a numpy array of objects.
    2. Each of your object items is a list (good), but each only has a single untokenized string inside (bad). You need to break those strings into the individual 'words' that you want the model to learn; see the sketch below.
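    A minimal sketch of that restructuring, assuming gensim 4.x (where the dimension argument is vector_size; older releases call it size) and reusing the df and eventTemplate column from the question:

    from gensim.models import Word2Vec
    from nltk.tokenize import word_tokenize

    # build the corpus as a plain list of token lists, one per log event template
    # (df and the eventTemplate column come from the question's code)
    corpus = [word_tokenize(template) for template in df['eventTemplate']]

    model = Word2Vec(
        sentences=corpus,
        vector_size=100,   # embedding dimension ('size' in gensim 3.x)
        window=5,
        min_count=1,       # keep rare log tokens
        workers=4,
    )

    # the trained model exposes per-token embeddings directly,
    # so no manual one-hot-times-weight-matrix step is needed
    first_token = corpus[0][0]
    print(model.wv[first_token])        # embedding vector for that token
    print(len(model.wv.key_to_index))   # size of the learned vocabulary

    Whether you re-tokenize the raw eventTemplate strings (as above) or reuse your eventToken column is up to you; the key point is that each item of the corpus must already be a list of individual token strings.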