Hi all: I cannot figure out the code required to get embeddings from a word2vec model.
Here is how my df is structured (it is an Android-based log):
logDateTime (ts) | lineNum (int) | processID (int) | threadID (int) | priority (str) | app (str) | message (str) | eventTemplate (str) | eventID (str)
Essentially, I created a unique subset of events out of log messages and assigned a template with an associated id:
def eventCreation(df):
    # Mask numbers and booleans so messages with the same shape collapse into one template
    df['eventTemplate'] = df['message'].str.replace(r'\d+', '*', regex=True)
    df['eventTemplate'] = df['eventTemplate'].str.replace('true', '*', regex=False)
    df['eventTemplate'] = df['eventTemplate'].str.replace('false', '*', regex=False)
    # Assign one ID per distinct template, in order of first appearance
    df['eventID'] = df.groupby(df.eventTemplate.tolist(), sort=False).ngroup() + 1
    df['eventID'] = 'E' + df['eventID'].astype(str)
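To illustrate the templating step above, here is a minimal sketch on a toy frame (the sample messages are made up; the replacements and ID assignment mirror eventCreation):

```python
import pandas as pd

df = pd.DataFrame({"message": [
    "Service started in 42 ms",
    "Service started in 7 ms",
    "flag set to true",
]})

# Mask variable parts so messages with the same shape share one template
df["eventTemplate"] = df["message"].str.replace(r"\d+", "*", regex=True)
df["eventTemplate"] = df["eventTemplate"].str.replace("true", "*", regex=False)
df["eventTemplate"] = df["eventTemplate"].str.replace("false", "*", regex=False)

# One ID per distinct template, in order of first appearance
df["eventID"] = "E" + (df.groupby("eventTemplate", sort=False).ngroup() + 1).astype(str)

print(df[["eventTemplate", "eventID"]])
```

The two "Service started" messages collapse into one template (E1) and the boolean message becomes a second (E2).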
def seqGen(arr, k):
    # yield every length-k sliding window over arr
    for i in range(len(arr)-k+1):
        yield arr[i:i+k]
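For clarity, here is what seqGen produces on a small list of event IDs (the sample data is invented for illustration):

```python
def seqGen(arr, k):
    # Slide a window of length k over arr, one step at a time
    for i in range(len(arr) - k + 1):
        yield arr[i:i + k]

events = ["E1", "E2", "E3", "E4", "E5"]
windows = list(seqGen(events, 3))
# windows == [["E1", "E2", "E3"], ["E2", "E3", "E4"], ["E3", "E4", "E5"]]
```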
import os
import pandas as pd
from nltk.tokenize import word_tokenize

#define the variables here
cwd = os.getcwd()
#create a dataframe of the logs concatenated
df = pd.DataFrame.from_records(process_files(cwd, getFiles))
# call functions to establish df
cleanDf(df)
featureEng(df)
eventCreation(df)
df['eventToken'] = df.eventTemplate.apply(lambda x: word_tokenize(x))
seq = []
eventArray = df[["eventToken"]].to_numpy()
for sequence in seqGen(eventArray, 9):
    seq.append(sequence)  # append the window itself, not the whole array
So, 'seq' ends up looking like this:
[array([['[*,com.blah.blach.blahMainblach] '],
['[*,*,*,com.blah.blah/.permission.ui.blah,finish-imm] '],
['[*,*,*,*,startingNewTask] '],
...,
['mfc, isSoftKeyboardVisible in WMS : * '],
['mfc, isSoftKeyboardVisible in WMS : * '],
['Calling a method in the system process without a qualified user: android.app.ContextImpl.startService:* android.content.ContextWrapper.startService:* android.content.ContextWrapper.startService:* com.blahblah.usbmountreceiver.USBMountReceiver.onReceive:* android.app.ActivityThread.handleReceiver:* ']],
dtype=object),
The sequences are arrays containing lists of tokenized log messages. The plan was, after training the model, to get the embedding of a log event by multiplying its one-hot vector by the weight matrix... there is more to do, but I am stuck at getting the embeddings.
I am a newbie trying to develop a solution for anomaly detection.
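As a sanity check on the one-hot idea: multiplying a one-hot vector by the weight matrix is equivalent to selecting one row of that matrix. A minimal numpy sketch (toy matrix, not a trained model):

```python
import numpy as np

vocab_size, dim = 5, 3
# Toy weight matrix standing in for the model's input embedding matrix
W = np.arange(vocab_size * dim, dtype=float).reshape(vocab_size, dim)

token_index = 2
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0

# The product just picks out row `token_index` of W
embedding = one_hot @ W
assert np.array_equal(embedding, W[token_index])
```

So once the model is trained, the embedding of a token is simply a row lookup; no explicit matrix multiplication is needed.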
If you're using the Gensim library in Python for its Word2Vec implementation, it wants its corpus as a re-iterable sequence where each item is itself a list of string tokens. A plain list whose items are each a list of string tokens would work.
Your seq is close, but it is a numpy array of object dtype. Each of its items is a list (good), but each list holds only a single untokenized string (bad). You need to break those strings into the individual 'words' that you want the model to learn.