python, deep-learning, word2vec, word-embedding, fasttext

Should I Pass Word2Vec and FastText Vectors Separately or Concatenate Them for a Deep Learning Model in Smart Contract Vulnerability Detection?


I have been working with word embeddings lately and have a question. Consider vulnerability detection in smart contracts: the input is a set of smart contract files, each labeled 0 or 1 to indicate whether it is vulnerable. I am producing two different word embeddings, word2vec and FastText, from the same input. My question is: is it right to concatenate the word2vec and FastText vectors and feed the result as input to a deep learning model, or should I pass each embedding's vectors separately to a deep learning model for feature extraction and then concatenate the extracted features for classification?

So far, I have run word2vec and passed its vectors to a CNN, run FastText and passed its vectors to a BiGRU, and then concatenated the extracted features. My question is: can I concatenate the vectors before doing feature extraction with the deep learning models? I am afraid that concatenating two embeddings trained on the same input will cause confusion, since the same input word ends up with two different vectors joined together. I am quite confused. If anybody has insight, kindly help. Thanks in advance.
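To make the two options concrete, here is a toy sketch of the first one, concatenating the two embeddings per token before any deep-learning feature extraction. The corpus, vector sizes, and token names below are placeholders, not my real pipeline:

```python
# Toy sketch of "concatenate the embeddings first" (early fusion).
# Corpus, dimensions, and tokens are placeholders, not real contract data.
import numpy as np
from gensim.models import Word2Vec, FastText

tokenized_contracts = [
    ["function", "withdraw", "call", "value", "balance"],
    ["function", "transfer", "require", "balance", "msg", "sender"],
]

w2v = Word2Vec(tokenized_contracts, vector_size=100, min_count=1, epochs=10)
ft = FastText(tokenized_contracts, vector_size=100, min_count=1, epochs=10)

def embed(tokens):
    # Each token becomes one 200-dim vector: 100 dims from word2vec + 100 from FastText.
    return np.array([np.concatenate([w2v.wv[t], ft.wv[t]]) for t in tokens])

seq = embed(tokenized_contracts[0])
print(seq.shape)  # (5, 200): one sequence that a single deep model would consume
```

The second option is what I have done so far: feed the word2vec vectors to a CNN and the FastText vectors to a BiGRU, then concatenate the two extracted feature vectors for classification.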


Solution

  • The generic answer, when you don't know which of multiple different ideas is better: try each one separately & see which evaluates as better on your robust, repeatable evaluations.

    (If you don't have a way to evaluate which is better, that's a bigger & more foundational thing to address than any of these other choices.)

    Given what you've said, other observations:

    The word2vec & FastText algorithms are very similar, and most of the experience supporting their use involves the fuzzy sorts of meanings inherent in natural-language text. The main advantage of FastText is its ability to synthesize better-than-nothing guess-vectors for words that weren't seen during training but share substrings, & thus hints of meaning, with known words.

    Smart contract source code (or bytecode) is sufficiently unlike natural language, in its narrow vocabulary, token frequencies, purposes, & rigorous execution model, that it's not immediately clear word-vectors can help. Word-vectors have often been useful with language-like token sets that aren't natural language, but even there, usually for discovering gradations of meaning. With smart contracts, the difference between "works as hoped" and "fatally vulnerable" may come down to a single misplaced operation or a subtly missed error case. Those are the kinds of highly contextual, ordering-based outcomes that word-vectors simply do not model. (At best, I think you might discover that competent coders tend to use more of certain kinds of operations or names than incompetent ones.)

    Further, that main advantage of FastText, synthesizing vectors for unknown but morphologically similar tokens, may be far less relevant for bytecode analysis, where unknown tokens are rare or even impossible. (If you're analyzing source code that includes freely chosen variable names, new unknown names may still carry hints of their relation to previously trained names.)
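    A quick toy illustration of that out-of-vocabulary difference (gensim; the tokens and sizes are arbitrary):

```python
# FastText can synthesize a vector for an unseen token from its character
# n-grams; plain word2vec simply has no entry for it. Toy data only.
from gensim.models import Word2Vec, FastText

sentences = [["safe", "transfer", "require"], ["unsafe", "delegatecall", "send"]]
w2v = Word2Vec(sentences, vector_size=16, min_count=1, epochs=5)
ft = FastText(sentences, vector_size=16, min_count=1, epochs=5)

print("transferFrom" in w2v.wv.key_to_index)   # False: no word2vec vector exists
print(ft.wv["transferFrom"].shape)             # (16,): built from shared subwords
# Accessing w2v.wv["transferFrom"] would raise a KeyError.
```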

    So: word-vectors may be an improper or underpowered tool for the sort of high-stakes, subtle classification you're attempting. But, as with the generic answer above: the only way to know, & to test ideas of what works or not, is to try each approach & evaluate it in some fair, repeatable way. (This even includes testing different ways of training the word-vectors from a single algorithm like word2vec itself: different modes, parameters, preprocessing, etc.)
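    As a rough sketch of what "evaluate in a fair, repeatable way" could look like, the snippet below scores each embedding setup with the same cross-validated downstream classifier. The averaging step and the logistic-regression stand-in are illustrative choices, not a recommended architecture:

```python
# Minimal, repeatable comparison of embedding setups on the same data & metric.
# Data, dimensions, and the downstream classifier are illustrative placeholders.
import numpy as np
from gensim.models import Word2Vec, FastText
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder corpus: tokenized contracts with 0/1 vulnerability labels.
contracts = [["function", "withdraw", "call", "value"],
             ["function", "transfer", "require", "balance"]] * 20
labels = np.array([1, 0] * 20)

def doc_vectors(docs, *models):
    # One fixed-size vector per contract: average the (possibly concatenated)
    # token vectors obtained from every supplied embedding model.
    return np.array([
        np.mean([np.concatenate([m.wv[t] for m in models]) for t in doc], axis=0)
        for doc in docs
    ])

def score(*models):
    X = doc_vectors(contracts, *models)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()

w2v = Word2Vec(contracts, vector_size=50, min_count=1, epochs=10)
ft = FastText(contracts, vector_size=50, min_count=1, epochs=10)

print("word2vec only     :", score(w2v))
print("fasttext only     :", score(ft))
print("concatenated (w+f):", score(w2v, ft))
```

    The same harness extends to your CNN/BiGRU variants: as long as every variant is scored on identical splits with the same metric, the comparison stays fair.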