python tensorflow text deep-learning nlp

How should integer codes be assigned to words in text data?


I've been looking at how to prepare a dataset for deep learning models.

If we have data like this,

data = [['this', 'is'], ['not', 'with']]

first the frequency of each word in the corpus is computed. Each word is then assigned an integer label based on its frequency.

The most frequent word is assigned 1, the next most frequent 2, and so on.
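
For concreteness, here is a minimal sketch of that scheme in plain Python. The corpus is a hypothetical extension of the one above (so that the words have different frequencies), and the names `counts`, `word_index`, and `encoded` are my own, not taken from any particular library.

```python
from collections import Counter

# Hypothetical corpus: the original two sentences plus two more,
# added only so that the word frequencies differ.
data = [['this', 'is'], ['not', 'with'], ['this', 'is', 'not'], ['this']]

# Count how often each word occurs across all sentences.
counts = Counter(word for sentence in data for word in sentence)

# Rank words by descending frequency, starting at 1.
# Words with equal counts keep the order in which they were first seen.
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace every word by its integer label.
encoded = [[word_index[word] for word in sentence] for sentence in data]

print(word_index)
print(encoded)
```

Here 'this' occurs three times and gets label 1, while 'with' occurs once and gets the largest label.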

My question is: why do we need to do that? Can't we just assign integer values to words at random? Does following that rule increase accuracy?


Solution

  • I doubt it has any effect on accuracy, unless maybe you're doing something unusual later on.

    I could see it having effects on:

    • performance: common words will be clustered together (near index zero) and hence are likely to end up in cache together
    • human interpretation/readability: strings/display output will tend to be "tidier", with common words needing fewer digits
    • easy handling of rare words: all index values over some threshold indicate that the word is rare, so it can be mapped to some placeholder or ignored (depending on how the model handles this)
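
The rare-word point can be sketched roughly as follows. This is only an illustration under my own assumptions: the corpus, the cutoff `MAX_RANK`, the placeholder index `UNK`, and the helper `encode` are all hypothetical, and a real tokenizer may reserve its indices differently.

```python
from collections import Counter

# Hypothetical corpus, used only for illustration.
corpus = [['this', 'is'], ['not', 'with'], ['this', 'is', 'not'], ['this']]

# Frequency-ranked vocabulary: the most frequent word gets index 1.
counts = Counter(word for sentence in corpus for word in sentence)
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

MAX_RANK = 2          # assumed cutoff: keep only the 2 most frequent words
UNK = MAX_RANK + 1    # assumed placeholder index for rare or unseen words


def encode(sentence):
    """Map words to indices, collapsing rare and unseen words to UNK."""
    ids = []
    for word in sentence:
        rank = word_index.get(word)  # None if the word was never seen
        # Because indices are ordered by frequency, "rare" is a single comparison.
        ids.append(rank if rank is not None and rank <= MAX_RANK else UNK)
    return ids


print(encode(['this', 'is', 'not', 'with', 'unseen']))
```

With randomly assigned indices, the same filtering would need a separate lookup of each word's count instead of one comparison against the threshold.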