nlp, text-processing, text-classification, fasttext

Text preprocessing for fasttext pretrained models


I want to use the pretrained fastText model for language detection: https://fasttext.cc/docs/en/language-identification.html . Where can I find the exact Python code for the text preprocessing used to train this specific model? I am not interested in general answers about how text should be prepared for models; I am looking for transformations identical to those used for training.


Solution

  • When Facebook engineers have been asked similar questions in their GitHub repository issues, they have usually pointed to one of two shell scripts in their public code (especially the 'normalize_text' functions within).

    https://github.com/facebookresearch/fastText/blob/master/tests/fetch_test_data.sh#L20

    normalize_text() {
      tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
        sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
            -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
            -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
    }
    

    https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh#L12

    normalize_text() {
        sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
            -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
            -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
            -e 's/«/ /g' | tr 0-9 " "
    }
    

    They've also referenced this page's section on 'Tokenization' (which names some libraries), and the academic paper that describes the earlier work on building word vectors for individual languages.

    None of these is guaranteed to exactly match what was used to create the pretrained classification models, and it's a bit frustrating that each release of such models doesn't include the exact code needed to reproduce it. But these sources seem to be as much detail as is available without getting direct answers or help from the team that created the models.
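Since the question asks for Python, here is a rough port of the two shell functions quoted above. It is only a sketch: the names `normalize_classification`, `normalize_wikimedia` and `apply_rules` are made up for this example, and nothing confirms that either script was used for the language identification models. The `__label__` prefix and the `myshuf` shuffle from the first script are left out on the assumption that they only matter when preparing training files. Also, `str.lower()` lowercases non-ASCII letters, which `tr '[:upper:]' '[:lower:]'` may not do depending on the locale.

```python
import re

# Ordered (old, new) pairs mirroring the sed expressions in
# tests/fetch_test_data.sh. Order matters: each rule sees the output
# of the previous one, as in a chain of sed -e expressions.
CLASSIFICATION_RULES = [
    ("'", " ' "),
    ('"', ""),
    (".", " . "),
    ("<br />", " "),
    (",", " , "),
    ("(", " ( "),
    (")", " ) "),
    ("!", " ! "),
    ("?", " ? "),
    (";", " "),
    (":", " "),
]

# Ordered pairs mirroring the sed expressions in get-wikimedia.sh.
# The duplicated "=" rule in the script is listed once, since the
# second pass has nothing left to replace.
WIKIMEDIA_RULES = [
    ("\u2019", "'"),   # right single quotation mark
    ("\u2032", "'"),   # prime
    ("''", " "),
    ("'", " ' "),
    ("\u201c", '"'),   # left double quotation mark
    ("\u201d", '"'),   # right double quotation mark
    ('"', ' " '),
    (".", " . "),
    ("<br />", " "),
    (", ", " , "),     # note: comma followed by a space
    ("(", " ( "),
    (")", " ) "),
    ("!", " ! "),
    ("?", " ? "),
    (";", " "),
    (":", " "),
    ("-", " - "),
    ("=", " "),
    ("*", " "),
    ("|", " "),
    ("\u00ab", " "),   # left-pointing double angle quotation mark
]

# tr 0-9 " " maps every ASCII digit to a space.
DIGITS_TO_SPACES = str.maketrans("0123456789", " " * 10)


def apply_rules(text, rules):
    """Apply literal replacements in order, like chained sed -e 's/old/new/g'."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


def normalize_classification(line):
    """Approximate normalize_text from fetch_test_data.sh (no label, no shuffle)."""
    line = apply_rules(line.lower(), CLASSIFICATION_RULES)
    # tr -s " " squeezes each run of spaces into a single space.
    return re.sub(" +", " ", line)


def normalize_wikimedia(line):
    """Approximate normalize_text from get-wikimedia.sh."""
    line = apply_rules(line, WIKIMEDIA_RULES)
    return line.translate(DIGITS_TO_SPACES)
```

For example, `normalize_classification("Hello, World!")` gives `"hello , world ! "`. The wikimedia variant neither lowercases nor squeezes spaces, so its output can contain runs of spaces; whether the result matches what the model saw in training remains unverified.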