Tags: r, nlp, fasttext, language-detection

FastText language_identification in R returns too many arguments - how to match to texts?


FastText language_identification returns multiple predictions per original text, and does not indicate which predictions belong to which original document.

There are differing numbers of predictions per original document too -- their GitHub forums are closed now, but does anyone know how to match the output to the original texts?

Code:

DF = data.frame(doc_id = seq(1, 5),
speechtext = c("Hello. Fake text entry 1.", "Fake text entry 2", "more text", "Text in a
different language", "Hola"))

library(fastText)
# download .ftz pretrained model from https://fasttext.cc/docs/en/language-identification.html
file_ftz = system.file("language_identification/lid.176.ftz", package = "fastText")
lang1 = language_identification(DF$speechtext,
                                pre_trained_language_model_path = file_ftz,
                                verbose = TRUE)

I was expecting one prediction per original text, or at least a consistent number, or some way of marking which document the predictions align with.

I could guess by taking the highest probability within each run of a few output rows, but this doesn't seem optimal -- it looks like a bug.

(I tried adding intern = T as an argument, as suggested in the question "R - fasttext how to load output into a dataframe from command line", but it is not recognized as an argument.)


Solution

  • The first argument to fastText::language_identification() is defined as:

    "either a valid character string to a valid path where each line represents a different text extract or a vector of text extracts" (emphasis mine)

    You have line breaks in your input data:

    DF$speechtext[4]
    [1] "Text in a\ndifferent language"
    

    As one prediction is generated per line, you'll get two predictions from this element. You have two options:

    1. Remove new lines in your input data. This makes sense in this case.
    2. Keep new lines and map document IDs to each line. This makes sense if new lines might actually be in different languages.
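Before choosing between the two options, it may help to check which rows actually contain line breaks and how many predictions to expect. The following is a minimal base R sketch that rebuilds the example data from the question. It assumes, as described above, that one prediction is produced per line; the names has_break and lines_per_doc are only illustrative.

```r
# Example data from the question (element 4 contains a line break)
DF <- data.frame(
    doc_id = seq(1, 5),
    speechtext = c("Hello. Fake text entry 1.", "Fake text entry 2", "more text",
                   "Text in a\ndifferent language", "Hola"),
    stringsAsFactors = FALSE
)

# Which documents contain at least one vertical whitespace character?
has_break <- grepl("\\v", DF$speechtext, perl = TRUE)
DF$doc_id[has_break]

# Number of lines per document, and the total number of lines
# (the expected number of predictions under the one-per-line assumption)
lines_per_doc <- lengths(strsplit(DF$speechtext, "\\v", perl = TRUE))
sum(lines_per_doc)
```

If the total is larger than nrow(DF), the extra predictions come from the documents flagged in has_break.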

    Remove new lines

    If you replace new lines with spaces you will get the same number of predictions returned as input rows.

    In the regex below, I have used the PCRE \v which matches newlines and any character considered vertical whitespace. This now produces five rows, one relating to each input row.

    language_identification(gsub("\\v", " ", DF$speechtext, perl = TRUE), file_ftz)
    #    iso_lang_1   prob_1
    #        <char>    <num>
    # 1:         en 0.220767
    # 2:         en 0.388695
    # 3:         en 0.613707
    # 4:         en 0.757671
    # 5:         es 0.721487
    

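    \v covers several vertical whitespace characters besides the line feed (carriage return, vertical tab, form feed, next line, line separator and paragraph separator), so it should cover every kind of line break. For full details see the table of generic character types in the PCRE pattern documentation.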
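One detail worth checking on real data is Windows-style line endings. The sketch below uses made-up strings (not from the question) to show that a carriage return plus line feed counts as two vertical whitespace characters, so "\\v" leaves two spaces behind. Adding a + quantifier, which is my variation rather than part of the answer above, collapses each run into a single space.

```r
# Hypothetical inputs: a Windows line ending, a form feed, and a plain string
x <- c("one\r\ntwo", "three\ffour", "five")

# Replacing each vertical whitespace character individually
single <- gsub("\\v", " ", x, perl = TRUE)

# Replacing each run of vertical whitespace with one space
collapsed <- gsub("\\v+", " ", x, perl = TRUE)

single
collapsed
```

Either version removes the line breaks, so the number of predictions should still match the number of input rows; the quantifier only tidies the spacing.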

    Keep new lines and map document ID to each line

    Alternatively, if different lines of each input document might be in different languages, you may not want to remove new lines. In this case, you can predict each line separately and then map the document IDs to each line:

    # As before
    lang1 <- language_identification(DF$speechtext, file_ftz)
    
    # Add document IDs
    lang1$doc_id <- rep(
        DF$doc_id,
        lengths(strsplit(DF$speechtext, "\\v", perl = TRUE))
    )
    
    lang1
    #    iso_lang_1   prob_1 doc_id
    #        <char>    <num>  <int>
    # 1:         en 0.220767      1
    # 2:         en 0.388695      2
    # 3:         en 0.613707      3
    # 4:         en 0.932691      4
    # 5:         en 0.571937      4
    # 6:         es 0.721487      5
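If one label per document is still wanted after mapping the IDs, the line-level predictions can be collapsed again. The sketch below is one possible heuristic, not something fastText provides: it keeps the highest-probability line for each doc_id. The preds data frame is a plain data.frame copy of the output shown above, so the example runs without the fastText package; best is an illustrative name.

```r
# Line-level predictions, copied from the output above
preds <- data.frame(
    iso_lang_1 = c("en", "en", "en", "en", "en", "es"),
    prob_1 = c(0.220767, 0.388695, 0.613707, 0.932691, 0.571937, 0.721487),
    doc_id = c(1, 2, 3, 4, 4, 5),
    stringsAsFactors = FALSE
)

# For each document, keep the row with the largest probability
best <- do.call(
    rbind,
    lapply(split(preds, preds$doc_id), function(d) d[which.max(d$prob_1), ])
)
rownames(best) <- NULL

best
```

This returns one row per document. Whether the most confident line is the right summary depends on the data; for genuinely mixed-language documents it may be better to keep all lines.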