FastText language_identification returns multiple predictions per original text, and also fails to indicate which belong to which original document.
The number of predictions per original document also varies. Their GitHub forums are closed now, but does anyone know how to match the output to the original texts?
Code:
DF = data.frame(doc_id = seq(1, 5),
speechtext = c("Hello. Fake text entry 1.", "Fake text entry 2", "more text", "Text in a
different language", "Hola"))
library(fastText)
# download .ftz pretrained model from https://fasttext.cc/docs/en/language-identification.html
file_ftz = system.file("language_identification/lid.176.ftz", package = "fastText")
lang1 = language_identification(DF$speechtext,
                                pre_trained_language_model_path = file_ftz,
                                verbose = TRUE)
I was expecting one prediction per original text, or at least a consistent number, or some way of marking which document the predictions align with.
Really I could guess based on the largest number per series of a few elements outputted, but this doesn't seem optimal -- it does seem like a bug.
(I tried adding intern = T as an argument, as suggested in the question "R - fasttext how to load output into a dataframe from command line", but it is not recognized as an argument.)
The first argument to fastText::language_identification() is defined as:
"either a valid character string to a valid path where each line represents a different text extract or a vector of text extracts" (the key phrase being "each line represents a different text extract").
You have line breaks in your input data:
DF$speechtext[4]
[1] "Text in a\ndifferent language"
As one prediction is generated per line, you'll get two predictions from this element. You have two options:
If you replace new lines with spaces you will get the same number of predictions returned as input rows.
In the regex below, I have used the PCRE \v, which matches newlines and any other character considered vertical whitespace. This now produces five rows, one for each input row.
language_identification(gsub("\\v", " ", DF$speechtext, perl = TRUE), file_ftz)
#    iso_lang_1   prob_1
#        <char>    <num>
# 1:         en 0.220767
# 2:         en 0.388695
# 3:         en 0.613707
# 4:         en 0.757671
# 5:         es 0.721487
\v includes several vertical space characters (such as form feed and line separator), so it should cover all possible types of new line. For full details, see the PCRE documentation.
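The substitution can be checked on the input strings alone, without loading the model. The following is a minimal sketch in base R; the names texts, has_break and flat are illustrative and not part of the fastText package.

```r
# Same strings as in the question; element 4 contains a line break
texts <- c("Hello. Fake text entry 1.", "Fake text entry 2", "more text",
           "Text in a\ndifferent language", "Hola")

# Flag elements containing vertical whitespace (newline, carriage return, ...)
has_break <- grepl("\\v", texts, perl = TRUE)
which(has_break)
# [1] 4

# Replace each vertical whitespace character with a single space
flat <- gsub("\\v", " ", texts, perl = TRUE)
flat[4]
# [1] "Text in a different language"

# Still one element per input, and no line breaks remain
length(flat) == length(texts)
# [1] TRUE
```

Note that a Windows-style "\r\n" would become two spaces with this pattern, since each character is replaced separately; using "\\v+" instead would collapse such a run into one space.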
Alternatively, if different lines of each input document might be in different languages, you may not want to remove new lines. In this case, you can predict each line separately and then map the document IDs to each line:
# As before
lang1 <- language_identification(DF$speechtext, file_ftz)
# Add document IDs
lang1$doc_id <- rep(
  DF$doc_id,
  lengths(strsplit(DF$speechtext, "\\v", perl = TRUE))
)
lang1
#    iso_lang_1   prob_1 doc_id
#        <char>    <num>  <int>
# 1:         en 0.220767      1
# 2:         en 0.388695      2
# 3:         en 0.613707      3
# 4:         en 0.932691      4
# 5:         en 0.571937      4
# 6:         es 0.721487      5
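The line counts that drive rep() can also be inspected without running the model. This is a minimal sketch in base R, assuming that no element is empty or ends in a line break (strsplit() drops trailing empty strings, and how the model treats blank lines is not verified here). The names n_lines and line_doc_id are illustrative only.

```r
# Same strings as in the question; element 4 contains a line break
texts <- c("Hello. Fake text entry 1.", "Fake text entry 2", "more text",
           "Text in a\ndifferent language", "Hola")
doc_id <- seq_along(texts)

# Number of lines in each document
n_lines <- lengths(strsplit(texts, "\\v", perl = TRUE))
n_lines
# [1] 1 1 1 2 1

# Document ID for each line, in the order the predictions are returned
line_doc_id <- rep(doc_id, n_lines)
line_doc_id
# [1] 1 2 3 4 4 5
```

Before assigning the IDs, it may be worth confirming that length(line_doc_id) equals nrow(lang1), so that a mismatch caused by blank lines is caught rather than silently recycled.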