Tags: r, nlp, text-mining, arabic, arabic-support

Separating an Arabic sentence into words gives a different number of words depending on the function used


I am trying to tokenize a single Arabic sentence, Verse 38:1 of the Quran, with the tm and tokenizers packages, but they split it differently: into 3 and 4 words, respectively. Can someone explain (1) why this happens and (2) what this difference means from NLP and Arabic-language points of view? Also, is one of them wrong? I am by no means an expert in NLP or Arabic, but I am trying to run the code.

Here is the code I tried:

library(tm)
library(tokenizers)
# Verse 38:1
verse <- "ص والقرآن ذي الذكر"

# The tm approach yields 3 words
a <- colnames(DocumentTermMatrix(Corpus(VectorSource(verse))))
a
# "الذكر"   "ذي"      "والقرآن"

# The tokenizers approach yields 4 words
b <- tokenizers::tokenize_words(verse)
b
# "ص"       "والقرآن" "ذي"      "الذكر"

I would expect them to be equal, but they differ. Can someone explain what is going on here?


Solution

  • It has nothing to do with NLP or the Arabic language; there are simply some defaults you have to watch out for. DocumentTermMatrix has a number of default parameters that can be changed via its control argument. Run ?termFreq to see them all.

    One of those defaults is wordLengths:

    An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.

    So, if we run the following, we get only 3 words because the fourth is dropped for being shorter than the minimum length:

    dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)))
    inspect(dtm)
    
    #### OUTPUT ####
    
    <<DocumentTermMatrix (documents: 1, terms: 3)>>
    Non-/sparse entries: 3/0
    Sparsity           : 0%
    Maximal term length: 7
    Weighting          : term frequency (tf)
    Sample             :
        Terms
    Docs الذكر ذي والقرآن
       1     1  1       1
    

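    To see which token falls below the default minimum, we can count each token's length. Note a subtlety visible in the output above: the two-letter ذي survives the c(3, Inf) default even though it has only 2 characters, which suggests tm may be comparing lengths in bytes (each Arabic letter is 2 bytes in UTF-8). Treat that as an observation from these outputs, not documented behavior:

    ```r
    verse <- "ص والقرآن ذي الذكر"
    tokens <- strsplit(verse, " ")[[1]]

    # Character vs. byte length of each token
    data.frame(token = tokens,
               chars = nchar(tokens),
               bytes = nchar(tokens, type = "bytes"))
    # ص is the shortest token (1 character, 2 bytes) and is the only
    # one dropped under the default wordLengths = c(3, Inf)
    ```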
    To return all words, regardless of length, we need to change c(3, Inf) to c(1, Inf):

    dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)),
                              control = list(wordLengths = c(1, Inf))
                              )
    inspect(dtm)
    
    #### OUTPUT ####
    
    <<DocumentTermMatrix (documents: 1, terms: 4)>>
    Non-/sparse entries: 4/0
    Sparsity           : 0%
    Maximal term length: 7
    Weighting          : term frequency (tf)
    Sample             :
        Terms
    Docs الذكر ذي ص والقرآن
       1     1  1 1       1
    

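    As a quick sanity check (assuming both packages are attached), the term set of the adjusted DocumentTermMatrix should now match the tokenizers output, as the two outputs above show:

    ```r
    library(tm)
    library(tokenizers)

    verse <- "ص والقرآن ذي الذكر"
    dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)),
                              control = list(wordLengths = c(1, Inf)))

    # Same four tokens from both approaches, ignoring order
    setequal(Terms(dtm), unlist(tokenize_words(verse)))
    # TRUE
    ```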
    The default makes sense for English, where words with fewer than three characters are mostly articles, prepositions, and the like, but it may make less sense for other languages. It is worth taking the time to experiment with the other parameters related to tokenizers, language settings, etc. The current results look good, but you may need to tweak some settings as your text becomes more complex.
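
    If the defaults keep getting in the way, the control list also accepts a custom tokenize function (see the tokenize option under ?termFreq), so you can have tm use the tokenizers tokenizer directly. A sketch, assuming both packages are installed:

    ```r
    library(tm)
    library(tokenizers)

    verse <- "ص والقرآن ذي الذكر"

    # Hand tm a custom tokenizer via the control list; as.character()
    # converts the document object before tokenize_words sees it
    dtm <- DocumentTermMatrix(
      Corpus(VectorSource(verse)),
      control = list(
        tokenize    = function(x) unlist(tokenize_words(as.character(x))),
        wordLengths = c(1, Inf)   # keep single-character tokens too
      )
    )
    Terms(dtm)
    ```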