I am using the kwic function from the quanteda package in R to look up phrases in Kurdish. In Kurdish, some compound words and phrases are joined by a half-space. When I use a phrase containing a half-space, R treats it as a typo (the red dot) and does not let me run the command. Is there a way to fix this?
The half-space, or zero-width non-joiner, is used in some languages to prevent a ligature from forming between two characters. Its Unicode character is '\u200c', and in some text editors it can be typed with SHIFT+SPACE.
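Since the character is invisible on screen, a quick way to confirm that a string actually contains it is to test for it programmatically (a minimal R sketch; the sample string is illustrative):

```r
# U+200C cannot be seen on screen, so check for it directly
txt <- "له\u200cلایه\u200cنی"  # an illustrative compound containing half-spaces
grepl("\u200c", txt)           # TRUE if the zero-width non-joiner is present
utf8ToInt("\u200c")            # its code point: 8204 (hex 200C)
```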
kwic(cleantest, phrase("لهلایهنی"), window = 1)
Here is an image of the error:
Also, do you know of a Sorani Kurdish POS tagger or a stemmer?
Interesting problem. We have been thinking about this here and here recently.
Apparently the problem arises in the conversion of the phrase to a list, which relies on splitting on whitespace; the half-space is not ordinary whitespace, so it is not split. Here is a workaround that converts the half-spaces into full spaces:
txt <- "رۆژنامهكانى بهریتانیا، ئاماژه بۆ ئهوه دهكهن كه سهرهڕای ئهوهی ڤینگهر دهزانێت له وهرزی داهاتوودا گهورهترین كێشهی لهلایهنی گۆڵپارێزی دهبێت، بهڵام لهگهڵ ئهوهشدا ئاماده نییه بههیچ .شێوهیهك پیتهر چیك لهسهر كورسی یهدهگ دابنێت "
# replace the zero-width non-joiner (U+200C) with a regular space before phrase()
phrase2 <- function(x) phrase(gsub("\u200c", " ", x))
kwic(txt, phrase2("لهلایهنی"), window = 1)
# [text1, 33:35] ی | له لایه نی | گۆڵپارێزی
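The opposite normalization is also worth sketching: strip the half-space from the text itself, so each compound survives as a single unbroken token and can be matched without phrase(). This is an assumption on my part, not part of the workaround above; tokens() and kwic() are standard quanteda functions:

```r
library(quanteda)

# remove the ZWNJ from the text, so the compound stays one token
txt2 <- gsub("\u200c", "", txt)
kwic(tokens(txt2), "لهلایهنی", window = 1)
```

Which direction you normalize (ZWNJ to space, or ZWNJ removed) should match however the rest of your corpus was cleaned, so that patterns and tokens agree.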
And no, I do not know of a Sorani Kurdish POS tagger or stemmer, although the stopwords package does include Kurdish stopwords:
stopwords("ku", source = "stopwords-iso")
# [1] "ئێمە" "ئێوە" "ئەم" "ئەو" "ئەوان" "ئەوەی"
# [7] "بۆ" "بێ" "بێجگە" "بە" "بەبێ" "بەدەم"
# [13] "بەردەم" "بەرلە" "بەرەوی" "بەرەوە" "بەلای" "بەپێی"
# [19] "تۆ" "تێ" "جگە" "دوای" "دوو" "دە"
# [25] "دەکات" "دەگەڵ" "سەر" "لێ" "لە" "لەبابەت"
# [31] "لەباتی" "لەبارەی" "لەبرێتی" "لەبن" "لەبەر" "لەبەینی"
# [37] "لەدەم" "لەرێ" "لەرێگا" "لەرەوی" "لەسەر" "لەلایەن"
# [43] "لەناو" "لەنێو" "لەو" "لەپێناوی" "لەژێر" "لەگەڵ"
# [49] "من" "ناو" "نێوان" "هەر" "هەروەها" "و"
# [55] "وەک" "پاش" "پێ" "پێش" "چەند" "کرد"
# [61] "کە" "ی"