
Tokenizing Japanese text in R: Only first line of the specified column is tokenized


I am trying to tokenize a collection of tweets with the Japanese tokenizer RMeCab, specifically the function RMeCabDF (for dataframes).

The documentation states the following usage:

RMeCabDF

Description

RMeCabDF takes data frames as the first argument, and analyzes the columns specified by the second argument. Blank data should be replaced with NA. If 1 is designated as the third argument, it returns each morpheme in its basic form.

Usage

RMeCabDF(dataf, coln, mypref, dic = "", mecabrc = "", etc = "")

Arguments

dataf data.frame

coln Column number or name that contains the Japanese sentences

mypref Default is 0, meaning the same morphemic forms that appear in the text are returned. If 1 is designated, their basic forms are returned instead.

dic to specify a user dictionary, e.g. ishida.dic

mecabrc not implemented (to specify mecab resource file)

etc other options to mecab

Following this, I use the code below to tokenize column 89 of the dataframe trump_ja:

trump_ja_tokens <- RMeCabDF(trump_ja, coln = 89)

This results in a list of length 1, even though the dataframe has 989 rows.


Where did my other rows go?

Do I have to tokenize row by row? If so, is there any way to automate the process so I can avoid typing (or using Excel to generate) nearly 1,000 lines of code?


Solution

  • You can use the RMeCab tokenizer with tidytext. You would set it up like so:

    library(dplyr)    # for the %>% pipe
    library(tidytext) # for unnest_tokens()

    df %>%
        unnest_tokens(word, text, token = RMeCab::RMeCabC)
    

    where df is your data frame, word is the new column you are going to create, and text is the existing column that contains the text you want to tokenize. The token argument of unnest_tokens() can take a function, which is exactly what you need here.
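
    As a minimal sketch, assuming MeCab plus the RMeCab and tidytext packages are installed locally (the data frame and its column names here are illustrative, not the original trump_ja data):

        library(dplyr)
        library(tidytext)
        # RMeCab requires a working MeCab installation on the system

        # Toy data frame standing in for your tweet data
        df <- data.frame(
            id   = 1:2,
            text = c("今日は良い天気です。", "明日は雨が降るでしょう。"),
            stringsAsFactors = FALSE
        )

        # Pass RMeCab's tokenizer as the token function; unnest_tokens()
        # then produces one row per morpheme, carrying id along
        df_tokens <- df %>%
            unnest_tokens(word, text, token = RMeCab::RMeCabC)

    Because unnest_tokens() handles the whole column at once, there is no need to loop over the 989 rows yourself.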