
how to form a vocabulary based tfidf sparklyr dataframe


I need to build a tf-idf matrix/dataframe with terms/words as column names instead of indices using sparklyr. I went with ft_count_vectorizer because it stores the vocabulary, but after computing the tf-idf I am unable to map the terms to their tf-idf values. Any help in this space would be highly appreciated. Here is what I tried:

tf_idf <- cleantext %>%
  ft_tokenizer("Summary", "tokenized") %>%
  ft_stop_words_remover(input_col = "tokenized", output_col = "clean_words",
                        stop_words = ml_default_stop_words(sc, language = "english")) %>%
  ft_count_vectorizer(input_col = "clean_words", output_col = "tffeatures") %>%
  ft_idf(input_col = "tffeatures", output_col = "tfidffeatures")

tf_idf is a spark_tbl that also includes clean_words (the vocabulary) and the tf-idf features; both are stored as list columns. I need to output the tf-idf features with the clean_words as the column headings. What is the best way to do this? I am stuck here. Any assistance or help would be highly appreciated.


Solution

  • While technically possible, an operation like this has few practical applications. Apache Spark is not optimized for handling execution plans with wide data, such as the one that might be generated by expanding vectorized columns.

    If you still want to follow through, you'll have to extract the vocabulary preserved by the CountVectorizer. One possible approach is to use ML Pipelines (you can check my answer to how to train a ML model in sparklyr and predict new values on another dataframe? for a detailed explanation).

    • Using the transformers you already have, you can define the Pipeline and fit the PipelineModel:

      model <- ml_pipeline(
        ft_tokenizer(sc, "Summary", "tokenized"),
        ft_stop_words_remover(sc, input_col = "tokenized",
                              output_col = "clean_words",
                              stop_words = ml_default_stop_words(sc, language = "english")),
        ft_count_vectorizer(sc, input_col = "clean_words",
                            output_col = "tffeatures"),
        ft_idf(sc, input_col = "tffeatures", output_col = "tfidffeatures")
      ) %>% ml_fit(cleantext)
      
    • Then retrieve the CountVectorizerModel and extract the vocabulary:

      vocabulary <- ml_stage(model, "count_vectorizer")$vocabulary %>% unlist()
      
    • Finally transform the data, apply sdf_separate_column, and select the columns of interest:

      ml_transform(model, cleantext) %>% 
        sdf_separate_column("tfidffeatures", vocabulary) %>% 
        select(one_of(vocabulary))
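
    To see what sdf_separate_column achieves here, a toy sketch in plain base R (hypothetical vocabulary and tf-idf values, no Spark connection needed): each position of a row's tf-idf vector is mapped to the vocabulary term at the same index, producing one named column per term.

      # Hypothetical vocabulary, in the order the CountVectorizerModel reports it
      vocabulary <- c("spark", "data", "model")
      # One row's tf-idf vector (made-up values); position i corresponds to vocabulary[i]
      tfidf_row <- c(0.12, 0.00, 0.87)
      # Name each value by its term, then widen into one column per term
      as.data.frame(setNames(as.list(tfidf_row), vocabulary))
      # one row with columns spark = 0.12, data = 0, model = 0.87

    On a real Spark DataFrame this per-row widening is exactly why the plan gets expensive: a vocabulary of thousands of terms becomes thousands of columns.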