r, dplyr, tm, fuzzy-search

Classifying identical patterns in words using R


I want to conduct a text mining analysis, but I have run into some trouble. Using dput(), I have included a small part of my text.

text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 
3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 
3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 
3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L, 
17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L, 
7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg", 
"* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g", 
"197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g", 
"2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g", 
"3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g", 
"809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+", 
"MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL", 
"Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.", 
"SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL", 
"TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow"
), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER", 
"GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))

(The NA rows are accidental.) The text consists of product names taken from receipts.

I want to group similar names together.

For example, here I manually picked out MAKFA Makar (a Ukrainian name). I found 7 rows with the root or key word "MAKFA Makar":

Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g

All of these product positions share the same root word. The class label must stay readable: MAKFA Makar can't become something like MFAMKR. As output, I want to get:

                                                Initially                 class
1                       Pasta Makfa snail flow-pack 450 g.          MAKFA Makar.
2                  MAKFA Macaroni feathers like. in / with          MAKFA Makar.
3                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
4                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
5                              6788 MAKFA Makar.perya 450g          MAKFA Makar.
6                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
7                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
8          * 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35                  kolb
9               * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg             Spikachki
10                                         809 Bananas 1kg              Bananas 
11                                              Lemons 55+                Lemons
12                           Napkins paper color 100pcs PL        Napkins paper 
13                         SOFT Cotton sticks 100 PE (BELL         Cotton sticks
14                     SHEBEKINSKIE Macaroni Butterfly №40 SHEBEKINSKIE Macaroni
15 * 3426789 WH.The corn rav guava / yagn.d / Cat SEED 85g              CAT seed
16                        FetaXa Cheese product 60% 400g (               Cheese 
17          3491144 LIP.NAP.ICE TEA green yellow 0.5 liter                  TEA 
18                  2030918 MARIA TRADITIONAL Biscuit 180g              Biscuit 
19                                          197 Onion 1 kg                 Onion
20                         TOBUS steering-wheel 0.5kg flow        steering-wheel
21                     Package "Magnet" white (Plastiktre) Package  (Plastiktre)
22                    * 2108609 SLOB.Mayon.OLIVK.67% 400ml                 Mayon
23                            TENDER AGE Cottage cheese 10        Cottage cheese

How can I classify the products by root words (or rather, by the presence of an identical pattern in the words, e.g. Makar, Makfa, cheese)?


Solution

  • I think you can get where you want by cleansing and then clustering your texts - here's a starter:

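    library(quanteda)
    library(quanteda.textstats)  # provides textstat_simil() in quanteda >= 3.0
    library(tidyverse)

    # Keep only the rows that actually contain product names
    text <- text[1:24, ]

    hc <- text %>% 
      pull(GOODS_NAME) %>% 
      as.character() %>% 
      # Tokenize, dropping numbers, punctuation, symbols and separators
      quanteda::tokens(
        remove_numbers = TRUE,
        remove_punct = TRUE,
        remove_symbols = TRUE,
        remove_separators = TRUE
      ) %>% 
      quanteda::tokens_tolower() %>% 
      # Drop tokens that start with a digit (article codes, weights such as "450g")
      quanteda::tokens_remove(valuetype = "regex", pattern = "^\\d.*") %>% 
      quanteda::dfm() %>% 
      # Pairwise Jaccard similarity between the product names
      textstat_simil(method = "jaccard") %>% 
      # Negate so that similar names end up close together. The as.matrix() /
      # as.dist() round trip is an assumption for recent quanteda versions,
      # where the result is no longer a plain dist object.
      as.matrix() %>% 
      magrittr::multiply_by(-1) %>% 
      as.dist() %>% 
      `attr<-`("Labels", as.character(text$GOODS_NAME)) %>% 
      hclust(method = "average")

    # Save the dendrogram to a temporary PDF and open it
    pdf(tf <- tempfile(fileext = ".pdf"), width = 20, height = 10)
    plot(hc)
    dev.off()
    shell.exec(tf)  # Windows only; open the file manually on other systems

    # Cut the tree at an average similarity of 0.1 and list the groups
    clusters <- cutree(hc, h = -0.1)
    split(text, clusters)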
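
The starter above stops at the clusters, while the question also asks for a readable class column. A minimal base-R sketch of the same idea (clean, compare by Jaccard similarity, cluster, then label) might look like the following. It is not the answer's quanteda pipeline: the helpers `tokenize`, `jaccard` and `top_token`, the cut height of 0.95 and the "most frequent token" labeling rule are all assumptions made for illustration, and the six names are just a subset of the question's data.

```r
# Illustrative subset of product names taken from the question's data
goods <- c(
  "Pasta Makfa snail flow-pack 450 g.",
  "MAKFA Macaroni feathers like. in / with",
  "2013077 MAKFA Makar.RAKERS 450g",
  "6788 MAKFA Makar.perya 450g",
  "809 Bananas 1kg",
  "Lemons 55+"
)

# Hypothetical helper: lower-case, split on anything that is not a letter,
# and keep only unique tokens longer than two characters
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(toks[nchar(toks) > 2])
}

# Hypothetical helper: Jaccard similarity of two token sets
jaccard <- function(a, b) {
  u <- length(union(a, b))
  if (u == 0) return(0)  # guard against two empty token sets
  length(intersect(a, b)) / u
}

token_sets <- lapply(goods, tokenize)

# Pairwise similarity matrix between all product names
n <- length(token_sets)
sim <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(token_sets[[i]], token_sets[[j]])
  }
}

# Average-linkage clustering on 1 - similarity; the cut height is an assumption
hc <- hclust(as.dist(1 - sim), method = "average")
cluster <- cutree(hc, h = 0.95)

# Hypothetical labeling rule: the most frequent token within each cluster
top_token <- function(sets) {
  names(sort(table(unlist(sets)), decreasing = TRUE))[1]
}
label <- vapply(split(token_sets, cluster), top_token, character(1))

result <- data.frame(
  Initially = goods,
  cluster = cluster,
  class = unname(label[as.character(cluster)]),
  stringsAsFactors = FALSE
)
print(result)
```

On this subset the four MAKFA rows land in one cluster labeled "makfa", while bananas and lemons each get their own. A single most-frequent token will not reproduce a two-word label such as "MAKFA Makar.", so on the full data the labeling rule and the cut height would need tuning.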