I have a million records which contain a field called product that is filled in by users as free text. Sometimes the field is blank; sometimes it contains things such as "batteries" or "laptop", and other times it contains something like "bateries" or "lptp", which to the casual eye are the same but to a computer are very different.
I have no idea of the variation across the different types of products, as they change each day (again, we get about a million records a day). I want to be able to clean the data, and I'm looking for someone to point me in the right direction: research papers, blogs, anything really that can help me categorize all these things.
I have seen hierarchical clustering (Text clustering with Levenshtein distances).
I also had the idea of somehow getting hold of Amazon's taxonomy and trying to map the products to it (I'm still investigating whether this is possible). Google has something similar.
I'm using R to prototype, and I would just like to show that it can be done.
Thank you all for your time.
I am not sure if you need clustering methods.
What you do need is some distance measure between strings (e.g. Levenshtein). You would most likely also need some sort of lexicon, but that has the same downside as the approach I suggest below: if several correct words have the same distance to a misspelt word, you cannot decide which one is right. Say your lexicon contains "car", "cart" and "card" and you see the misspelt word "carx"; you would need context information to break the tie (a car is more expensive than a card, for example).
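A quick illustration of that tie, using base R's adist (a generalized Levenshtein distance, which I also use below): all three lexicon words are exactly one edit away from "carx", so the distance alone cannot decide.
adist("carx", c("car", "cart", "card"))
#      [,1] [,2] [,3]
# [1,]    1    1    1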
Instead I will use frequencies of words. I think if you want to show that your problem is generally solvable, this should suffice. My idea is that a misspelt word occurs rarely. A rarely occurring correct word could cause problems in my approach, but only if other correct words within a small distance of it are present.
I will use the adist function. Note that adist actually ships with base R (in the utils package), not with the stringdist package; if you prefer, stringdist offers comparable functions such as stringdistmatrix.
My example is:
words <- c("monitor", "laptop", "mouse", "mouse", "keybsard", "monitor",
"mous", "keyboard", "keyboard", "monitor", "keybxard", "monitor",
"motse", "monitod", "laptop", "keyboard", "laptop", "mousq",
"laptop", "mobitor", "keybolrd", "monitor", "mouse", "laptop",
"monitor", "moute", "mouwe", "mwuse", "tonitor", "ltptop", "keybovrd",
"monitor", "laptop", "moase", "keyboard", "keyboard", "keywoard",
"laptnp", "laptop", "laptop")
Let's look at the frequencies:
freq_table <- as.data.frame(table(words), stringsAsFactors = F)
It looks like this:
words Freq
1 keyboard 5
2 keybolrd 1
3 keybovrd 1
4 keybsard 1
5 keybxard 1
6 keywoard 1
7 laptnp 1
8 laptop 8
9 ltptop 1
10 moase 1
11 mobitor 1
12 monitod 1
13 monitor 7
14 motse 1
15 mous 1
16 mouse 3
17 mousq 1
18 moute 1
19 mouwe 1
20 mwuse 1
21 tonitor 1
Now I separate into 'good' and 'bad' words:
s <- split(freq_table, freq_table$Freq < 3)
good <- s[['FALSE']]
good <- good[order(-good$Freq),]$words
bad <- s[['TRUE']]$words
I have split the frequency table into entries that occur 3 or more times and entries that occur fewer than 3 times. I will explain later why I sorted the good ones. Now we have good:
[1] "laptop" "monitor" "keyboard" "mouse"
and bad:
[1] "keybolrd" "keybovrd" "keybsard" "keybxard" "keywoard" "laptnp" "ltptop" "moase"
[9] "mobitor" "monitod" "motse" "mous" "mousq" "moute" "mouwe" "mwuse"
[17] "tonitor"
Now I calculate the distance matrix between the good and the bad words:
dis <- adist(bad,good)
and check whether there are good words in the 'neighbourhood' of each bad word:
hits <- sapply(1:NROW(dis), function (i) which(dis[i,] < 3)[1])
I always take the first hit. Since we sorted the good words by frequency, the first hit is the most frequent word among the candidates. That way I try to avoid a word that is misspelt, but often in the same way, being used as the correction. This will not always work; it is a heuristic.
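To see why the ordering matters, here is a hypothetical distance row in which two good words fall below the threshold; which(...)[1] then picks the more frequent one, because good is sorted as "laptop" "monitor" "keyboard" "mouse":
# hypothetical row of dis: both monitor (col 2) and mouse (col 4) are within distance 3
row <- c(laptop = 5, monitor = 2, keyboard = 6, mouse = 2)
good[which(row < 3)[1]]
# [1] "monitor"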
Now I generate some sort of look-up table df:
bad_corr <- bad
ind <- which(!is.na(hits))
bad_corr[ind] <- good[hits[ind]]
df <- data.frame(bad, bad_corr, stringsAsFactors = F)
It looks like:
bad bad_corr
1 keybolrd keyboard
2 keybovrd keyboard
3 keybsard keyboard
4 keybxard keyboard
5 keywoard keyboard
6 laptnp laptop
7 ltptop laptop
8 moase mouse
9 mobitor monitor
10 monitod monitor
11 motse mouse
12 mous mouse
13 mousq mouse
14 moute mouse
15 mouwe mouse
16 mwuse mouse
17 tonitor monitor
I use this to replace the misspelt words. Summarized, the whole function is:
correct <- function (words, minfreq = 3, sensitivity = 3) {
  freq_table <- as.data.frame(table(words), stringsAsFactors = F)
  s <- split(freq_table, freq_table$Freq < minfreq)
  # guard: if all words are frequent, or none is, there is nothing to correct
  if (is.null(s[['TRUE']]) || is.null(s[['FALSE']])) return(words)
  good <- s[['FALSE']]
  good <- good[order(-good$Freq), ]$words
  bad <- s[['TRUE']]$words
  dis <- adist(bad, good)
  # first good word within 'sensitivity' edits (NA if none); most frequent first
  hits <- sapply(seq_len(NROW(dis)), function (i) which(dis[i, ] < sensitivity)[1])
  bad_corr <- bad
  ind <- which(!is.na(hits))
  bad_corr[ind] <- good[hits[ind]]
  df <- data.frame(bad, bad_corr, stringsAsFactors = F)
  # map every occurrence of a bad word to its correction
  ind <- match(words, df$bad)
  words[which(!is.na(ind))] <- df$bad_corr[ind[!is.na(ind)]]
  words
}
sensitivity says how 'far away' a misspelt word is allowed to be. minfreq means that every word occurring fewer than minfreq times is treated as possibly misspelt (but it is only replaced if there is a more frequent word within a string distance smaller than sensitivity). The guard at the top of the function catches the edge case where there are no rare words (or no frequent ones) at all, which would otherwise cause an error. The function is still a heuristic, so if you want to use it you should refine it further.
The result of all this:
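It can be reproduced by binding the original and the corrected vector (the column name correct.words. is auto-generated by data.frame):
data.frame(words, correct(words))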
words correct.words.
1 monitor monitor
2 laptop laptop
3 mouse mouse
4 mouse mouse
5 keybsard keyboard
6 monitor monitor
7 mous mouse
8 keyboard keyboard
9 keyboard keyboard
10 monitor monitor
11 keybxard keyboard
12 monitor monitor
13 motse mouse
.. ...... ........
Finally, if you had a taxonomy, you would omit the frequency part and store the words of the taxonomy in good.
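A minimal sketch of that variant (correct_with_taxonomy is a hypothetical name; the taxonomy vector is assumed to hold the canonical product names):
correct_with_taxonomy <- function (words, taxonomy, sensitivity = 3) {
  # every word is a candidate for correction; the taxonomy plays the role of 'good'
  dis <- adist(words, taxonomy)
  hits <- sapply(seq_len(NROW(dis)), function (i) which(dis[i, ] < sensitivity)[1])
  ind <- which(!is.na(hits))
  words[ind] <- taxonomy[hits[ind]]
  words
}
# e.g. correct_with_taxonomy(c("lptp", "bateries"), c("laptop", "batteries"))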
Good luck!