I have a million records which contain a field called product that is filled in by users as free text. Sometimes the field is blank; sometimes it contains things such as "batteries" or "laptop", and other times it contains something like "bateries" or "lptp", which to the casual eye are the same but to a computer are very different.
I have no idea of the variation across the different types of products, as they change each day (again, we get about a million records a day). I want to be able to clean the data, and I'm looking for someone to point me in the right direction: research papers, blogs, anything really that can help me categorize all these things.
I have seen hierarchical clustering (Text clustering with Levenshtein distances).
I also had the idea of somehow getting hold of Amazon's taxonomy and trying to map the products to it (I'm still investigating whether this is possible). Google has something similar.
I'm using R to prototype, and I would just like to show that it can be done.
Thank you all for your time.
I am not sure if you need clustering methods.
What you do need is some distance measure between strings (e.g. Levenshtein). You would most likely also need some sort of lexicon, but that has the same downside as the approach I suggest below: if several correct words have the same distance to a misspelt word, you cannot decide which one is right. Say your lexicon contains "car", "cart" and "card" and you see the misspelt word "carx"; you would need context information to break the tie (a car is more expensive than a card, for example).
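A quick illustration of that tie, using base R's adist (a generalized Levenshtein distance, which I also use below): all three lexicon words are exactly one edit away from "carx", so the distance alone cannot decide.
adist("carx", c("car", "cart", "card"))
#      [,1] [,2] [,3]
# [1,]    1    1    1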
Instead I will use frequencies of words. I think if you want to show that your problem is generally solvable, this should suffice. My idea is that a misspelt word occurs rarely. A rarely occurring correct word could cause problems in my approach, but only if other correct words within a small distance of it are present.
I will use the adist function. Note that adist actually ships with base R (in the utils package), not with the stringdist package; if you prefer, stringdist offers comparable functions such as stringdistmatrix.
My example is:
words <- c("monitor", "laptop", "mouse", "mouse", "keybsard", "monitor",
"mous", "keyboard", "keyboard", "monitor", "keybxard", "monitor",
"motse", "monitod", "laptop", "keyboard", "laptop", "mousq",
"laptop", "mobitor", "keybolrd", "monitor", "mouse", "laptop",
"monitor", "moute", "mouwe", "mwuse", "tonitor", "ltptop", "keybovrd",
"monitor", "laptop", "moase", "keyboard", "keyboard", "keywoard",
"laptnp", "laptop", "laptop")
Let's look at the frequencies:
freq_table <- as.data.frame(table(words), stringsAsFactors = F)
It looks like this:
words Freq
1 keyboard 5
2 keybolrd 1
3 keybovrd 1
4 keybsard 1
5 keybxard 1
6 keywoard 1
7 laptnp 1
8 laptop 8
9 ltptop 1
10 moase 1
11 mobitor 1
12 monitod 1
13 monitor 7
14 motse 1
15 mous 1
16 mouse 3
17 mousq 1
18 moute 1
19 mouwe 1
20 mwuse 1
21 tonitor 1
Now I separate into 'good' and 'bad' words:
s <- split(freq_table, freq_table$Freq < 3)
good <- s[['FALSE']]
good <- good[order(-good$Freq),]$words
bad <- s[['TRUE']]$words
I have split the frequency table into entries that occur 3 or more times and entries that occur fewer than 3 times. I will explain later why I sorted the good ones. Now we have good:
[1] "laptop" "monitor" "keyboard" "mouse"
and bad:
[1] "keybolrd" "keybovrd" "keybsard" "keybxard" "keywoard" "laptnp" "ltptop" "moase"
[9] "mobitor" "monitod" "motse" "mous" "mousq" "moute" "mouwe" "mwuse"
[17] "tonitor"
Now I calculate the distance matrix between the good and the bad words:
dis <- adist(bad,good)
and check whether there are good words in the 'neighbourhood' of each bad word:
hits <- sapply(1:NROW(dis), function (i) which(dis[i,] < 3)[1])
I always take the first hit. Since we sorted the good words by frequency, the first hit is the most frequent word among the candidates. That way I try to avoid a word that is misspelt, but often in the same way, being used as the correction. This will not always work; it is a heuristic.
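To see why the ordering matters, here is a hypothetical distance row in which two good words fall below the threshold; which(...)[1] then picks the more frequent one, because good is sorted as "laptop" "monitor" "keyboard" "mouse":
# hypothetical row of dis: both monitor (col 2) and mouse (col 4) are within distance 3
row <- c(laptop = 5, monitor = 2, keyboard = 6, mouse = 2)
good[which(row < 3)[1]]
# [1] "monitor"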
Now I generate some sort of look-up table df:
bad_corr <- bad
ind <- which(!is.na(hits))
bad_corr[ind] <- good[hits[ind]]
df <- data.frame(bad, bad_corr, stringsAsFactors = F)
It looks like:
bad bad_corr
1 keybolrd keyboard
2 keybovrd keyboard
3 keybsard keyboard
4 keybxard keyboard
5 keywoard keyboard
6 laptnp laptop
7 ltptop laptop
8 moase mouse
9 mobitor monitor
10 monitod monitor
11 motse mouse
12 mous mouse
13 mousq mouse
14 moute mouse
15 mouwe mouse
16 mwuse mouse
17 tonitor monitor
I use this to replace the misspelt words. Summarized, the whole function is:
correct <- function (words, minfreq = 3, sensitivity = 3) {
  freq_table <- as.data.frame(table(words), stringsAsFactors = F)
  s <- split(freq_table, freq_table$Freq < minfreq)
  # guard: if all words are frequent, or none is, there is nothing to correct
  if (is.null(s[['TRUE']]) || is.null(s[['FALSE']])) return(words)
  good <- s[['FALSE']]
  good <- good[order(-good$Freq), ]$words
  bad <- s[['TRUE']]$words
  dis <- adist(bad, good)
  # first good word within 'sensitivity' edits (NA if none); most frequent first
  hits <- sapply(seq_len(NROW(dis)), function (i) which(dis[i, ] < sensitivity)[1])
  bad_corr <- bad
  ind <- which(!is.na(hits))
  bad_corr[ind] <- good[hits[ind]]
  df <- data.frame(bad, bad_corr, stringsAsFactors = F)
  # map every occurrence of a bad word to its correction
  ind <- match(words, df$bad)
  words[which(!is.na(ind))] <- df$bad_corr[ind[!is.na(ind)]]
  words
}
sensitivity says how 'far away' a misspelt word is allowed to be. minfreq means that every word occurring fewer than minfreq times is treated as possibly misspelt (but it is only replaced if there is a more frequent word within a string distance smaller than sensitivity). The guard at the top of the function catches the edge case where there are no rare words (or no frequent ones) at all, which would otherwise cause an error. The function is still a heuristic, so if you want to use it you should refine it further.
The result of all this:
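It can be reproduced by binding the original and the corrected vector (the column name correct.words. is auto-generated by data.frame):
data.frame(words, correct(words))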
words correct.words.
1 monitor monitor
2 laptop laptop
3 mouse mouse
4 mouse mouse
5 keybsard keyboard
6 monitor monitor
7 mous mouse
8 keyboard keyboard
9 keyboard keyboard
10 monitor monitor
11 keybxard keyboard
12 monitor monitor
13 motse mouse
.. ...... ........
Finally, if you had a taxonomy, you would omit the frequency part and store the words of the taxonomy in good.
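A minimal sketch of that variant (correct_with_taxonomy is a hypothetical name; the taxonomy vector is assumed to hold the canonical product names):
correct_with_taxonomy <- function (words, taxonomy, sensitivity = 3) {
  # every word is a candidate for correction; the taxonomy plays the role of 'good'
  dis <- adist(words, taxonomy)
  hits <- sapply(seq_len(NROW(dis)), function (i) which(dis[i, ] < sensitivity)[1])
  ind <- which(!is.na(hits))
  words[ind] <- taxonomy[hits[ind]]
  words
}
# e.g. correct_with_taxonomy(c("lptp", "bateries"), c("laptop", "batteries"))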
Good luck!