pythonstringalgorithmpattern-matching# Python: Rabin-Karp algorithm hashing

I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode:

```
RABIN -KARP -MATCHER (T, P, d, q)
1 n = T.length
2 m = P.length
3 h = d^(m-1) mod q
4 p=0
5 t= 0
6 for i = 1 to m
/ preprocessing
/
7 p = (dp + P [i]) mod q
8 t = (dt + T [i]) mod q
9 for s = 0 to n-m
/ matching
/
10 if p == t
11 if P [1... m] == T [s + 1...s + m]
12 print “Pattern occurs with shift” s
13 if s < n-m
14 t = (d(t-T[s + 1]h) + T [s + m + 1]) mod q
```

I implemented in Python 2.7 like so:

```
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t =0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m):
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
return result
```

where result is a list containing the index in text of pattern.

My code is failing to find the 26 in 3141592653589793 and I suspect it has something to do with my hashcode defined by line 14 of the pseudocode. Can anyone please help with this.

I passed in the following paramters:

P = "26" T = "3141592653589793" d = 257 q = 11

P and T must be strings/arrays of chars

Solution

Here is a working version of your code:

```
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t = 0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m+1): # note the +1
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (t-h*ord(text[s]))%q # remove letter s
t = (t*d+ord(text[s+m]))%q # add letter s+m
t = (t+q)%q # make sure that t >= 0
return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))
```

The output is:

```
[6]
[0, 1, 2, 3]
```

On the first step, you check whether `text[0..m] == pattern`

. On the second step, you want to check whether `text[1..m+1] == pattern`

. Thus you remove `text[0]`

from the hash (at the moment it is multiplied by your precomputed `h`

): `t = (t-h*ord(text[s]))%q`

. And then, add `text[m]`

to it: `t = (t*d+ord(text[s+m]))%q`

. On the next step, you remove `text[1]`

and add `text[m+1]`

, and so on. The `t = (t+q)%q`

line is there because a negative number modulo `q`

yields remainder in the range `(-q; 0]`

, and we want it to be in the range `[0; q)`

.

Note that you want to check a total of `n-m+1`

substrings, not `n-m`

, hence the correct loop is `for s in range(n-m+1)`

. Checked by the second example (finding "xx" in "xxxxx").

Also worth noting:

The line

`h = pow(d,m-1)%q`

may be too slow if`m`

is large. It is better to take the result modulo`q`

after each of the`m-2`

multiplications. Or just use the built-in way:`h = pow(d,m-1,q)`

, as suggested by @oBrstisf8o in the comments.This algorithm is still O(nm) in the worst case. With

`text="a"*100000`

and`pattern="a"*50000`

, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number`q`

may help reduce the probability of a false positive to an acceptable level.

- Python Jinja2 LaTeX Table
- Getting attributes of a class
- How can I print many significant figures in Python?
- How to allow list append() method to return the new list
- Calculate Last Friday of Month in Pandas
- Python type hint for Iterable[str] that isn't str
- How to iterate over a list in chunks
- How to exit the entire application from a Python thread?
- Running shell command and capturing the output
- How do I pass a variable by reference?
- Convert range(r) to list of strings of length 2 in python
- How can I get the start and end dates for each week?
- how to use send_message() in python-telegram-bot
- Python conditional replacement based on element type
- How can I count the number of items in an arbitrary iterable (such as a generator)?
- Find longest consecutive range of numbers in list
- Insert text in braces with asyncpg
- How does one put a link / url to the web-site's home page in Django?
- How to determine if a path is a subdirectory of another?
- Custom Keybindings for Ipython terminal
- FastAPI asynchronous background tasks blocks other requests?
- How to make sure that information from one file is duplicated into several text documents, without specific lines
- Installing a Python environment with Anaconda
- sklearn pipeline model predicting same results for all input
- Brew command not found after installing Anaconda Python
- How to get an XPath from selenium webelement or from lxml?
- Pipe PuTTY console to Python script
- How to align the axes of a figure in matplotlib?
- Persist ParentDocumentRetriever of langchain
- How to reset index in a pandas dataframe?