I have a function which looks like this:
def my_fun:
return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]
It just returns a list of integers, and yes I have tried a generator and it is super fast but unfortunately it has to be a list.
Pattern may look like this word1|word2|word3...
- many words :)
It is an effect of my optimization work but it is too slow still. What can I do to tune up the performance? Should I consider using Cython?
Example string: https://www.codepile.net/pile/bD9R0qZR
Example pattern: https://www.codepile.net/pile/eRPK025Y
As I am quite a rust fanboy, I implemented your problem in rust, once with a simple loop and the built-in match_indices
function, and once using the aho-corasick crate. Theoretically, aho-corasick should scale best, in case your real usecase is a lot larger than the example.
Then, after realizing how good the aho-corasick version is, I searched for a direct aho-corasick library for python, and found pyahocorasick.
Python version:
def my_fun(string, some_pattern):
return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]
Python version with pyahocorasick:
def my_fun(string, raw_pattern):
A = ahocorasick.Automaton()
for pattern in raw_pattern.split('|'):
A.add_word(pattern, len(pattern))
A.make_automaton()
result = []
for end_index, original_len in A.iter(string):
result.append(end_index - original_len + 1)
return result
Rust version, with match_indices
:
use pyo3::prelude::*;
#[pymodule]
fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {
#[pyfn(m, "my_fun")]
fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
let mut results = vec![];
for pattern in raw_pattern.split('|'){
results.extend(string.match_indices(pattern).map(|(pos, _content)| pos));
}
results.sort();
results.dedup();
Ok(results)
}
Ok(())
}
Rust version, with aho-corasick 0.7.4:
use pyo3::prelude::*;
use aho_corasick::AhoCorasick;
#[pymodule]
fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {
#[pyfn(m, "my_fun")]
fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
let ac = AhoCorasick::new(raw_pattern.split('|'));
Ok(ac.find_iter(string)
.map(|mat| mat.start())
.collect())
}
Ok(())
}
Results:
Python: 1.484 ms
Python AC: 0.214 ms
Rust: 0.506 ms
Rust AC: 0.199 ms
Results for string * 10
:
Python: 14.834 ms
Python AC: 0.671 ms
Rust: 2.936 ms
Rust AC: 0.360 ms
Results for string * 100
:
Python: 146.940 ms
Python AC: 5.071 ms
Rust: 20.243 ms
Rust AC: 1.758 ms
All tests were measured with timeit
for 1000 iterations and displayed in time per iteration.
The rust versions were implemented based on my rust submodule template.
For sake of correctness, those are the outputs of the algorithms:
Python: [500, 522, 864, 877, 1406, 1727, 1955]
Python AC: [500, 522, 864, 877, 1406, 1727, 1955]
Rust: [500, 522, 864, 877, 1406, 1727, 1955]
Rust AC: [500, 522, 864, 877, 1406, 1727, 1955]