Search code examples
regexpython-3.xoptimizationstring-matchingmicro-optimization

Optimization of list comprehension with string matching


I have a function which looks like this:

def my_fun:
    return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]

It just returns a list of integers, and yes I have tried a generator and it is super fast but unfortunately it has to be a list. Pattern may look like this word1|word2|word3... - many words :) It is an effect of my optimization work but it is too slow still. What can I do to tune up the performance? Should I consider using Cython?

Example string: https://www.codepile.net/pile/bD9R0qZR

Example pattern: https://www.codepile.net/pile/eRPK025Y


Solution

  • As I am quite a rust fanboy, I implemented your problem in rust, once with a simple loop and the built-in match_indices function, and once using the aho-corasick crate. Theoretically, aho-corasick should scale best, in case your real usecase is a lot larger than the example.

    Then, after realizing how good the aho-corasick version is, I searched for a direct aho-corasick library for python, and found pyahocorasick.

    Python version:

    def my_fun(string, some_pattern):
        return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]
    

    Python version with pyahocorasick:

    def my_fun(string, raw_pattern):
        A = ahocorasick.Automaton()
        for pattern in raw_pattern.split('|'):
            A.add_word(pattern, len(pattern))
        A.make_automaton()
    
        result = []
        for end_index, original_len in A.iter(string):
            result.append(end_index - original_len + 1)
        return result
    

    Rust version, with match_indices:

    use pyo3::prelude::*;
    
    #[pymodule]
    fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {
    
        #[pyfn(m, "my_fun")]
        fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
            let mut results = vec![];
    
            for pattern in raw_pattern.split('|'){
                results.extend(string.match_indices(pattern).map(|(pos, _content)| pos));
            }
            results.sort();
            results.dedup();
    
            Ok(results)
        }
    
        Ok(())
    }
    
    

    Rust version, with aho-corasick 0.7.4:

    use pyo3::prelude::*;
    
    use aho_corasick::AhoCorasick;
    
    #[pymodule]
    fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {
    
        #[pyfn(m, "my_fun")]
        fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
            let ac = AhoCorasick::new(raw_pattern.split('|'));
    
            Ok(ac.find_iter(string)
                 .map(|mat| mat.start())
                 .collect())
        }
    
        Ok(())
    }
    

    Results:

    Python:     1.484 ms
    Python AC:  0.214 ms
    Rust:       0.506 ms
    Rust AC:    0.199 ms
    

    Results for string * 10:

    Python:    14.834 ms
    Python AC:  0.671 ms
    Rust:       2.936 ms
    Rust AC:    0.360 ms
    

    Results for string * 100:

    Python:    146.940 ms
    Python AC:   5.071 ms
    Rust:       20.243 ms
    Rust AC:     1.758 ms
    

    All tests were measured with timeit for 1000 iterations and displayed in time per iteration.

    The rust versions were implemented based on my rust submodule template.


    For sake of correctness, those are the outputs of the algorithms:

    Python:    [500, 522, 864, 877, 1406, 1727, 1955]
    Python AC: [500, 522, 864, 877, 1406, 1727, 1955]
    Rust:      [500, 522, 864, 877, 1406, 1727, 1955]
    Rust AC:   [500, 522, 864, 877, 1406, 1727, 1955]