Search code examples
regexpermutationmatchingwords

Match words of permutations of another word using regex


I have a chunk of words, all of which are valid English words I'm going to query with RegExp.

What I need is to match words which contains the letters of a specified word in any order.

Example (A segment):

...
peloton
pelt
pelta
peltae
peltast
....

I should be able to fill in a regex for "leap" and collect "pelta", "peltae" and "peltast" along with other words within the database. (Such as: "selfpreservatory")

What I have:

/^([chars]).*(?:\1|([chars])).*(?:\1|\2|([chars])).*{et cetera}.*(?:\1|\2|{et cetera}|\{n-1}|([chars]))(?{n})$/

(Fill in {et cetera} and {n}, {n-1} with respective to word length)

This is how it ('s supposed to) works:

You start with a pool of characters in your word, which hopefully does not have any repeating characters. (This group is [chars].) At first it matches the first character it sees that is in [chars]. Then when it looks for the next character in [chars], it either matches the first match, and captures nothing, or matches anything else in the pool, and captures that next character. Essentially, the second (?:) group removes the first match from the pool of characters. Once it captures n characters it checks to see if the nth character has actually matched. If it hasn't, then it doesn't match the word.

This iteration does not really work though. What is a correct attempt to this?

Note: I am not grepping, so I do need to use ^$. Instead of \b.

Thanks in advance!

Edit: I've tried this approach also. It's not working at all.

/^(([chars]).*(?!\1|\2)){n}$/

Solution

  • Using lookaheads, with "leap" as an example:

    \b(?=[a-z]*l)(?=[a-z]*e)(?=[a-z]*a)(?=[a-z]*p)[a-z]+\b
    

    Fiddle: http://refiddle.co/12u4

    EDIT: I added \b anchors (word boundaries); the leading one is especially important, otherwise "appeal" might be captured three times ("appeal", "ppeal", "peal"). Feel free to use other anchors when appropriate (e.g. ^...$).

    By the way, this approach is also suitable to match the same character more than once. Say you want to match all words containing the letters "pop" (i.e. at least two "p", and at least one "o").

    \b(?=[a-z]*p[a-z]*p)(?=[a-z]*o)[a-z]+\b
    

    Or with a quantifier:

    \b(?=([a-z]*p){2})(?=[a-z]*o)[a-z]+\b
    

    Both will match "pop", "pope", "oppress", but not "poke".