I'm trying to write a GtkSourceView language file to highlight some of my files in gedit. The problem I'm encountering is that I want to highlight words that consist of at least the first four characters of a keyword and are otherwise correctly spelled. To illustrate, say I have four words:
variable
vari
variab
variabel
and I want to match the first three, but not the fourth, because the first three are all correctly spelled prefixes of the target "variable". What gets the job done is using
\bvari(a|ab|abl|able)?\b
but this can become quite tedious with longer words. So in a full lang-file it would look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<language id="foo" _name="foo" version="2.0" _section="Other">
<metadata>
<property name="mimetypes">text/x-foo</property>
<property name="globs">*.foo</property>
</metadata>
<styles>
<style id="keyword" _name="Keyword" map-to="def:keyword"/>
</styles>
<default-regex-options case-sensitive="false"/>
<definitions>
<context id="foo">
<include>
<context id="keyword" style-ref="keyword">
<keyword>\bvari(a|ab|abl|able)?\b</keyword>
</context>
</include>
</context>
</definitions>
</language>
I was not able to find a solution to this, probably because I'm very unfamiliar with regex and do not know the correct terms to search for. Is there a simple and efficient solution to this problem?
Unfortunately, there isn't really a less tedious way to do it.
About your pattern: note that GtkSourceView uses the PCRE regex engine, which is a backtracking (NFA) regex engine. So when you write an alternation, the first alternative (from left to right) that matches succeeds, and the regex engine does not test the alternatives further to the right unless a later part of the pattern fails. For example, for the string abcdef the pattern (a|ab|abc|abcde|abcdef) will return a (whereas a DFA engine would return the longest alternative that matches, so abcdef).
This means that your pattern works only because there is a word boundary at the end: for the whole word variable, each alternative succeeds in turn, but once the word boundary is reached and fails, the regex engine must backtrack and test the next alternative, and so on until the last one.
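The first-match behaviour of an alternation can be checked with a short sketch. It uses Python's re module, which is also a backtracking engine but is not PCRE itself, so take it as an approximation of what GtkSourceView does rather than a guarantee. The variable names are only illustrative.

```python
import re

# A backtracking engine stops at the first alternative that lets the
# whole pattern match, not at the longest one.
first = re.match(r"(a|ab|abc|abcde|abcdef)", "abcdef").group()
print(first)  # a

# With a trailing \b, the shorter alternatives fail at the boundary and
# the engine backtracks into the next one, so both orderings accept the
# same words; they differ only in how much work is done.
short_first = re.compile(r"\bvari(a|ab|abl|able)?\b")
long_first = re.compile(r"\bvari(able|abl|ab|a)?\b")

words = ["variable", "vari", "variab", "variabel"]
for w in words:
    print(w, bool(short_first.search(w)), bool(long_first.search(w)))
```

Both patterns accept variable, vari and variab and reject variabel.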
In conclusion, it's better to write your alternation from the longest alternative to the shortest, to spare the engine unnecessary work:
\bvari(able|abl|ab|a)?\b
Another possibility is to design your pattern like this:
\bvari(a(b(le?)?)?)?\b
In this case the regex engine goes straight to the end of the pattern without having to search for the right alternative. Note that this isn't simpler to write, only a little shorter, since you don't have to repeat letters.
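If the tedium of writing such patterns by hand is the main concern, the nested form can be generated rather than typed. Below is a minimal sketch in Python; prefix_pattern is a hypothetical helper, not part of GtkSourceView or any library, and it uses non-capturing groups (?:...) instead of plain parentheses, which PCRE should accept as well. The resulting string would then be pasted into the keyword element of the lang file.

```python
import re


def prefix_pattern(word: str, min_len: int) -> str:
    """Build a regex matching any prefix of `word` with at least `min_len` characters.

    Hypothetical helper for illustration only.
    """
    if not 0 < min_len <= len(word):
        raise ValueError("min_len must be between 1 and len(word)")
    head, tail = word[:min_len], word[min_len:]
    nested = ""
    # Wrap from the last letter inward: (?:e)? then (?:l(?:e)?)? and so on.
    for ch in reversed(tail):
        nested = f"(?:{re.escape(ch)}{nested})?"
    return rf"\b{re.escape(head)}{nested}\b"


pattern = prefix_pattern("variable", 4)
print(pattern)  # \bvari(?:a(?:b(?:l(?:e)?)?)?)?\b

# Quick check against the four example words, ignoring case as the lang file does.
regex = re.compile(pattern, re.IGNORECASE)
for w in ["variable", "vari", "variab", "variabel"]:
    print(w, bool(regex.search(w)))
```

The first three words match and variabel does not, which is the same behaviour as the hand-written nested pattern.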