Text matching not working for Arabic; issue may be due to the regex


I am adding a feature to my multilingual website that highlights keywords in the article text when they match the article's tags.

It works for the English version but does not fire for the Arabic version.

I have set up a sample on JSFiddle.

Sample Code

    function HighlightKeywords(keywords)
    {        
        var el = $("#article-detail-desc");
        var language = "ar-AE";
        var pid = 32;
        var issueID = 18; 
        $(keywords).each(function()
        {
           // var pattern = new RegExp("("+this+")", "gi"); //breaks html
            var pattern = new RegExp("(\\b"+this+"\\b)(?![^<]*?>)", "gi"); //looks for match outside html tags
            var rs = "<a class='ad-keyword-selected' href='http://www.alshindagah.com/ar/search.aspx?Language="+language+"&PageId="+pid+"&issue="+issueID+"&search=$1' title='Search website for: $1'><span style='color:#990044; text-decoration:none;'>$1</span></a>";
            el.html(el.html().replace(pattern, rs));
        });
    }   

    HighlightKeywords(["you","الهدف","طهران","سيما","حاليا","Hello","34","english"]);

    //Popup Tooltip for article keywords
     $(function() {
        $("#article-detail-desc").tooltip({
        position: {
            my: "center bottom-20",
            at: "center top",
            using: function( position, feedback ) {
            $( this ).css( position );
            $( "<div>" )
            .addClass( "arrow" )
            .addClass( feedback.vertical )
            .addClass( feedback.horizontal )
            .appendTo( this );
        }
        }
        });
    });

I store the keywords in an array and then match them against the text in a particular div.

I am not sure whether the problem is due to Unicode or something else. Any help is appreciated.


Solution

  • There are three sections to this answer

    1. Why it's not working

    2. An example of how you could approach it in English (meant to be adapted to Arabic by someone with a clue about Arabic)

    3. A stab at doing the Arabic version by someone (me) who hasn't a clue about Arabic :-)

    Why it's not working

    At least part of the problem is that you're relying on the \b assertion, which (like its counterparts \B, \w, and \W) is English-centric. You can't rely on it in other languages (or even, really, in English — see below).

    Here's the definition of \b in the spec:

    The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

    • Let e be x's endIndex.
    • Call IsWordChar(e–1) and let a be the Boolean result.
    • Call IsWordChar(e) and let b be the Boolean result.
    • If a is true and b is false, return true.
    • If a is false and b is true, return true.
    • Return false.

    ...where IsWordChar is defined further down as basically meaning one of these 63 characters:

    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
    A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
    0  1  2  3  4  5  6  7  8  9  _    

    I.e., the 26 English letters a to z in upper or lower case, the digits 0 to 9, and _. (This means you can't even rely on \b, \B, \w, or \W in English, because English has loan words like "Voilà", but that's another story.)
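
    As a rough illustration of the steps above, here is a minimal sketch that re-implements the \b test by hand. The function names isWordChar and isWordBoundary and the two sample strings are my own, not part of any API. It shows why \b never fires around a purely Arabic word: the characters on both sides of the would-be boundary are "non-word" characters, so the assertion returns false.

```javascript
// Hand-rolled version of the spec's IsWordChar: true only for
// a-z, A-Z, 0-9 and _ (positions outside the string count as non-word).
function isWordChar(str, index) {
    if (index < 0 || index >= str.length) {
        return false;
    }
    return /[A-Za-z0-9_]/.test(str.charAt(index));
}

// Hand-rolled version of the \b assertion at position e:
// true when exactly one of the two neighbours is a word character.
function isWordBoundary(str, e) {
    var a = isWordChar(str, e - 1);
    var b = isWordChar(str, e);
    return a !== b;
}

var english = "say Hello there";
var arabic = "في طهران اليوم";

// Position 4 is just before the "H" of "Hello": space vs. letter.
console.log(isWordBoundary(english, 4));   // true

// Position 3 is just before the first letter of the second Arabic word:
// a space and an Arabic letter are both "non-word" characters.
console.log(isWordBoundary(arabic, 3));    // false

// The built-in assertion agrees.
console.log(/\bHello\b/.test(english));    // true
console.log(/\bطهران\b/.test(arabic));     // false
```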

    A first example using English

    You'll have to use a different mechanism for detecting word boundaries in Arabic. If you can come up with a character class that includes all of the Arabic "code points" (as Unicode puts it) that make up words, you could use code a bit like this:

    var keywords = {
        "laboris": true,
        "laborum": true,
        "pariatur": true
        // ...and so on...
    };
    var text = /*... get the text to work on... */;
    text = text.replace(
        /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
        replacer);
    
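    function replacer(m, c0, c1) {
        // Own-property check, so words like "constructor" aren't treated as keywords
        if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
            c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when the text ends with a word
        return c0 + (c1 || "");
    }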
    

    Notes on that:

    • I've used the class [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "a word character". Obviously you'd have to change this (markedly) for Arabic.
    • I've used the class [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "not a word character". This is just the same as the previous class with the negation (^) at the outset.
    • The regular expression finds any series of "word characters" followed by an optional series of non-word characters, using capture groups ((...)) for both.
    • String#replace calls the replacer function with the full text that matched followed by each capture group as arguments.
    • The replacer function looks up the first capture group (the word) in the keywords map to see if it's a keyword. If so, it wraps it in an anchor.
    • The replacer function returns that possibly-wrapped word plus the non-word text that followed it.
    • String#replace uses the return value from replacer to replace the matched text.
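
    To make that call sequence concrete, here is a small sketch that runs the same kind of expression outside the browser and records what replacer receives. The sample sentence and the calls array are my own additions for illustration, and the character class is written with ranges for brevity. It also guards against the second capture group being undefined, which happens when the text ends with a word.

```javascript
var keywords = { "laboris": true, "laborum": true };
var calls = [];  // records the first three arguments of each replacer call

function replacer(m, c0, c1) {
    calls.push([m, c0, c1]);
    if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
        c0 = '<a href="#">' + c0 + '</a>';
    }
    // c1 is undefined when no non-word text followed the word
    return c0 + (c1 || "");
}

var text = "ullamco laboris, est laborum";
var result = text.replace(/([A-Za-z0-9_]+)([^A-Za-z0-9_]+)?/g, replacer);

console.log(calls[1]);  // ["laboris, ", "laboris", ", "]
console.log(calls[3]);  // ["laborum", "laborum", undefined]
console.log(result);
// ullamco <a href="#">laboris</a>, est <a href="#">laborum</a>
```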

    Here's a full example of doing that:

    <!DOCTYPE html>
    <html>
    <head>
    <meta charset=utf-8 />
    <title>Replacing Keywords</title>
    </head>
    <body>
      <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
      
      <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
      <script>
        (function() {
          // Our keywords. There are lots of ways you can produce
          // this map, here I've just done it literally
          var keywords = {
            "laboris": true,
            "laborum": true,
            "pariatur": true
          };
          
          // Loop through all our paragraphs (okay, so we only have one)
          $("p").each(function() {
            var $this, text;
            
            // We'll use jQuery on `this` more than once,
            // so grab the wrapper
            $this = $(this);
            
            // Get the text of the paragraph
            // Note that this strips off HTML tags, a
            // real-world solution might need to loop
            // through the text nodes rather than act
            // on the full text all at once
            text = $this.text();
    
            // Do the replacements
            // These character classes match JavaScript's
            // definition of a "word" character and so are
            // English-centric, obviously you'd change that
            text = text.replace(
              /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
              replacer);
            
            // Update the paragraph
            $this.html(text);
          });
    
          // Our replacer. We define it separately rather than
          // inline because we use it more than once      
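          function replacer(m, c0, c1) {
            // Is the word in our keywords map? (Own-property check, so
            // words like "constructor" aren't treated as keywords)
            if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
              // Yes, wrap it
              c0 = '<a href="#">' + c0 + '</a>';
            }
            // c1 is undefined when the text ends with a word
            return c0 + (c1 || "");
          }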
        })();
      </script>
    </body>
    </html>
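
    One wrinkle in the example above: the paragraph is read with .text() and written back with .html(), so any < or & in the plain text would be re-interpreted as markup. Here is a minimal sketch of one way around that, assuming it is acceptable to escape everything that isn't one of our own anchors. The names escapeHtml and highlight are hypothetical, and the expression is changed to match either a word or a run of non-word characters so that no part of the text is skipped.

```javascript
// Hypothetical helper: escape the characters that are special in HTML.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;")
            .replace(/"/g, "&quot;");
}

var keywords = { "laboris": true };

function highlight(text) {
    // Group 1 is set for a word; for a run of non-word characters
    // it is undefined and only the full match m is available.
    return text.replace(
        /([A-Za-z0-9_]+)|[^A-Za-z0-9_]+/g,
        function (m, word) {
            if (word === undefined) {
                return escapeHtml(m);
            }
            if (Object.prototype.hasOwnProperty.call(keywords, word)) {
                return '<a href="#">' + word + '</a>';
            }
            return word;
        });
}

console.log(highlight("<i> a & laboris"));
// &lt;i&gt; a &amp; <a href="#">laboris</a>
```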
    

    A stab at doing it with Arabic

    I took a stab at the Arabic version. According to the Arabic script in Unicode page on Wikipedia, there are several code ranges used, but all of the text in your example fell into the primary range of U+0600 to U+06FF.
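
    One way to check a claim like that against your own keywords is to list any characters that fall outside the range. A minimal sketch; the helper name outsideArabicBlock is hypothetical, and it looks at UTF-16 code units, which is enough for these ranges:

```javascript
// Returns the characters of `word` that are NOT in U+0600..U+06FF,
// formatted as "U+XXXX". An empty array means the simple range suffices.
function outsideArabicBlock(word) {
    var result = [];
    for (var i = 0; i < word.length; i++) {
        var code = word.charCodeAt(i);
        if (code < 0x0600 || code > 0x06ff) {
            var hex = ("0000" + code.toString(16).toUpperCase()).slice(-4);
            result.push("U+" + hex);
        }
    }
    return result;
}

var words = ["الهدف", "طهران", "سيما", "حاليا"];
for (var j = 0; j < words.length; j++) {
    console.log(outsideArabicBlock(words[j]).length);  // 0 for each keyword
}

// A presentation-form ligature (lam with alef, isolated form) lives in
// another block, so it would need an extra range in the character class.
console.log(outsideArabicBlock("\uFEFB"));  // ["U+FEFB"]
```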

    Here's what I came up with: Fiddle (I prefer JSBin, what I used above, but I couldn't get the text to come out the right way around.)

    (function() {
        // Our keywords. There are lots of ways you can produce
        // this map, here I've just done it literally
        var keywords = {
            "الهدف": true,
            "طهران": true,
            "سيما": true,
            "حاليا": true
        };
        
        // Loop through all our paragraphs (okay, so we only have two)
        $("p").each(function() {
            var $this, text;
            
            // We'll use jQuery on `this` more than once,
            // so grab the wrapper
            $this = $(this);
            
            // Get the text of the paragraph
            // Note that this strips off HTML tags, a
            // real-world solution might need to loop
            // through the text nodes rather than act
            // on the full text all at once
            text = $this.text();
            
            // Do the replacements
            // These character classes just use the primary
            // Arabic range of U+0600 to U+06FF, you may
            // need to add others.
            text = text.replace(
                /([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/g,
                replacer);
            
            // Update the paragraph
            $this.html(text);
        });
        
        // Our replacer. We define it separately rather than
        // inline because we use it more than once      
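        function replacer(m, c0, c1) {
            // Is the word in our keywords map? (Own-property check, so
            // inherited names like "constructor" never count as keywords)
            if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
                // Yes, wrap it
                c0 = '<a href="#">' + c0 + '</a>';
            }
            // c1 is undefined when the text ends with a word
            return c0 + (c1 || "");
        }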
    })();
    

    All I did to my English function above was:

    • Use [\u0600-\u06ff] to be "a word character" and [^\u0600-\u06ff] to be "not a word character". You may need to add some of the other Arabic ranges (such as the appropriate style of numerals), but again, all of the text in your example fell into that range.
    • Change the keywords to be four of yours from your example (only two of which seem to be in the text).

    To my very non-Arabic-reading eyes, it seems to work.
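
    If the code only needs to run in engines that support Unicode property escapes (ES2018 and later), one alternative to hand-picked ranges is to treat a "word" as a run of letters, combining marks, digits, and underscores. This is a sketch under that assumption, not something taken from the fiddle; the name highlightKeywords and the sample strings are made up. One difference from [\u0600-\u06ff] is that this class does not count the Arabic comma (U+060C) as part of a word, and the same expression also copes with English text.

```javascript
var keywords = { "الهدف": true, "طهران": true, "Hello": true };

// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
// The "u" flag is required for \p{...} to be recognised.
// Group 1 is set for a word and undefined for a run of other characters.
var wordOrGap = /([\p{L}\p{M}\p{N}_]+)|[^\p{L}\p{M}\p{N}_]+/gu;

function highlightKeywords(text) {
    return text.replace(wordOrGap, function (m, word) {
        if (word !== undefined &&
            Object.prototype.hasOwnProperty.call(keywords, word)) {
            return '<a href="#">' + word + '</a>';
        }
        return m;
    });
}

// A keyword immediately followed by an Arabic comma (U+060C).
var sample = "طهران\u060C";
var rangeMatch = /[\u0600-\u06ff]+/.exec(sample)[0];
var propertyMatch = /[\p{L}\p{M}\p{N}_]+/u.exec(sample)[0];

console.log(rangeMatch.length);     // 6: the comma is swallowed into the "word"
console.log(propertyMatch.length);  // 5: just the five letters

console.log(highlightKeywords("Hello " + sample + " الهدف"));
```

    With the plain range, the comma becomes part of the matched word, so the lookup in keywords would miss it; with the property-based class the word is matched on its own.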