javascript regex comparison locale arabic

How to find and remove a starting string from an Arabic string with diacritics while keeping the original diacritics of the remaining string


The aim is to find and remove a starting string (characters or a word) from an Arabic string that may or may not have diacritics, while keeping any and all diacritics of the remaining string.

There are many answers on StackOverflow for removing the starting characters from an English string, but I found no existing solution there that keeps the remainder of the Arabic string in its original form.

If the original string is normalized (by removing the diacritics, tanween, etc.) before processing, then the string returned will be the remainder of the normalized string, not the remainder of the original string.
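
To make the normalization problem concrete, here is a minimal sketch. The helper `stripDiacritics` is hypothetical, and it assumes that "diacritics" means the range U+064B to U+0652 (tanween, harakat, shadda, sukun).

```javascript
// Hypothetical helper: removes Arabic diacritics in the assumed range U+064B to U+0652.
const stripDiacritics = (text) => text.replace(/[\u064B-\u0652]/g, '');

const original = 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله';
const word = 'السلام';

// Normalize first, then slice: the slice is taken from the normalized copy.
const normalized = stripDiacritics(original);
const remainder = normalized.startsWith(word)
  ? normalized.slice(word.length)
  : normalized;

console.log(remainder);                          // the remainder, but without any diacritics
console.log(/[\u064B-\u0652]/.test(remainder));  // false
console.log(original.endsWith(remainder));       // false: it is not a piece of the original
```

The word is found, but the diacritics of the remaining text are gone, which is exactly what has to be avoided.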

Example: assume the following original string, which can be in any of the following forms (i.e. the same string with different diacritics):

1. "السلام عليكم ورحمة الله"

2. "السَلام عليكمُ ورحمةُ الله"

3. "السَلامُ عَليكمُ ورَحمةُ الله"

4. "السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله"

Now let us say we want to remove the starting characters "السلام" only if the string starts with those characters (which it does), and return the remainder of the "original" string with its original diacritics.

Of course, we are looking for the characters "السلام" without diacritics because we don't know how the original string is formatted with diacritics.

So, in this case, the remainder returned for each string must be:

1. " عليكم ورحمة الله"

2. " عليكمُ ورحمةُ الله"

3. " عَليكمُ ورَحمةُ الله"

4. " عَلَيْكُمُ وَرَحْمَةُ الله"

The following code works for an English string (there are many other solutions) but not for an Arabic string as explained above.

function removeStartWord(string, word) {
  if (string.startsWith(word)) string = string.slice(word.length);
  return string;
}

The above code slices the matched starting characters off the original string based on their length, which works fine for English text.

For an Arabic string, we don't know which diacritics the original string carries, so the number of characters that the word occupies in the original string differs from the length of the word we are looking for and is not known in advance.
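
A short sketch of that length mismatch, using the four forms from the example above (the variable names are only illustrative):

```javascript
const word = 'السلام';
const forms = [
  'السلام عليكم ورحمة الله',
  'السَلام عليكمُ ورحمةُ الله',
  'السَلامُ عَليكمُ ورَحمةُ الله',
  'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله',
];

// Number of UTF-16 code units that the first word occupies in each form.
const firstWordLengths = forms.map((form) => form.split(' ')[0].length);

// A plain startsWith() comparison against the word without diacritics.
const startsWithBareWord = forms.map((form) => form.startsWith(word));

console.log(word.length);         // 6
console.log(firstWordLengths);    // only the first entry equals word.length
console.log(startsWithBareWord);  // [ true, false, false, false ]
```

Each diacritic is a separate character in the string, so both the `startsWith` test and the `slice(word.length)` offset break as soon as the original carries any marks inside the first word.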

Edit: added an example image for clarification.

The following image table provides further examples:

[image: table of further examples]


Solution

  • To keep track of the discussion, I'm adding a new answer, try this please!

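    function removeStartWord(string, word) {
      // Keep only Latin letters, Arabic letters (ء-ي), digits and "/";
      // this drops the diacritics, and also any whitespace.
      const alphabeticString = string.replace(/[^a-zA-Zء-ي0-9/]+/g, '');
      if (!alphabeticString.startsWith(word)) return string;
      // Letters of the word that still have to be removed from the string.
      const letters = [...word];
      let cleanString = '';
      string.split('').forEach((_letter) => {
        const index = letters.indexOf(_letter);
        if (index > -1) {
          // Consume the matched letter so that it is removed only once.
          delete letters[index];
        } else {
          cleanString += _letter;
        }
      });
      // Drop the leading diacritics left behind by the removed letters.
      return cleanString.replace(/[^a-zA-Zء-ي0-9/\s]*/, '');
    }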
    
    const sampleData = `السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;
    
    console.log('sampleData ...', sampleData);
    console.log(
      "removeStartWord(sampleData, 'السلام') ...",
      removeStartWord(sampleData, 'السلام')
    );
    console.log(
      "removeStartWord(sampleData, 'الس') ...",
      removeStartWord(sampleData, 'الس')
    );
    console.log(
      "removeStartWord(sampleData, 'السلام ') ...",
      removeStartWord(sampleData, 'السلام ')
    );
    console.log(
      "removeStartWord(sampleData, ' السلام') ...",
      removeStartWord(sampleData, ' السلام')
    );
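
As a possible alternative, and not the approach of the answer above, the same goal can be sketched with a regular expression built from the search word that tolerates optional diacritics after each letter. The names `stripLeadingWord` and `escapeRegExp` are hypothetical, and the diacritic range U+064B to U+0652 is an assumption; marks outside that range (for example the superscript alef, U+0670) would have to be added to it.

```javascript
// Assumption: "diacritics" are the marks in the range U+064B to U+0652.
const DIACRITICS = '[\\u064B-\\u0652]*';

// Escape regex metacharacters so that each character of the word is matched literally.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

function stripLeadingWord(string, word) {
  // Search with the word stripped of its own diacritics, if it has any.
  const bare = word.replace(/[\u064B-\u0652]/g, '');
  // Anchor at the start and allow any run of diacritics after every character.
  const pattern = new RegExp(
    '^' + [...bare].map((ch) => escapeRegExp(ch) + DIACRITICS).join('')
  );
  // If the pattern does not match, replace() returns the string unchanged.
  return string.replace(pattern, '');
}

const sample = 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله';
console.log(stripLeadingWord(sample, 'السلام'));   // remainder, diacritics intact
console.log(stripLeadingWord(sample, 'الس'));      // also strips a partial prefix
console.log(stripLeadingWord(sample, ' السلام'));  // no match: string unchanged
```

Because the replacement runs on the original string, the remainder keeps every diacritic it had. Like `startsWith`, this sketch also matches a prefix of a longer word; requiring a word boundary would need an extra check.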