Search code examples
google-apps-scriptgoogle-sheetsxpathgoogle-sheets-formula

XPATH inside IMPORTXML: how to find text in all languages?


I'm trying to find all mentions of a Google Sheets' cell value on some website page. The cell value can be written in any language. The website page can also be in any language.

Now I have this formula:

=COUNTA(IFERROR(IMPORTXML(A2;"//*[contains(translate(text(),'ABCDEFGHJIKLMNOPQRSTUVWXYZАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ', 'abcdefghjiklmnopqrstuvwxyzабвгдеёжзиклмнопрстуфхцчшщэюя'),'"&LOWER(B2)&"')]")))

But it works incorrectly. It's not possible to find all words in Latin and Cyrillic, characters from Portuguese, German and other languages ​​are not taken into account. How to make the formula universal? Or how can I write the appropriate script for Google Sheets?

My formula works incorrectly. I want to find any text in any language on website page with Google Sheets' formula or script.


Solution

  • We could solve this more elegantly if Google Docs would support Xpath 2.0 and we could use the fn:lower-case(). In Xpath 1.0 we are stuck with translate() and need to provide the upper-case and lower-case translation ourselves. However, it is difficult to cover all possible (Unicode) characters in all languages this way.1 I have added the diacritics from the Latin alphabet to your formula, which should cover most Western languages.

    =COUNTA(IFERROR(IMPORTXML(A2;"//*[contains(translate(text(),
    'ABCDEFGHJIKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸŽŠŒАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ', 
    'abcdefghjiklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿžšœабвгдеёжзиклмнопрстуфхцчшщэюя'),'"&LOWER(B2)&"')]")))
    

    1: Furthermore, such a transformation does not work perfectly for all languages, e.g., the German 'ß' (lowercase) may be written as 'SS' when in uppercase which makes it difficult to find all matches.