Search code examples
mysqlsqlstringtexttrim

How to crop text between braces


I have data in MySQL in one string field with below structure:

{language *lang_code*}text{language}{language *lang_code*}text{language}

And here is example:

{language en}text in english{language}{language de}text in german{language}

The ideal output for this would be in this case

text in english

So we want to disregard the other languages, just want to extract the first one, and put it into new column, because it's often the title of the product, with translations, and for us the first one is the most important.

The values in first braces may be different, so for example here the first one is english, but in other example it might be in german, so the lang code might also be dynamic.

I am wondering if it's possible to extract the text value between two first braces through SQL query?


Solution

  • This is really horrible but it works for your simple example -

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language en\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
    

    or

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language de\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
    

    to retrieve the german text.

    To retrieve the first text in the string regardless of language you can use -

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language [a-z]{2}\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
    

    Note this version assumes the language code is always 2 x a-z chars - [a-z]{2}

    Here is an example of the above wrapped in a stored function -

    DELIMITER $$
    
    CREATE FUNCTION `ExtractLangString`(content TEXT, lang CHAR(8))
    RETURNS text
    DETERMINISTIC
    BEGIN
        -- if lang is not 2 chars in length or lang not found return first language string
        IF LENGTH(lang) <> 2 OR content NOT LIKE CONCAT('%{language ', lang, '}%') THEN
            SET lang = '[a-z]{2}';
        END IF;
        RETURN SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR(content, CONCAT('\\{language ', lang, '\\}(.*?)\\{language\\}')), '}', -2), '{', 1);
    END$$
    
    DELIMITER ;
    

    There is probably a cleaner way of doing it but I cannot think of it right now.

    Obviously, the better solution would be to normalise the data that is currently serialised into this column.