javascript, reactjs, regex, typescript

How to extract code block substring from string


This is my first post here so hopefully I'm not posting incorrectly, but I'm basically looking to extract the code block (substring wrapped in backticks) from a string that was returned from an API request.

Here is the string:

const originalString = 'this is some string and I want to extract ```the code contained in the backticks```';

const whatIWant = 'the code contained in the backticks';

I was thinking maybe I could use regex, but I wasn't able to come up with anything that would work.

I also tried using something like this:

originalString.substring(originalString.indexOf("```"), originalString.lastIndexOf("```") + 3);

but it produces undesired results when the text contains a mix of normal text and multiple code blocks.
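
For example, with two code blocks in the input (a made-up string, purely for illustration), the call returns everything between the first and the last fence, prose included:

```javascript
// Hypothetical input with two code blocks, made up for this illustration.
const text = 'first ```a = 1``` then some prose and ```b = 2``` at the end';

// indexOf finds the FIRST fence and lastIndexOf finds the LAST fence,
// so everything in between (including the prose) comes back as one chunk.
const naive = text.substring(text.indexOf('```'), text.lastIndexOf('```') + 3);

console.log(naive);
// '```a = 1``` then some prose and ```b = 2```'
```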

Any ideas?


Solution

  • I see you already figured out a workaround, but I'll leave an answer anyway to possibly help other people with the same issue.

    You can indeed use a regex to extract the code blocks from your string. Here is an example using JavaScript:

    const originalString = 'this is some string and I want to extract ```the code contained in the backticks``` and ```another code block```';
    
    const regex = /```(.*?)```/gs;
    const matches = [...originalString.matchAll(regex)];
    
    const codeBlocks = matches.map(match => match[1]);
    
    console.log(codeBlocks);
    // Output: ['the code contained in the backticks', 'another code block']
    

    Let's break down the regex pattern /```(.*?)```/gs step by step:

    1. / and the last / are the delimiters that mark the beginning and the end of the regex pattern.

    2. ``` is a literal match for three backticks. Since the backtick is not a special character in regex, you don't need to escape it. This part of the pattern matches the opening triple backticks.

    3. ( ) are used for capturing groups. Whatever is matched inside the parentheses will be captured as a group, which can later be accessed using array indices. In this case, we're capturing the content between the triple backticks.

    4. .*? Inside the capturing group, we have three parts:

      • . The dot is a metacharacter that matches any single character except newline characters. However, since we're using the s flag (explained later), the dot will also match newline characters in this case.
      • * The asterisk is a quantifier that matches zero or more occurrences of the preceding character or group. In this case, it refers to the dot, so .* means "match any sequence of characters."
      • ? The question mark, when placed after a quantifier (in this case, *), makes the quantifier non-greedy. This means the regex will try to find the shortest possible match instead of the longest. So, .*? means "match the shortest sequence of any characters."
    5. ``` This is another literal match for three backticks, representing the closing triple backticks.

    6. g This is a flag called the "global" flag. It is placed after the regex pattern delimiter, and it tells the regex engine to find all matches in the input string instead of stopping after the first match.

    7. s This is another flag called the "dotall" flag. It is also placed after the regex pattern delimiter, and it modifies the behavior of the dot . metacharacter, allowing it to match newline characters as well.

    The regex pattern /```(.*?)```/gs, in summary, finds all pairs of triple backticks and captures the content between them (including newline characters), while trying to match the shortest possible sequence of characters.
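
To see what the ? and the s flag each contribute, here is a small sketch comparing three variants of the pattern. The sample string is made up for this comparison and is not from the question:

```javascript
// Made-up input with two code blocks; the second one spans two lines.
const sample = 'run ```npm install``` and then ```npm test\nnpm start```';

// Lazy quantifier with the s flag: each block is matched separately.
const lazy = [...sample.matchAll(/```(.*?)```/gs)].map(m => m[1]);

// Greedy quantifier: a single match stretching from the first fence to the last.
const greedy = [...sample.matchAll(/```(.*)```/gs)].map(m => m[1]);

// Lazy quantifier without the s flag: the dot stops at the newline,
// so the second (multi-line) block is not matched at all.
const noDotAll = [...sample.matchAll(/```(.*?)```/g)].map(m => m[1]);

console.log(lazy);     // ['npm install', 'npm test\nnpm start']
console.log(greedy);   // ['npm install``` and then ```npm test\nnpm start']
console.log(noDotAll); // ['npm install']
```

Only the lazy variant with the s flag returns both blocks cleanly, which is why the answer uses it.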

    You can experiment with the regular expression in any online regex tester.

    Now let's see how the code works:

    1. The regex searches for the code blocks.
    2. We use originalString.matchAll(regex) to get an iterator over all matches (matchAll requires the g flag and throws a TypeError without it). We then use the spread syntax ... to convert the iterator to an array.
    3. We use the map() function to extract the captured content (stored in match[1]) from each match and store it in an array called codeBlocks.
    4. codeBlocks will now contain all the code blocks found in the input string.
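
If you need this in more than one place, the steps above can be wrapped in a small function. This is only a sketch: the name extractCodeBlocks and the sample API response are made up for illustration, and the trim() call is an extra assumption (that surrounding whitespace is not wanted), not part of the code above.

```javascript
// Hypothetical helper (not from the original answer): returns the trimmed
// contents of every triple-backtick block found in `text`.
function extractCodeBlocks(text) {
  // matchAll requires the g flag; the s flag lets the dot match newlines.
  const regex = /```(.*?)```/gs;
  // Array.from accepts the iterator plus a mapping function,
  // so the spread and the map() call collapse into one step.
  return Array.from(text.matchAll(regex), match => match[1].trim());
}

// Made-up API response with one multi-line code block.
const apiResponse = 'Install it:\n```\nnpm install\n```\nNo code after this point.';

console.log(extractCodeBlocks(apiResponse));      // ['npm install']
console.log(extractCodeBlocks('no fences here')); // []
```

Returning an empty array when nothing matches means callers can loop over the result without a separate null check.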

    Please let us know if it helps.