javascript regex string reduce tokenize

How to extract symbols, numbers and words from a string and store each into an accordingly categorized array?


How does one extract symbols, numbers, words with at most 3 letters, and words with at least 4 letters from a string, and store each group in its own accordingly categorized array?

The given string is:

const string = 'There are usually 100 to 200 words + in a paragraph'; 

The expected response is:

const numbers = ['200', '100'];

const wordsMoreThanThreeLetters = ['There', 'words', 'paragraph', 'usually'];

const symbols = ['+'];

const words = ['are', 'to', 'in', 'a'];

Solution

  • A valid approach is to split the string at any whitespace sequence and then run a reduce over the array returned by split.

    The reducer function collects the string items (tokens) into specific arrays according to the OP's categories, supported by helper functions for e.g. the digit and word tests ...

    function collectWordsDigitsAndRest(collector, token) {
      // Each helper tests its own argument instead of the outer `token`.
      const isDigitsOnly = value => (/^\d+$/).test(value);
      const isWord = value => (/^\w+$/).test(value);
    
      const listName = isDigitsOnly(token)
        ? 'digits'
        : (
            isWord(token)
            ? (token.length <= 3) && 'shortWords' || 'longWords'
            : 'rest'
        );
      (collector[listName] ??= []).push(token);
    
      return collector;
    }
    const {
    
      longWords: wordsMoreThanThreeLetters = [],
      shortWords: words = [],
      digits: numbers = [],
      rest: symbols = [],
    
    } = 'There are usually 100 to 200 words + in a paragraph'
    
      .split(/\s+/)
      .reduce(collectWordsDigitsAndRest, {});
    
    console.log({
      wordsMoreThanThreeLetters,
      words,
      numbers,
      symbols,
    });
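
    One caveat of splitting at whitespace, shown here as a minimal sketch: punctuation stays attached to its neighbouring word, so a token such as 'paragraph.' fails the /^\w+$/ test and ends up in the rest list. The helper name classifyToken is hypothetical; it only mirrors the category logic of the reducer above.

```javascript
// Hypothetical helper that mirrors the category logic of the reducer above.
function classifyToken(token) {
  if (/^\d+$/.test(token)) {
    return 'digits';
  }
  if (/^\w+$/.test(token)) {
    return token.length <= 3 ? 'shortWords' : 'longWords';
  }
  return 'rest';
}

// Pair every whitespace-separated token with its category.
const categories = 'a paragraph. + 100'
  .split(/\s+/)
  .map(token => [token, classifyToken(token)]);

console.log(categories);
// [['a', 'shortWords'], ['paragraph.', 'rest'], ['+', 'rest'], ['100', 'digits']]
```

    If sentence punctuation is expected in the input, matching tokens instead of splitting at whitespace avoids this.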

    Of course, one could also matchAll the required tokens with a single regular expression (RegExp) that features named capturing groups and uses Unicode property escapes for better internationalization (i18n) coverage.
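
    As a minimal sketch of the two building blocks just mentioned, named capturing groups and Unicode property escapes, the hypothetical pattern regXWordOrNumber below (a simplified illustration, not the regex of this answer) tags each match as a word or a number. It also shows that \p{L} covers letters outside ASCII, which \w does not.

```javascript
// Hypothetical, simplified pattern for illustration only.
// The `u` flag enables \p{...} escapes; the `g` flag is required by matchAll.
const regXWordOrNumber = /(?<word>\p{L}+)|(?<number>\p{N}+)/gu;

const tagged = Array
  .from('Москва 2024 Berlin'.matchAll(regXWordOrNumber))
  // A named group that did not participate in the match is `undefined`.
  .map(({ groups: { word, number } }) =>
    word !== undefined ? { word } : { number }
  );

console.log(tagged);
// [{ word: 'Москва' }, { number: '2024' }, { word: 'Berlin' }]

// \w only covers [A-Za-z0-9_], so it rejects the Cyrillic word.
const matchesAsciiWordClass = /^\w+$/.test('Москва');
console.log(matchesAsciiWordClass); // false
```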

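    The regex itself (assigned to regXWordDigitRestTokens in the code below, with a regex101 link in the comment above it) would look and work like this ...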

    ... derived from ...

    The reducer function of the first approach has to be adapted to this second approach in order to process each captured group accordingly ...

    function collectWordsDigitsAndRest(collector, { groups }) {
      const { shortWord, longWord, digit, rest } = groups;
    
      // The first captured group that is present decides the target list.
      const listName =
        (shortWord && 'shortWords') ||
        (longWord && 'longWords') ||
        (digit && 'digits') ||
        (rest && 'rest');
    
      if (listName) {
        (collector[listName] ??= []).push(shortWord || longWord || digit || rest);
      }
      return collector;
    }
    
    // Unicode Categories ... [https://www.regularexpressions.info/unicode.html#category]
    // regex101.com ... [https://regex101.com/r/nCga5u/2]
    const regXWordDigitRestTokens =
      /(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)/gmu;
    
    const {
    
      longWords: wordsMoreThanThreeLetters = [],
      shortWords: words = [],
      digits: numbers = [],
      rest: symbols = [],
    
    } = Array
      .from(
        'There are usually 100 to 200 words ++ -- ** in a paragraph.'
        .matchAll(regXWordDigitRestTokens)
      )
      .reduce(collectWordsDigitsAndRest, {});
    
    console.log({
      wordsMoreThanThreeLetters,
      words,
      numbers,
      symbols,
    });