How does one extract symbols, numbers, words of at most 3 letters, and words of at least 4 letters from a string and store each in an accordingly categorized array?
The given string is:
const string = 'There are usually 100 to 200 words + in a paragraph';
The expected response is:
const numbers = ['200', '100'];
const wordsMoreThanThreeLetters = ['There', 'words', 'paragraph', 'usually'];
const symbols = ['+'];
const words = ['are', 'to', 'in', 'a'];
A valid approach is to split the string at any whitespace sequence and then run the reduce method on the array returned by split.
The reducer function collects the string items (tokens) into specific arrays according to the OP's categories, supported by helper functions for e.g. the digit and word tests ...
function collectWordsDigitsAndRest(collector, token) {
  // Helper predicates for the digit and word tests.
  const isDigitsOnly = value => (/^\d+$/).test(value);
  // Note: `\w` also matches digits and the underscore.
  const isWord = value => (/^\w+$/).test(value);

  const listName = isDigitsOnly(token)
    ? 'digits'
    : (
      isWord(token)
        ? (token.length <= 3) && 'shortWords' || 'longWords'
        : 'rest'
    );

  // Create the category array on first use, then append the token.
  (collector[listName] ??= []).push(token);

  return collector;
}
const {
  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],
} = 'There are usually 100 to 200 words + in a paragraph'
  .split(/\s+/)
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});
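Splitting at whitespace keeps any punctuation attached to its word, and `\w` only covers ASCII letters, digits and the underscore. The following minimal sketch shows how single tokens end up being categorized under those rules; `classifyToken` is a hypothetical helper introduced only for this illustration, it applies the same tests as the reducer above.

```javascript
// Hypothetical helper (not part of the original answer): classifies a single
// token with the same digit/word tests the reducer uses.
function classifyToken(token) {
  if (/^\d+$/.test(token)) return 'digits';
  if (/^\w+$/.test(token)) return token.length <= 3 ? 'shortWords' : 'longWords';
  return 'rest';
}

const samples = ['There', 'to', '100', '+', 'paragraph.', 'x_1', 'café'];
const report = samples.map(token => `${token} -> ${classifyToken(token)}`);

console.log(report);
// [
//   'There -> longWords',
//   'to -> shortWords',
//   '100 -> digits',
//   '+ -> rest',
//   'paragraph. -> rest',   // the trailing dot is not matched by \w
//   'x_1 -> shortWords',    // \w also matches '_' and digits
//   'café -> rest'          // 'é' is not matched by the ASCII-only \w
// ]
```

Tokens such as 'paragraph.' or 'café' therefore land in the rest category, which is one reason to consider a Unicode-aware pattern.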
Of course, one could also use matchAll to capture the required tokens with a single regular expression / RegExp which features named capturing groups and also uses Unicode property escapes in order to achieve better internationalization (i18n) coverage.
The regex itself would look and work like this ...
... derived from ...
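As a rough illustration of how named capturing groups and Unicode property escapes interact with matchAll, here is a minimal sketch. The pattern `regXLettersOrNumber` and the sample sentence are assumptions made for this example; the pattern is deliberately simpler than the regex of this answer.

```javascript
// Simplified, hypothetical pattern for illustration only.
// \p{L} matches any Unicode letter, \p{N} any Unicode number; both need the `u` flag.
const regXLettersOrNumber = /(?<letters>\p{L}+)|(?<number>\p{N}+)/gu;

// matchAll requires the `g` flag and returns an iterator of match objects.
const matches = [...'Zürich has 12 districts'.matchAll(regXLettersOrNumber)];

// For each match exactly one named group is set, the other one is undefined.
const summary = matches.map(({ groups }) =>
  groups.letters !== undefined
    ? `letters: ${groups.letters}`
    : `number: ${groups.number}`
);

console.log(summary);
// [ 'letters: Zürich', 'letters: has', 'number: 12', 'letters: districts' ]
```

Each match object exposes its captures under `groups`, which is what allows a reducer to decide on the target array by checking which group is set.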
The reducer function of the first approach has to be adapted to this second approach in order to process each captured group accordingly ...
function collectWordsDigitsAndRest(collector, { groups }) {
  const { shortWord, longWord, digit, rest } = groups;

  const listName = (shortWord && 'shortWords')
    || (longWord && 'longWords')
    || (digit && 'digits')
    || (rest && 'rest');

  if (listName) {
    (collector[listName] ??= []).push(shortWord || longWord || digit || rest);
  }
  return collector;
}
// Unicode Categories ... [https://www.regularexpressions.info/unicode.html#category]
// regex101.com ... [https://regex101.com/r/nCga5u/2]
const regXWordDigitRestTokens =
  /(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)/gmu;

const {
  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],
} = Array
  .from(
    'There are usually 100 to 200 words ++ -- ** in a paragraph.'
      .matchAll(regXWordDigitRestTokens)
  )
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});
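Both snippets create a category array only when its first token shows up (via `??=`) and rely on destructuring defaults (`= []`) for categories that never occur. A minimal sketch of that behavior follows; the reducer `collectByKind` and its two categories are hypothetical and reduced for brevity.

```javascript
// Hypothetical two-category reducer, using the same lazy `??=` grouping idea.
function collectByKind(collector, token) {
  const listName = /^\d+$/.test(token) ? 'digits' : 'rest';

  // The array is created on first use, so unused categories stay absent.
  (collector[listName] ??= []).push(token);

  return collector;
}

// No token is digits-only, hence the reduced object has no `digits` key
// and the destructuring default `[]` is applied.
const { digits: numbers = [], rest: others = [] } =
  'no numbers here'.split(/\s+/).reduce(collectByKind, {});

console.log(numbers); // []
console.log(others);  // [ 'no', 'numbers', 'here' ]
```

Without the defaults, a missing category would be destructured as `undefined` instead of an empty array.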