Search code examples
excelvbaexcel-formulaword-countcharactercount

Is there a way to count the number of characters per word for a string, returning values separated by a comma?


I have a list of strings in cells - 1000s of them - and I need to work out the characters per word but separated by word - preferably in 1 swift formula...

For Example: 1. "Black Cup With Handle" > Formula I need > 5,3,4,6

  1. "Giant Bear Statue" > Formula I need > 5,4,6

I need this for a recurring task which has been macro'd in a very inefficient way to count words into columns (of which we need to use up to 20 for the just encase) but this needs to be tackled.

Usually, we count the spaces and layer nested serach() formulas to piggyback onto one and other to break down the structure then character counts the individual words...

I could alternatively the macro to substitute the spaces for commas and used text to columns but that still leaves me with a prolonged counting process for what im looking for

we obviously use =LEN(A1)-LEN(SUBSTITUTE(A1," ","")) to count the spaces in the word

we currently then use =SEACRH() function combined with =MID() functions (and some bizarre numbers) to reveal each word into its own individual cell

then =LEN once again bu on all individual words - very long-winded

Im hoping to find a shorter way to do this but feeling there may not be a dynamic enough way to do it with formula alone, hoping someone can prove me wrong!


Solution

  • You'll have different options depending on your Excel version.


    OPTION 1: TEXTJOIN

    I think you are looking for a TEXTJOIN function. Just bare in mind that you can only use this the more later versions of Excel (see link to documentation) and it could work like this:

    enter image description here

    Formula in B1:

    =TEXTJOIN(",",TRUE,LEN(FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s")))
    

    NOTE: It's an array formula and you need to enter it using CtrlShiftEnter

    To make it so that you won't need to use the above key-combo, we can include an INDEX:

    =TEXTJOIN(",",TRUE,INDEX(LEN(FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s")),))
    

    Additional Information:

    FILTERXML

    This function takes (as per documentation) two required arguments:

    • A string in valid XML
    • A string in valid XPath

    Because we want to return an array of elements (words) from the cell, we need to SUBSTITUTE the spaces for end-tags (</..>) and concatenate that with a start-tag (<..>) at the start of the string and another end-tag at the end.

    I'll have to rely on an XML explaination on the tags as to why <?><?> works and it's meaning, because as far as my testing goes I could swap the letters around or replace by another letter with the same results as long as the final Xpath would resemble the same character. It would be great if someone would be able to complement this answer with a better explanation on this matter.

    For more FILTERXML "tricks", have a look here


    TEXTJOIN

    If you are a Office 365 subscriber or own Excel 2019 you can make use of this function. There are (as per documentation) at least 3 required arguments:

    • A delimiter which must be a text string, either empty, or one or more characters enclosed by double quotes, or a reference to a valid text string. If a number is supplied, it will be treated as text.
    • The second argument can hold either TRUE or FALSE and determines whether or not you want to exclude/include empty values
    • The third argument is the text item to be joined. A text string, or array of strings, such as a range of cells.

    Now this is where we can join the two functions together, FILTERXML returning an array which we can use in TEXTJOIN.


    INDEX + LEN

    I'll have to explain the use of these functions together. I don't think LEN and INDEX will need much of an introduction on their own, but together they work quite nicely. Natively there will be a force called implicit intersection that will prevent LEN from returning an array of values when you pass an array of values to the function, in this case through our FILTERXML.

    Normally you would disable this mechanism using a key combination of: CtrlShiftEnter, better known as CSE.

    Now what INDEX does is disabling this implicit intersection making LEN able to return an array, removing the need to CSE the formula. INDEX is one of the functions that has this "power". A more in depth explanation on implicit intersection can be found here


    OPTION 2: UDF

    Without access to TEXTJOIN I think you'll need to have a look at using an UDF, possibly looking like below:

    Function TEXTJOIN(rng As Range) As String
        TEXTJOIN = Join(Application.Evaluate("LEN({""" & Join(Split(rng, " "), """,""") & """})"), ",")
    End Function
    

    You can call this in B1 like so: =TEXTJOIN(A1)


    Additional Information:

    The UDF consists out of three main mechanisms that work together:

    JOIN

    This funciton takes two parameters, where the first one is required:

    • First parameter is a one-dimensional array containing substrings
    • The second (optional) parameter is a string character used to separate the substrings in the returned string. If omitted, the space character (" ") is used. If delimiter is a zero-length string (""), all items in the list are concatenated with no delimiters.

    The function returns a string value


    SPLIT

    This function takes a string and delimits it by a specified character/substring. It takes the following arguments:

    • 1st: A required string expression containing substrings and delimiters. If expression is a zero-length string(""), Split returns an empty array, that is, an array with no elements and no data.
    • 2nd: The optional delimiter which is a string character used to identify substring limits. If omitted, the space character (" ") is assumed to be the delimiter. If delimiter is a zero-length string, a single-element array containing the entire expression string is returned.
    • 3rd: An optional limit, a number of substrings to be returned; -1 indicates that all substrings are returned.
    • 4th: Compare, also optional, is a numeric value indicating the kind of comparison to use when evaluating substrings. See Settings section for values.

    In this case we would only need the first two arguments.


    Application.Evaluate

    This is IMO one of the most handy mechanisms you can use to pull of a returned array of values without having to loop through items/cells. It may get slow when you feed the function a large array formula, but in this case it will be fine. The funtion converts a Microsoft Excel name into an object or value, and when we pass it an formula, it thus will return the results. In this particular case it will return an array.