Search code examples
excelvbaexcel-formulatext-extraction

Extract words from cell that are exactly 10 characters long and contain number and letter


I have a list of Lenovo products. I am looking for a way to extract their Product number from it.

The number may sit anywhere in the cell.
The number is 10 characters long, it always contains both numbers and letters and no special characters between them.
Sometimes the number is in brackets or has some special characters before or after (like - _ " < etc.).
I don't need them, I am looking for exactly 10 letters or digits and have at least 1 digit in it.
In the last line from the example there are two words with 10 characters - THINKCENTRE and 12Q6000AGE. THINKCENTRE doesn't contain a number, so it should be excluded.

Input Desired result
LEGION T5 (90SV003WGE) 90SV003WGE
FLEX 5 (82R700BEGE) XKLUSIV 82R700BEGE
V17-IRU 83A2001NGE 83A2001NGE
LENOVO E16 G1 21JT000HGE_WIN 11 21JT000HGE
LENOVO THINKCENTRE M70T 12Q6000AGE I7-12 12Q6000AGE
IDEAPAD 3 17ALC6 0 or empty

I tried different formulas but it gives me various results. I managed to find a way to extract the longest word, but very often this number isn't the longest and/or includes a lot of characters for cleanup.


Solution

  • Starting assumptions:

    • Substrings of 10 characters only;
    • Substrings only contain characters [A-Z0-9]
    • Contains at least one digit [0-9];
    • Contains at least one character [A-Z].

    DAF:

    You could try to split your input on anything that isn't alphanumeric. That can be achieved with a double TEXTSPLIT() rather easy. Now you can check these substrings against your criteria using some boolean structure. For example:

    enter image description here

    Formula in B2:

    =MAP(A2:A8,LAMBDA(s,LET(i,SEQUENCE(36)-1,r,TEXTSPLIT(s,TEXTSPLIT(s,BASE(i,36),,1)),TEXTJOIN(", ",,REPT(r,ISERR(-r)*(TEXTBEFORE(r&0,i)<>r)*(LEN(r)=10))))))
    

    UDF:

    You can also use an UDF applying some regular expression to extract the same as the above:

    Public Function RegexMatch(s As String) As String
    
    Static RE As Object: If RE Is Nothing Then Set RE = CreateObject("vbscript.regexp")
    
    RE.Pattern = "(?:\b|_)(?!\d+\b|[A-Z]+\b)([A-Z\d]{10})(?:\b|_)"
    RE.Global = True
    
    Set REMatches = RE.Execute(s)
    If REMatches.Count > 0 Then
        For Each Match In REMatches
            If RegexMatch = "" Then
                RegexMatch = Match.Submatches(0)
            Else
                RegexMatch = RegexMatch & ", " & Match.Submatches(0)
            End If
        Next
    Else
        RegexMatch = vbNullString
    End If
    
    End Function
    

    The pattern's explaination can be found here

    You can invoke the UDF using: =RegexMatch(A2) or embedded in a MAP(A2:A8,LAMBDA(s,RegexMatch(s))) structure.