Search code examples
excelvbaexcel-formulaworksheet-function

How to extract text within a string of text


I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.

What I would hope the output would look like would be something like:

297848936,297338191

168010496

etc...


Solution

  • Here is a very flexible VBA answer using the regex object. What the function does is extract every single sub-group match it finds (stuff inside the parenthesis), separated by whatever string you want (default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

    You would call it like this, assuming that first string is in A1:

    =RegexExtract(A1,"gi[|](\d+)[|]")
    

    Since this looks for all occurance of "gi|" followed by a series of numbers and then another "|", for the first line in your question, this would give you this result:

    297848936, 297338191
    

    Just run this down the column and you're all done!

    Function RegexExtract(ByVal text As String, _
                          ByVal extract_what As String, _
                          Optional separator As String = ", ") As String
    
    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    Dim i As Long, j As Long
    Dim result As String
    
    RE.pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)
    
    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next
    
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If
    
    RegexExtract = result
    
    End Function