Search code examples
stringexcelvbadna-sequence

Need info on Excel and VBA string processing and storage limits -- and suggested workarounds


According to this Microsoft blog post the max string length per cell is 32k for MS Office 2010; I've also confirmed this via testing. Problem is that I've got strings (DNA sequences) that are well above that length, and I'm matching sub-sequences of DNA on the whole 32k+ sequence that could match anywhere on the main sequence; meaning that I can not simple break the main sequence out into 32k chucks, since I need to be able to string match the "sub-string-sequence" to the entire "main-string-sequence". One thing that is not clear is if VBA supports processing of strings large than 32k, if VBA supports string concatenation in excess of 32k that might be a work around; meaning I chunk the "main-string-sequence" into 32k chunks on a row to the N-th column, then when I need to process a match just concatenate the strings in a row from column 1 to the N-th, process the match, then dump the temp "main-string-sequence" being stored in VBA.

So, basically the issue is MS-Office 2010 only supports strings per cell up to 32k in length, and I've got strings that are much larger than that which need to be processed in whole form in order for the string matching to work.


Solution

  • VBA dynamic strings (Dim x As String) can hold 2^31 characters, so you should be good there. I'm not sure why you are writing them to cells. Unless you need to for some other reason, I would do it all in VBA. It's not clear where you're getting the strings.

    Here's some code that makes a fake 40,000 character DNA sequence and a fake 5 character substring, then finds where one is inside the other.

    Sub FindDNASubString()
    
        Dim lRnd As Long
        Dim i As Long
        Dim sMain As String
        Dim sSub As String
        Dim vaLetters As Variant
        Dim lPos As Long
    
        Const lUPPER As Long = 3
        Const lLOWER As Long = 0
    
        vaLetters = Array("A", "C", "T", "G")
    
        'Create fake main string
        For i = 1 To 40000
            lRnd = Int((lUPPER - lLOWER + 1) * Rnd + lLOWER)
            sMain = sMain & vaLetters(lRnd)
        Next i
    
        'create fake substring
        For i = 1 To 5
            lRnd = Int((lUPPER - lLOWER + 1) * Rnd + lLOWER)
            sSub = sSub & vaLetters(lRnd)
        Next i
    
        'find position of sub in main
        lPos = InStr(1, sMain, sSub)
    
        MsgBox "Substring is at position " & lPos
    
    End Sub