Search code examples
vbaexceltrie

Is there any way I can speed up this VBA algorithm?


I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:

Node Class Module:

Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean

Main Module:

Dim tree As Node

Sub build_trie()
    Set tree = New Node
    Dim file, a, b, c As Integer
    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file
    Do While Not EOF(file)
        Dim line As String
        Line Input #file, line
        wordlist.add line
    Loop
    For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wordlist.Item(a))
            Dim match As Boolean
            match = False
            Dim char As String
            char = Mid(wordlist.Item(a), b, 1)
            For c = 1 To current.next_nodes.Count
                If char = current.next_nodes.Item(c).letter Then
                    Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next c
            If Not match Then
                Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next a
End Sub

My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collections are not as efficient as Dictionarys and so I attempted a Dictionary-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.

Thanks in advance

NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)


Solution

  • There are a number of small issues and a few larger opportunities here. You did say this is your first vba work, so forgive me if I'm telling you things you already know

    Small things first:
    Dim file, a, b, c As Integer declares file, a and b as variants. Integer is 16 bit sign, so there may be risk of overflows, use Long instead.

    DIM'ing inside loops is counter-productive: unlike C++ they are not loop scoped.

    The real opportunity is:

    Use For Each where you can to iterate collections: its faster than indexing.

    On my hardware your original code ran in about 160s. This code in about 2.5s (both plus time to load word file into the collection, about 4s)

    Sub build_trie()
        Dim t1 As Long
        Dim wd As Variant
        Dim nd As Node
    
        Set tree = New Node
        ' Dim file, a, b, c As Integer  : declares file, a, b as variant
        Dim file As Integer, a As Long, b As Long, c As Long     ' Integer is 16 bit signed
    
        Dim current As Node
        Dim wordlist As Collection
        Set wordlist = New Collection
        file = FreeFile
        Open "C:\corncob_caps.txt" For Input As file
    
        ' no point in doing inside loop, they are not scoped to the loop
        Dim line As String
        Dim match As Boolean
        Dim char As String
        Dim new_node As Node
    
        Do While Not EOF(file)
            'Dim line As String
            Line Input #file, line
            wordlist.Add line
        Loop
    
    
        t1 = GetTickCount
        For Each wd In wordlist ' for each is faster
        'For a = 1 To wordlist.Count
            Set current = tree
            For b = 1 To Len(wd)
                'Dim match As Boolean
                match = False
                'Dim char As String
                char = Mid$(wd, b, 1)
                For Each nd In current.next_nodes
                'For c = 1 To current.next_nodes.Count
                    If char = nd.letter Then
                    'If char = current.next_nodes.Item(c).letter Then
                        Set current = nd
                        'Set current = current.next_nodes.Item(c)
                        match = True
                        Exit For
                    End If
                Next nd
                If Not match Then
                    'Dim new_node As Node
                    Set new_node = New Node
                    new_node.letter = char
                    current.next_nodes.Add new_node
                    Set current = new_node
                End If
            Next b
            current.is_word = True
        Next wd
    
        Debug.Print "Time = " & GetTickCount - t1 & " ms"
    End Sub
    

    EDIT:

    loading the word list into a dynamic array will reduce load time to sub second. Be aware that Redim Preserve is expensive, so do it in chunks

        Dim i As Long, sz As Long
        sz = 10000
        Dim wordlist() As String
        ReDim wordlist(0 To sz)
    
        file = FreeFile
        Open "C:\corncob_caps.txt" For Input As file
    
        i = 0
        Do While Not EOF(file)
            'Dim line As String
            Line Input #file, line
            wordlist(i) = line
            i = i + 1
            If i > sz Then
                sz = sz + 10000
                ReDim Preserve wordlist(0 To sz)
            End If
            'wordlist.Add line
        Loop
        ReDim Preserve wordlist(0 To i - 1)
    

    then loop through it like

        For i = 0 To UBound(wordlist)
            wd = wordlist(i)