Implementing tokenize function with CString


For the sake of learning, I'm trying to implement my own simple Tokenize function with CStrings. I currently have this file:

11111
22222
(ENDWPT)


222222
333333
(ENDWPT)
6060606
ggggggg
hhhhhhh
(ENDWPT)
iiiiiii
jjjjjjj
kkkkkkk
lllllll
mmmmmmm
nnnnnnn

I would like to tokenize it using the delimiter (ENDWPT). I wrote the following function, which finds the delimiter position, adds the delimiter length, and extracts the text up to that position. It then updates a counter so that the next call starts searching from where the previous one stopped. The function looks like this:

bool MyTokenize(CString strText, CString& strOut, int& iCount)
{
    CString strDelimiter = L"(ENDWPT)";
    int iIndex = strText.Find(strDelimiter, iCount);

    if (iIndex != -1)
    {
        iIndex += strDelimiter.GetLength();
        strOut = strText.Mid(iCount, iIndex);
        iCount = iIndex;
        return true;
    }
    return false;
}

It is called like so:

int nCount = 0;

while ((MyTokenize(strText, strToken, nCount)) == true)
{
    // Handle tokenized strings here
}

Right now, the function splits the strings incorrectly. I think this is because Find() may be returning the wrong index: I expected 12, but it actually returns 14. I have run out of ideas, so if anyone can figure this out I would really appreciate it.


Solution

  • The second argument of Mid() is a character count, not an end index. If the delimiter is found (iIndex != -1), read iIndex - iCount characters starting at iCount, then advance iCount past the delimiter:

    if(iIndex != -1)
    {
        strOut = strText.Mid(iCount, iIndex - iCount);
        iCount = iIndex + strDelimiter.GetLength();
        return true;
    }
    

    The source string may not end with the delimiter, so that case needs special handling.

    You can also pick names that match the parameters of CString::Mid(int nFirst, int nCount) to make the code easier to follow. MFC uses Hungarian notation (type prefixes in front of variable names), which is unnecessary in C++, so I'll avoid it in this example:

    bool MyTokenize(const CString& source, CString& token, int& first)
    {
        CString delimiter = L"(ENDWPT)";
        int end = source.Find(delimiter, first);
    
        if(end != -1)
        {
            // Delimiter found: the token is the text in [first, end)
            int count = end - first;
            token = source.Mid(first, count);
            first = end + delimiter.GetLength(); // resume after the delimiter
            return true;
        }
        else
        {
            // No more delimiters: return the remaining text, if any
            int count = source.GetLength() - first;
            if(count <= 0)
                return false;
    
            token = source.Mid(first, count);
            first = source.GetLength();
            return true;
        }
    }
    
    ...
    
    int first = 0;
    CString source = ...
    CString token;
    while(MyTokenize(source, token, first))
    {
        // Handle tokenized strings here
    }
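
For anyone who wants to try the same logic without MFC, here is a minimal sketch that swaps CString for std::string. It assumes that std::string::find and std::string::substr can stand in for CString::Find and CString::Mid (both take a start position, and substr/Mid take a count rather than an end index). The names TokenizeStd and SplitAll are hypothetical helpers made up for this illustration, not part of MFC or of the code above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical portable analogue of MyTokenize: std::string stands in for
// CString, find() for Find(), and substr() for Mid().
bool TokenizeStd(const std::string& source, const std::string& delimiter,
                 std::string& token, std::size_t& first)
{
    if (delimiter.empty() || first >= source.size())
        return false;                               // nothing left to extract

    const std::size_t end = source.find(delimiter, first);
    if (end != std::string::npos)
    {
        token = source.substr(first, end - first);  // count, not an end index
        first = end + delimiter.size();             // resume after the delimiter
        return true;
    }

    token = source.substr(first);                   // trailing text, no delimiter
    first = source.size();
    return true;
}

// Hypothetical convenience wrapper that collects every token.
std::vector<std::string> SplitAll(const std::string& source,
                                  const std::string& delimiter)
{
    std::vector<std::string> tokens;
    std::string token;
    std::size_t first = 0;
    while (TokenizeStd(source, delimiter, token, first))
        tokens.push_back(token);
    return tokens;
}
```

Calling SplitAll(text, "(ENDWPT)") returns one string per block, with the line breaks around each block left in place. As a side note on the index in the question: if the file uses Windows CRLF line endings, the first delimiter sits at index 14 (5 + 2 + 5 + 2) rather than 12, so the value returned by Find() may well have been correct and only the count passed to Mid() was wrong.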