
Read with StreamReader up to a certain string


I'm writing a file parser in a .NET application that reads the file with a StreamReader. The file to be parsed starts with a header that ends with "<eoh>". I want to either read or ignore everything from the start until that string. The actual data starts after that.

The file is not line-based. Everything is separated only by such marker strings, so I cannot use ReadLine.

How can I do that without reading one character at a time and implementing a state machine to recognise the marker's characters? I'm specifically looking for a method like StreamReader.SkipUntilAfter(string) or StreamReader.ReadUntil(string).

Oh, and this project is still using .NET 2.0, so LINQ is not desired here. Although I could probably resolve that if somebody suggests using it.


Solution

  • A TextReader already reads character by character under the hood. StreamReader uses a buffer to make that fast, but buffering is no different from reading ahead yourself and consuming characters only until the <eoh>. For the same reason, there is no faster way to skip past the header. The best case would be a built-in method that merely hides the same loop, so it wouldn't gain you anything.

    In case you don't believe me for whatever reason, here's the source code.

    Also, it's worth noting that you'll have to look character-by-character no matter what. Even if you had a way of pulling them into memory without doing so, comparing two strings is a character-by-character operation. So you wouldn't be saving anything.

    Personally, I'd just go with something like this. It takes a TextReader and end-of-header string, and reads through the reader until it finds eoh. It then returns a bool for whether it found the marker or not.

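    // Reads from the reader until the end-of-header marker has been consumed.
    // Returns true if the marker was found, false if the input ended first.
    public bool SkipUntilAfterHeader(TextReader reader, string eoh)
    {
        if (string.IsNullOrEmpty(eoh))
        {
            throw new ArgumentException("Marker must not be empty.", "eoh");
        }
    
        int eohGuessIndex = 0; // number of marker characters matched so far
        int next;
    
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
    
            if (c == eoh[eohGuessIndex])
            {
                eohGuessIndex++;
                if (eohGuessIndex == eoh.Length)
                {
                    return true;
                }
            }
            else
            {
                // The mismatched character may itself start the marker
                // (e.g. "<<eoh>"), so re-check it against the first character.
                // This restart is sufficient when the first marker character
                // does not recur inside the marker, as with "<eoh>"; a marker
                // like "aab" needs a full KMP-style fallback.
                eohGuessIndex = (c == eoh[0]) ? 1 : 0;
            }
        }
    
        return false;
    }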
    

    I'm not sure what .NET 2.0 did or didn't include, so I wrote some things from scratch that may already exist in the framework. Performance shouldn't be affected either way. A nice aspect of this approach is that you could easily add a StringBuilder and an out parameter to hand back the header contents, in case you want them later on.

    Then, usage is pretty simple.

    public void ReadFile(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            if (SkipUntilAfterHeader(reader, "<eoh>"))
            {
                // read file
            }
            else
            {
                // corrupt file
            }
        }
    }
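
    The StringBuilder idea mentioned above could be sketched roughly as follows. The class name HeaderReader and the method name ReadUntilAfterHeader are made up for illustration, and the sketch assumes the marker's first character does not recur inside the marker (true for "<eoh>"). It only uses calls that should be available in .NET 2.0.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class HeaderReader
    {
        // Hypothetical helper: consumes the reader up to and including the
        // marker and hands back the header text (without the marker).
        // Assumption: eoh[0] does not occur again inside eoh.
        public static bool ReadUntilAfterHeader(TextReader reader, string eoh, out string header)
        {
            if (string.IsNullOrEmpty(eoh))
            {
                throw new ArgumentException("Marker must not be empty.", "eoh");
            }

            StringBuilder sb = new StringBuilder();
            int matched = 0; // number of marker characters matched so far
            int next;

            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;
                sb.Append(c);

                if (c == eoh[matched])
                {
                    matched++;
                    if (matched == eoh.Length)
                    {
                        // Drop the marker itself from the captured text.
                        sb.Length -= eoh.Length;
                        header = sb.ToString();
                        return true;
                    }
                }
                else
                {
                    // The mismatched character may itself start the marker.
                    matched = (c == eoh[0]) ? 1 : 0;
                }
            }

            // Marker not found: return everything that was read.
            header = sb.ToString();
            return false;
        }
    }
    ```

    After a successful call, the reader is positioned directly after the marker, so the data can be read from the same reader.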
    

    But, realistically, it might just be easier to read the whole file into memory and return only the relevant part. It depends on how important performance and memory use are, compared to readability.
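
    A minimal sketch of that whole-file approach, assuming the file fits comfortably in memory. WholeFileParser, GetBodyAfterHeader and ReadBody are hypothetical names; File.ReadAllText and the IndexOf overload taking a StringComparison should both exist in .NET 2.0.

    ```csharp
    using System;
    using System.IO;

    public static class WholeFileParser
    {
        // Hypothetical helper: returns the text after the first occurrence
        // of the marker, or null if the marker is missing.
        public static string GetBodyAfterHeader(string content, string eoh)
        {
            // Ordinal comparison avoids culture-specific matching surprises.
            int index = content.IndexOf(eoh, StringComparison.Ordinal);
            if (index < 0)
            {
                return null;
            }
            return content.Substring(index + eoh.Length);
        }

        // Reads the whole file into memory, then cuts off the header.
        public static string ReadBody(string path, string eoh)
        {
            return GetBodyAfterHeader(File.ReadAllText(path), eoh);
        }
    }
    ```

    A null result here plays the same role as the "corrupt file" branch in the streaming version.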

    And in classically bad form, note that I haven't tested--or even compiled--any of this. But it should be relatively easy to fix if it doesn't work.
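
    One caveat on the simple matcher: a marker whose first character repeats inside it (say "aab" in the input "aaab") needs a smarter fallback on a mismatch than restarting from the beginning. A sketch of a general version using the Knuth-Morris-Pratt failure table is below. This is a swapped-in technique, not something the framework provides, and the names MarkerSearch, BuildFailureTable and SkipUntilAfter are made up for illustration.

    ```csharp
    using System;
    using System.IO;

    public static class MarkerSearch
    {
        // fail[i] is the length of the longest proper prefix of
        // marker[0..i] that is also a suffix of it.
        private static int[] BuildFailureTable(string marker)
        {
            int[] fail = new int[marker.Length];
            int k = 0;
            for (int i = 1; i < marker.Length; i++)
            {
                while (k > 0 && marker[i] != marker[k])
                {
                    k = fail[k - 1];
                }
                if (marker[i] == marker[k])
                {
                    k++;
                }
                fail[i] = k;
            }
            return fail;
        }

        // Consumes the reader up to and including the first occurrence of
        // the marker. Returns false if the input ends before it is found.
        public static bool SkipUntilAfter(TextReader reader, string marker)
        {
            if (string.IsNullOrEmpty(marker))
            {
                throw new ArgumentException("Marker must not be empty.", "marker");
            }

            int[] fail = BuildFailureTable(marker);
            int matched = 0; // number of marker characters matched so far
            int next;

            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;

                // Fall back to the longest shorter prefix that still fits.
                while (matched > 0 && c != marker[matched])
                {
                    matched = fail[matched - 1];
                }
                if (c == marker[matched])
                {
                    matched++;
                    if (matched == marker.Length)
                    {
                        return true;
                    }
                }
            }

            return false;
        }
    }
    ```

    It still reads one character at a time and never re-reads input, so it works with any forward-only TextReader.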