We are using Sep to process CSV files before a bulk import into a database. The process is fast, but Sep trades some error correction for raw speed.
One of our vendors is submitting CSV files with an invalid escape character (a \" inside an otherwise properly escaped string field). Sep does not gracefully handle this, and the vendor is not fixing their process yet, so the third option I see is to correct the data during ingestion. Sep's API can read from a Stream, a TextReader, a byte[], or take a filename directly and do its own reading.
What would be the correct way to implement a filter similar to this signature? (I am aware this is not valid C#.) The idea is that this filter would sit between the raw, potentially broken file on disk and the TextReader overload that Sep can read from. I have ideas for how to implement this inefficiently, but anything I do with string parsing immediately tanks the application's performance.
private TextReader FilterData(TextReader raw)
{
    return raw.Replace("\\\"", "");
}
A minimal reproduction of the problem is here: https://github.com/bryanboettcher/SepCrash, but I'm not trying to focus on Sep. Sep is behaving correctly.
The data itself is reproduced in the repo, from a fakenamegenerator mass datafile. Line 3 is the offending line, and is extremely representative of our actual data.
Number,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,State,StateFull,ZipCode
1,female,American,Ms.,Danette,J,Bradham,"1385 Foley Street","Fort Lauderdale",FL,Florida,33308
2,female,American,Mrs.,Karla,F,Gandy,"2184 Irish Lane","La Crosse",WI,Wisconsin,54601
3,male,American,Mr.,Kevin,V,Jenkins,"1365\" Rosemont Avenue","Los Angeles",CA,California,90031
4,female,American,Ms.,Vickie,N,Howard,"3847 Java Lane",Columbia,SC,"South Carolina",29201
5,female,American,Ms.,Bonnie,S,Winship,"3389 Shadowmar Drive","New Orleans",LA,Louisiana,70118
Line 3's Address is "1365\" Rosemont Avenue". For proper CSV formatting, it should be "1365\"" Rosemont Avenue". We can handle the errant quote-slash in application code, but only after it's been parsed by the library. The offending field is an address, and will be cleaned up by an address validation step later, so for this purpose either "1365 Rosemont Avenue" or "1365\"" Rosemont Avenue" would work.
You can implement a Stream that wraps the input stream. This isn't optimized the way Sep is, but it's easy compared to something more optimized while not being particularly inefficient: it doesn't buffer the whole input in memory, and it adds no copying or decoding.
using System;
using System.Diagnostics;
using System.IO;

/// <summary>
/// Replaces all instances of <c>\"</c> with <c>""</c> in an ASCII-compatible stream.
/// </summary>
class ReplacementHack : Stream {
    private readonly Stream Wrapped;

    /// <summary>
    /// Up to one byte of data that has been read from the underlying stream
    /// but not yet read from this stream.
    /// </summary>
    private byte? Buffered = null;

    public override bool CanRead => true;
    public override bool CanWrite => false;
    public override bool CanSeek => false;

    public ReplacementHack(Stream wrapped) {
        this.Wrapped = wrapped;
    }

    public override int Read(byte[] buffer, int offset, int count) =>
        this.Read(new Span<byte>(buffer, offset, count));

    public override int Read(Span<byte> buffer) {
        if (buffer.IsEmpty) {
            return 0;
        }
        int r;
        if (this.Buffered is byte b) {
            // Emit the byte held over from the previous read before reading more.
            this.Buffered = null;
            buffer[0] = b;
            r = checked(1 + this.Wrapped.Read(buffer.Slice(1)));
        } else {
            r = this.Wrapped.Read(buffer);
        }
        this.Transform(buffer.Slice(0, r));
        return r;
    }

    private void Transform(Span<byte> buffer) {
        if (buffer.IsEmpty) {
            return;
        }
        const byte ASCII_BACKSLASH = 0x5c;
        const byte ASCII_DQUOTE = 0x22;
        // Replace all complete instances of \" in the buffer.
        for (int i = 0; i < buffer.Length - 1; i++) {
            if (buffer[i] == ASCII_BACKSLASH && buffer[i + 1] == ASCII_DQUOTE) {
                buffer[i] = ASCII_DQUOTE;
                i++;
            }
        }
        // If the last byte of the buffer could be the start of \", peek one byte ahead to resolve it.
        if (buffer[buffer.Length - 1] == ASCII_BACKSLASH) {
            var peek = this.Wrapped.ReadByte();
            if (peek == -1) {
                Debug.Assert(this.Buffered is null);
            } else {
                this.Buffered = (byte)peek;
                if (peek == ASCII_DQUOTE) {
                    buffer[buffer.Length - 1] = ASCII_DQUOTE;
                }
            }
        }
    }

    // Dispose the wrapped stream along with this one.
    protected override void Dispose(bool disposing) {
        if (disposing) {
            this.Wrapped.Dispose();
        }
        base.Dispose(disposing);
    }

    #region Non-seekable stream
    public override long Length => throw new NotSupportedException();
    public override long Position {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    #endregion

    #region Read-only stream
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override void Flush() {}
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long length) => throw new NotSupportedException();
    #endregion
}
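As a quick sanity check, here is a minimal sketch of using the ReplacementHack class above with a MemoryStream and a plain StreamReader (the Demo class and sample string are illustrative, not part of the answer's code); the resulting stream can be handed to Sep's Stream-reading overload the same way:

```csharp
using System;
using System.IO;
using System.Text;

class Demo {
    static void Main() {
        // Line 3 of the sample data, with the vendor's invalid \" escape.
        var broken = "3,male,American,Mr.,Kevin,V,Jenkins,\"1365\\\" Rosemont Avenue\",\"Los Angeles\",CA,California,90031";
        using var source = new MemoryStream(Encoding.ASCII.GetBytes(broken));
        using var filtered = new ReplacementHack(source);
        using var reader = new StreamReader(filtered, Encoding.ASCII);
        var fixedLine = reader.ReadToEnd();
        // \" has been rewritten to "", which standard CSV parsers accept.
        Console.WriteLine(fixedLine);
        // Prints: 3,male,American,Mr.,Kevin,V,Jenkins,"1365"" Rosemont Avenue","Los Angeles",CA,California,90031
    }
}
```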
(If you find that this isn’t fast enough, as I mentioned in a comment, consider putting it between the network and the disk instead of between the disk and Sep.)
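That placement amounts to a one-time CopyTo through the same wrapper when the file arrives, so the copy on disk is already clean and Sep reads it at full speed. A sketch, with hypothetical file paths standing in for the vendor's network stream:

```csharp
using System.IO;

class SanitizeOnIngest {
    static void Main() {
        // Hypothetical paths; in practice the source would be the incoming network stream.
        using var source = File.OpenRead("vendor-upload.csv");
        using var filtered = new ReplacementHack(source);
        using var destination = File.Create("clean.csv");
        // Every \" is rewritten to "" as the bytes stream through to disk.
        filtered.CopyTo(destination);
    }
}
```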
Addendum by OP with test results:
valid: Unfiltered: 396ms (raw Sep speed)
valid: Filtered: 490ms
broken: Unfiltered: Exception was thrown
broken: Filtered: 478ms
The filtered and unfiltered results on the valid file both hashed to the same value, too, showing that when the invalid character sequence isn't present, the stream passes through completely unmodified. The files were read once prior to any timing runs, so that all runs would use files cached in RAM.