Search code examples
regexvb.netdelimiterdelimited

Changing a pipe delimited file to comma delimited in VB.net


So I have a set of pipe delimited inputs which are something like this:

"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682 | 82524 | 6846419 | 68247

and I am converting them to comma delimited using the code given below:

 Dim line As String
    Dim fields As String()
    Using sw As New StreamWriter("c:\test\output.txt")
        Using tfp As New FileIO.TextFieldParser("c:\test\test.txt")
            tfp.TextFieldType = FileIO.FieldType.Delimited
            tfp.Delimiters = New String() {"|"}
            tfp.HasFieldsEnclosedInQuotes = True
            While Not tfp.EndOfData
                fields = tfp.ReadFields
                line = String.Join(",", fields)
                sw.WriteLine(line)
            End While
        End Using
    End Using

So far so good. It only considers the delimiters that are present outside the quotes and changes them to the comma delimiter. But trouble starts when I have input with a stray quotation like below:

"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682 | "82524 | 6846419 | 68247

Here the code gives

MalformeLineExcpetion

Which I realize is due to the stray quotation in my input and since i am like a total noob in RegEx so i am not able to use it here(or I am incapable of). If anyone has any idea, it would be much appreciated.


Solution

  • Here is the coded procedure described in the comments:

    • Read all the lines of the original input file,
    • fix the faulty lines (with Regex or anything else that fits),
    • use TextFieldParser to perform the parsing of the correct input
    • Join() the input parts created by TextFieldParser using , as separator
    • save the fixed, reconstructed input lines to the final output file

    I'm using Wiktor Stribiżew Regex pattern: it looks like it should work given the description of the problem.

    Note:
    Of course I don't know whether a specific Encoding should be used.
    Here, the Encoding is the default UTF-8 no-BOM, in and out.

    "FaultyInput.txt" is the corrupted source file.
    "FixedInput.txt" is the file containing the input lines fixed (hopefully) by the Regex. You could also use a MemoryStream.
    "FixedOutput.txt" is the final CSV file, containing comma separated fields and the correct values.

    These files are all read/written in the executable startup path.

    Dim input As List(Of String) = File.ReadAllLines("FaultyInput.txt").ToList()
    For line As Integer = 0 To input.Count - 1
        input(line) = Regex.Replace(input(line), "(""\b.*?\b"")|""", "$1")
    Next
    
    File.WriteAllLines("FixedInput.txt", input)
    
    Dim output As List(Of String) = New List(Of String)
    Using tfp As New FileIO.TextFieldParser("FixedInput.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {"|"}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            Dim fields As String() = tfp.ReadFields
            output.Add(String.Join(",", fields))
        End While
    End Using
    
    File.WriteAllLines("FixedOutput.txt", output)
    'Eventually...
    'File.Delete("FixedInput.txt")