Search code examples
c#html.netcsvcsvhelper

Need help parsing HTML markup with CSVHelper through my .NET API


I receive a CSV from an third party in this format:

Job Requisition Id,Brand,Is Deleted,Internal Status,Job Code,Job Title-JobRequisitionLocale,Posting Start Date-JobRequisitionPosting,Posting End Date-JobRequisitionPosting,Job Posting Status-JobRequisitionPosting,Minimum Salary,Maximum Salary,Job Board Id-JobRequisitionPosting,Postal Code,Job Description-JobRequisitionLocale
1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,"<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p"

I have shortened the HTML markup, it's much longer. However, I'm trying to read ALL the HTML markup that's at the end of each row, to store in the Job Description column (the last column) as just plain text to store in a DB.

But it seems to keep ending the row at the first comma in the quoted HTML markup after "Calibri": <span style="font-family:Calibri,

This is my controller code:

var jobPositions = new List<JobAdvertModel>();

await foreach (var blobItem in containerClient.GetBlobsAsync())
{
    var blobClient = containerClient.GetBlobClient(blobItem.Name);
    var blobDownloadInfo = await blobClient.DownloadAsync();

    using (var streamReader = new StreamReader(blobDownloadInfo.Value.Content))
    using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ",",
        BadDataFound = context =>
        {
            // Handle or log bad data
            Console.WriteLine($"Bad data found: {context.RawRecord}");
        }
    }))
    {
        csvReader.Context.RegisterClassMap<JobAdvertMap>();
        var records = csvReader.GetRecords<JobAdvertModel>().ToList();
        jobPositions.AddRange(records);
    }
}

return Ok(jobPositions);

(Removed the Azure bits above, but it's reading from an azure blob that is ingested weekly)

I'm trying to get all the HTML markup into 1 column as plain text to store in a DB.


Solution

  • As already pointed out in the comments, your CSV is malformed since quotes are not properly escaped with "". Also the HTML seems to be malformed as well as the </p tag is not properly closed (should be </p>). Having said that, the best option would certainly be to correct this on the producer side and therefore actually get proper data.

    Assuming that's not an option and you always get the same malformed data you could just correct the CSV by properly escaping it and then continue to use the existing CSV parser. You could also correct the HTML, if required, which is not done here.

    using System.Globalization;
    using System.Text.Json;
    using CsvHelper;
    using CsvHelper.Configuration;
    
    namespace MyProgram;
    
    public class Program
    {
        record Test(int LineIndex, string A, string B, string C, string D, string E, string F, string G, string H, string I, string J, string K, string L, string M,
                    string Html);
        public static async Task Main()
        {
            var toBeParsed = """
            0,Job Requisition Id,Brand,Is Deleted,Internal Status,Job Code,Job Title-JobRequisitionLocale,Posting Start Date-JobRequisitionPosting,Posting End Date-JobRequisitionPosting,Job Posting Status-JobRequisitionPosting,Minimum Salary,Maximum Salary,Job Board Id-JobRequisitionPosting,Postal Code,Job Description-JobRequisitionLocale
            1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,"<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p"
            """;
            var correctCsv = CorrectCsv(toBeParsed);
            var config = new CsvConfiguration(CultureInfo.InvariantCulture)
            {
                HasHeaderRecord = false, // just to make it work with the provided "Test" record
                NewLine = Environment.NewLine,
            };
            using var reader = new StringReader(correctCsv);
            using var csvReader = new CsvReader(reader, config);
            var parsedCsv = csvReader.GetRecords<Test>().ToArray();
            Console.WriteLine(JsonSerializer.Serialize(parsedCsv, new JsonSerializerOptions { WriteIndented = true }));
            var lastLineLastRow = parsedCsv[^1].Html;
            Console.WriteLine(lastLineLastRow);
        }
    
        private static string CorrectCsv(string toBeParsed)
        {
            var lastSeparatorIndex = toBeParsed.LastIndexOf(",\"<p"); // search from back for last occurence of ",<p" within the string
            var html = toBeParsed[(lastSeparatorIndex + 2)..].TrimEnd('"'); // extract HTML from string
            var escapedHtml = html.Replace("\"", "\"\""); // escape "
            return toBeParsed[..lastSeparatorIndex] + ",\"" + escapedHtml + "\""; // concatenate new correctly escaped CSV
        }
    }
    

    Expected output:

    [
      {
        "LineIndex": 0,
        "A": "Job Requisition Id",
        "B": "Brand",
        "C": "Is Deleted",
        "D": "Internal Status",
        "E": "Job Code",
        "F": "Job Title-JobRequisitionLocale",
        "G": "Posting Start Date-JobRequisitionPosting",
        "H": "Posting End Date-JobRequisitionPosting",
        "I": "Job Posting Status-JobRequisitionPosting",
        "J": "Minimum Salary",
        "K": "Maximum Salary",
        "L": "Job Board Id-JobRequisitionPosting",
        "M": "Postal Code",
        "Html": "Job Description-JobRequisitionLocale"
      },
      {
        "LineIndex": 1,
        "A": "TEST",
        "B": "TEST",
        "C": "TEST",
        "D": "TEST",
        "E": "TEST",
        "F": "2024-07-16T11:41:50Z",
        "G": "2024-07-31T00:59:59Z",
        "H": "TEST",
        "I": "00.00",
        "J": "00",
        "K": "00",
        "L": "TEST",
        "M": "TEST",
        "Html": "\u003Cp class=\u0022MsoNoSpacing\u0022\u003E\u003Cspan style=\u0022font-size:11pt\u0022\u003E\u003Cspan style=\u0022font-family:Calibri,sans-serif\u0022\u003E\u003Cb\u003E\u003Cspan lang=\u0022EN-US\u0022 style=\u0022font-size:12.0pt\u0022\u003ERole: \u003C/span\u003E\u003C/b\u003E\u003C/p"
      }
    ]
    <p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p
    

    Please note: I had to make a simple adjustment in your data, i.e., adding a 0 as the line index for the first row and so that I could disable HasHeaderRecord in order to use my simple Test record, otherwise the CSV library would throw an error since no matching properties to the header names are found.

    You can make the whole thing a whole lot more performant by using Span<char>s if required.

    Alternatively, you could of course also write your own CSV parser that can deal with/expects malformed data.