
Parsing comma-separated values with line breaks in C


I have a CSV data file that has the following data:

H1,H2,H3
a,"b
c
d",e

When I open it in Excel as a CSV file, the sheet shows the column headings H1, H2, H3 and the column values: a for H1, the multi-line value

b
c
d

for H2, and e for H3.

I need to parse this file with a C program and pick up the values the same way. My code snippet below does not work, because a column can hold a multi-line value:

char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch; 
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
  *pch = 0; 
  strcpy(tokens[i++], ptok);
  ptok = pch+1;
}
strcpy(tokens[i++], ptok);

How can I modify this code snippet to accommodate multi-line column values? Please don't be bothered by the hard-coded sizes of the string buffers; this is test code for a proof of concept. Rather than use a third-party library, I would like to do it the hard way, from first principles. Please help.


Solution

  • The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays, which you are avoiding by using fixed-length strings and arrays. (The other complication is handling CSV that is not well-formed.)

    Without those complications, the parsing is really quite simple:

    (untested)

    /* Appends a non-quoted field to s and returns the delimiter */
    int readSimpleField(struct String* s) {
      for (;;) {
        int ch = getc(stdin);
        if (ch == ',' || ch == '\n' || ch == EOF) return ch;
        stringAppend(s, ch);
      }
    }
    
    /* Appends a quoted field to s and returns the delimiter.
     * Assumes the open quote has already been read.
     * If the field is not terminated, returns ERROR, which
     * should be a value different from any character or EOF.
     * The delimiter returned is the character after the closing quote
     * (or EOF), which may not be a valid delimiter. Caller should check.
     */
    int readQuotedField(struct String* s) {
      for (;;) {
        int ch = getc(stdin);
        if (ch == EOF) return ERROR;
        if (ch == '"') {
          /* A doubled quote is an escaped quote; anything else ends the field */
          ch = getc(stdin);
          if (ch != '"') return ch;
        }
        stringAppend(s, ch);
      }
    }
    
    /* Reads a single field into s and returns the following delimiter,
     * which might be invalid.
     */
    int readField(struct String* s) {
      stringClear(s);
      int ch = getc(stdin);
      if (ch == '"') return readQuotedField(s);
      /* An immediate delimiter means the field is empty */
      if (ch == ',' || ch == '\n' || ch == EOF) return ch;
      stringAppend(s, ch);
      return readSimpleField(s);
    }
    
    /* Reads a single row into row and returns the following delimiter,
     * which might be invalid.
     */
    int readRow(struct Row* row) {
      struct String field = {0};
      rowClear(row);
      /* Make sure there is at least one field */
      int ch = getc(stdin);
      if (ch != '\n' && ch != EOF) {
        ungetc(ch, stdin);
        do {
          ch = readField(&field);
          /* rowAppend is assumed to copy the field's contents */
          rowAppend(row, &field);
        } while (ch == ',');
      }
      return ch;
    }
    
    /* Reads an entire CSV file into table.
     * Returns true if the parse was successful.
     * If an error is encountered, returns false. If the end-of-file
     * indicator is set, the error was an unterminated quoted field; 
     * otherwise, the next character read will be the one which
     * triggered the error.
     */
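    bool readCSV(struct Table* table) {
      tableClear(table);
      struct Row row = {0};
      /* Make sure there is at least one row */
      int ch = getc(stdin);
      if (ch != EOF) {
        ungetc(ch, stdin);
        do {
          ch = readRow(&row);
          /* tableAppend is assumed to copy the row's contents */
          tableAppend(table, &row);
        } while (ch == '\n');
      }
      /* Push back an invalid delimiter so the caller can read it next */
      if (ch != EOF && ch != ERROR) ungetc(ch, stdin);
      return ch == EOF;
    }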
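
    The functions above rely on struct String, stringClear and stringAppend without defining them. The following is only one possible sketch of those helpers: the field layout, the doubling growth policy, the abort-on-failure behaviour and the extra stringFree function are all assumptions made for illustration, not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable string; the layout is an assumption. */
struct String {
  char  *data;  /* NUL-terminated once anything has been appended */
  size_t len;   /* number of characters, excluding the NUL */
  size_t cap;   /* number of bytes allocated for data */
};

/* Empties the string but keeps its buffer for reuse. */
void stringClear(struct String *s) {
  assert(s != NULL);
  s->len = 0;
  if (s->data) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full.
 * Aborts on allocation failure to keep the sketch short. */
void stringAppend(struct String *s, int ch) {
  assert(s != NULL);
  if (s->len + 2 > s->cap) {  /* need room for ch plus the NUL */
    size_t newCap = s->cap ? s->cap * 2 : 16;
    char *p = realloc(s->data, newCap);
    if (p == NULL) abort();
    s->data = p;
    s->cap = newCap;
  }
  s->data[s->len++] = (char)ch;
  s->data[s->len] = '\0';
}

/* Releases the buffer and returns the string to its zero state. */
void stringFree(struct String *s) {
  assert(s != NULL);
  free(s->data);
  s->data = NULL;
  s->len = s->cap = 0;
}
```

    A zero-initialized struct String (as in `struct String field = {0};`) is a valid empty string under this layout, which is what the row reader above depends on. Row and Table could be built the same way, as growable arrays of copied strings and rows.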
    

    The above is "from first principles" -- it does not even use standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although it's a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.
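
    If you want to stay with the fixed-size buffers from the question for the proof of concept, the same character-by-character logic can be folded into one function. This is a minimal sketch under assumptions of my own: the name readRecord, the MAX_FIELDS and MAX_FIELD_LEN limits (taken from the question's tokens[10][30]), and the choice to return -1 on overflow or malformed input are all illustrative.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_FIELDS 10
#define MAX_FIELD_LEN 30

/* Reads one CSV record from in into tokens. Returns the number of
 * fields, 0 at end of input, or -1 on malformed input or overflow.
 * A quoted field may contain commas, line breaks and doubled quotes. */
int readRecord(FILE *in, char tokens[MAX_FIELDS][MAX_FIELD_LEN]) {
  assert(in != NULL);
  int nfields = 0;
  int ch = getc(in);
  if (ch == EOF) return 0;
  for (;;) {
    size_t len = 0;
    int quoted = (ch == '"');
    if (nfields == MAX_FIELDS) return -1;
    if (quoted) ch = getc(in);        /* skip the opening quote */
    for (;;) {
      if (quoted) {
        if (ch == EOF) return -1;     /* unterminated quoted field */
        if (ch == '"') {
          ch = getc(in);
          if (ch != '"') break;       /* closing quote; ch is the delimiter */
        }
      } else if (ch == ',' || ch == '\n' || ch == EOF) {
        break;                        /* end of an unquoted field */
      }
      if (len + 1 >= MAX_FIELD_LEN) return -1;  /* field too long */
      tokens[nfields][len++] = (char)ch;
      ch = getc(in);
    }
    tokens[nfields++][len] = '\0';
    if (ch == '\n' || ch == EOF) return nfields;
    if (ch != ',') return -1;         /* junk after a closing quote */
    ch = getc(in);
  }
}
```

    Calling readRecord(stdin, tokens) in a loop until it returns 0 or -1 would process the sample file one record at a time; for the second record it fills tokens[1] with the three lines b, c, d joined by newline characters. An empty line is reported as a record with one empty field, which may or may not be what you want.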