I have a CSV data file that contains the following data:
H1,H2,H3
a,"b
c
d",e
When I open it in Excel as a CSV file, the sheet shows the column headings H1, H2 and H3, and the column values are: a for H1, the multi-line value
b
c
d
for H2, and e for H3.
I need to parse this file with a C program and pick up the values in the same way.
However, the following code snippet does not work, because a column value can span multiple lines:
char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch;
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
  *pch = 0;
  strcpy(tokens[i++], ptok);
  ptok = pch+1;
}
strcpy(tokens[i++], ptok);
How can I modify this code snippet to accommodate multi-line column values? Please ignore the hard-coded sizes of the string buffers; this is test code for a POC. Rather than use a third-party library, I would like to do it the hard way, from first principles. Please help.
The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays which you are avoiding by using fixed-length strings and arrays. (The other complication is handling not well-formed CSV.)
Without those complications, the parsing is really quite simple:
(untested)
/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == ',' || ch == '\n' || ch == EOF) return ch;
    stringAppend(s, ch);
  }
}
/* Appends a quoted field to s and returns the delimiter.
* Assumes the open quote has already been read.
* If the field is not terminated, returns ERROR, which
* should be a value different from any character or EOF.
* The delimiter returned is the character after the closing quote
* (or EOF), which may not be a valid delimiter. Caller should check.
*/
int readQuotedField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == EOF) return ERROR;
    if (ch == '"') {
      /* A doubled quote is a literal quote; anything else ends the field */
      ch = getc(stdin);
      if (ch != '"') return ch;
    }
    stringAppend(s, ch);
  }
}
/* Reads a single field into s and returns the following delimiter,
* which might be invalid.
*/
int readField(struct String* s) {
  stringClear(s);
  int ch = getc(stdin);
  if (ch == '"') return readQuotedField(s);
  /* Empty field: the first character is already the delimiter */
  if (ch == ',' || ch == '\n' || ch == EOF) return ch;
  stringAppend(s, ch);
  return readSimpleField(s);
}
/* Reads a single row into row and returns the following delimiter,
* which might be invalid.
*/
int readRow(struct Row* row) {
  struct String field = {0};
  rowClear(row);
  /* Make sure there is at least one field */
  int ch = getc(stdin);
  if (ch != '\n' && ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readField(&field);
      rowAppend(row, &field);
    } while (ch == ',');
  }
  return ch;
}
/* Reads an entire CSV file into table.
* Returns true if the parse was successful.
* If an error is encountered, returns false. If the end-of-file
* indicator is set, the error was an unterminated quoted field;
* otherwise, the next character read will be the one which
* triggered the error.
*/
bool readCSV(struct Table* table) {
  tableClear(table);
  struct Row row = {0};
  /* Make sure there is at least one row */
  int ch = getc(stdin);
  if (ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readRow(&row);
      tableAppend(table, &row);
    } while (ch == '\n');
  }
  return ch == EOF;
}
The above is "from first principles" -- it does not even use standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although it's a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.
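As a rough illustration of that first step, here is a minimal sketch of the variable-length string and row helpers the parser relies on. The struct layouts, the growth policy, the pointer-taking signatures, and the extra `stringFree` function are all assumptions made for this example; the code above only names `struct String`, `struct Row`, `stringAppend`, `stringClear`, `rowAppend` and `rowClear` without defining them. Allocation failures simply abort to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a heap buffer that is NUL-terminated
 * as soon as anything has been appended. */
struct String {
    char  *data;  /* NULL until the first append */
    size_t len;   /* characters stored, excluding the NUL */
    size_t cap;   /* bytes allocated for data */
};

/* Empties the string but keeps the allocation for reuse. */
void stringClear(struct String *s) {
    assert(s != NULL);
    s->len = 0;
    if (s->data != NULL) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full. */
void stringAppend(struct String *s, int ch) {
    assert(s != NULL);
    if (s->len + 2 > s->cap) {  /* need room for ch and the NUL */
        size_t newCap = s->cap ? s->cap * 2 : 16;
        char *p = realloc(s->data, newCap);
        if (p == NULL) abort();
        s->data = p;
        s->cap = newCap;
    }
    s->data[s->len++] = (char)ch;
    s->data[s->len] = '\0';
}

/* Releases the buffer and resets the string to its empty state. */
void stringFree(struct String *s) {
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}

/* Hypothetical layout: a growable array of owned field copies. */
struct Row {
    char **fields;
    size_t count;
    size_t cap;
};

/* Frees the stored fields but keeps the pointer array for reuse. */
void rowClear(struct Row *r) {
    for (size_t i = 0; i < r->count; i++) free(r->fields[i]);
    r->count = 0;
}

/* Stores a copy of s, so the caller can reuse s for the next field. */
void rowAppend(struct Row *r, const struct String *s) {
    if (r->count == r->cap) {
        size_t newCap = r->cap ? r->cap * 2 : 8;
        char **p = realloc(r->fields, newCap * sizeof *p);
        if (p == NULL) abort();
        r->fields = p;
        r->cap = newCap;
    }
    char *copy = malloc(s->len + 1);
    if (copy == NULL) abort();
    if (s->len > 0) memcpy(copy, s->data, s->len);
    copy[s->len] = '\0';
    r->fields[r->count++] = copy;
}
```

A `struct Table` could follow the same growable-array pattern. Because `rowClear` frees the stored fields in this sketch, a matching `tableAppend` would have to copy the row or take ownership of its fields before the row is reused.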