I have a file with content similar to below:
Really my data is here, and I think its really
cool. Somewhere, i want to break on some really
awesome data. Please let me really explain what is going
'\n'
on. You are amazing. Something is really awesome.
Please give me the stuffs.
I would like to create an array of string pointers to the strings between the delimiting words.
char **strings:
my data is here, and I think its
cool. Somewhere, i want to break on some
awesome data. Please let me
explain what is going'\n'on. You are amazing. Something is
awesome.'\n'Please give me the stuffs.
Code Attempted:
char *filedata = malloc(fileLength);
fread(filedata, end, 1, fp); //ABC
size_t stringCount = 8;
size_t idx = 0;
char **data = malloc(stringCount * sizeof(*packets));
if(!data) {
    fprintf(stderr, "There was an error");
    return 1;
}
fread(data, end, 1, text);
char *stuff = strtok(data, "really");
while(stuff) {
    data[idx++] = strdup(stuff);
    s = strtok(NULL, "stuff");
    if(idx >= stringCount) {
        stringCount *= 2;
        void *tmp = realloc(stuff, stringCount * sizeof(*stuff));
        if(!tmp) {
            perror("Unable to make a larger string list");
            stringCount /= 2;
            break;
        }
        stuff = tmp;
    }
}
This gives me roughly what I'm looking for, but it delimits on the individual letters rather than on the word itself.
There are some subtle difficulties in your goal of tokenizing a "file" on the word "really". What are they? Text files are generally read one line at a time and, if an entire file of lines is stored, it is kept as a number of pointers, each pointing to the beginning of a line. That means if you take a general line-oriented approach to reading the file, your tokens (beginning at the start of the file, or with the word "really") may span multiple lines. So to tokenize, you would need to combine multiple lines.
Alternatively, you could read the entire file into a single buffer and then use strstr to parse for your delimiter "really", but you will need to ensure the buffer holding the file is nul-terminated to avoid undefined behavior in the final call to strstr. (Normally, reading an entire file into a buffer does not result in a nul-terminated buffer.)
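As for why strstr is suggested rather than the strtok from your attempt: strtok treats its second argument as a set of single-character delimiters, not as a word, so "really" splits on any of the letters r, e, a, l, or y. A minimal sketch that makes the difference visible (the helper names count_strtok_pieces and count_strstr_hits are made up for this illustration):

```c
#include <stdio.h>
#include <string.h>

/* Count the pieces strtok() produces. Note: s is modified in place. */
size_t count_strtok_pieces (char *s, const char *delims)
{
    size_t n = 0;

    /* every character in delims is a separate delimiter */
    for (char *tok = strtok (s, delims); tok; tok = strtok (NULL, delims))
        n++;

    return n;
}

/* Count non-overlapping occurrences of the whole string word in s. */
size_t count_strstr_hits (const char *s, const char *word)
{
    size_t n = 0, len = strlen (word);

    if (!len)                   /* an empty word would never advance */
        return 0;

    for (const char *p = strstr (s, word); p; p = strstr (p + len, word))
        n++;

    return n;
}
```

For the string "my really cool data", count_strtok_pieces with "really" reports 5 pieces ("m", " ", " coo", " d" and "t"), while count_strstr_hits finds the single occurrence of the word.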
That said, even with strstr you will have to effectively do a manual parse of the contents of the file. You will need to keep three pointers: a start pointer to the beginning of the token, a pointer used in searching for your delimiter (to handle cases where the delimiter found is a lesser-included substring of a larger word), and finally an end pointer to mark the end of the token.
The scheme is fairly straightforward: your first token begins at the beginning of the file, and each subsequent token begins with the word "really". So you scan forward to find " really" (note the space before " really"), set the end pointer to the beginning of the delimiter " really", copy the token to a buffer, /* do stuff with token */, free (token);, update your start pointer to the beginning of "really", set your general parsing pointer to one past "really", and repeat until "really" is not found. When you exit the parsing loop, you still have to /* do stuff */ with the final token.
You can also decide what to do with the '\n' contained within each token. For output purposes below, they are simply overwritten with ' '. (You can add any additional criteria you like, such as eliminating any trailing or intervening whitespace caused by the newline replacement; that is left to you.)
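If you do want to tidy up the whitespace, one possible approach is to collapse every run of whitespace (newlines included) into a single space and trim both ends, in place. This is only a sketch; squeeze_ws is a hypothetical helper name, not something used in the program below:

```c
#include <ctype.h>
#include <string.h>

/* Collapse each run of whitespace in s to one ' ' and trim the ends.
 * Works in place (the result is never longer than the input). Returns s.
 */
char *squeeze_ws (char *s)
{
    char *rd = s, *wr = s;      /* read and write positions */
    int pending = 0;            /* whitespace seen since last written char */

    for (; *rd; rd++) {
        if (isspace ((unsigned char)*rd))
            pending = 1;        /* remember it, write nothing yet */
        else {
            if (pending && wr != s)     /* no leading space */
                *wr++ = ' ';
            pending = 0;
            *wr++ = *rd;
        }
    }
    *wr = '\0';                 /* pending trailing whitespace is dropped */

    return s;
}
```

Calling squeeze_ws (token) on a token before using it would replace the separate '\n' to ' ' loop.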
Putting it all together, you could do something similar to the following, where the read of the file contents into a nul-terminated buffer is handled by the function read_file() and the rest of the tokenizing is simply handled in main(), e.g.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
char *read_file (char* fname, size_t *nbytes)
{
    long bytes = 0;
    char* file_content;
    FILE *file = fopen(fname, "rb");

    if (!file)                          /* validate file open for reading */
        return NULL;

    if (fseek (file, 0, SEEK_END) ||            /* fseek end of file */
        (bytes = ftell (file)) == -1 ||         /* get number of bytes */
        fseek (file, 0, SEEK_SET)) {            /* fseek beginning of file */
        fprintf (stderr, "error: unable to determine file length.\n");
        fclose (file);                          /* don't leak the stream */
        return NULL;
    }

    /* allocate memory for file (+1 for the nul-terminating character) */
    if (!(file_content = malloc (bytes + 1))) { /* allocate/validate memory */
        perror ("malloc - virtual memory exhausted");
        fclose (file);
        return NULL;
    }

    /* read all data into buffer in single call to fread */
    if (fread (file_content, 1, (size_t)bytes, file) != (size_t)bytes) {
        fprintf (stderr, "error: failed to read %ld-bytes from '%s'.\n",
                    bytes, fname);
        free (file_content);                    /* release buffer on failure */
        fclose (file);
        return NULL;
    }
    fclose (file);                      /* close file */

    file_content[bytes] = 0;    /* nul terminate - to allow strstr use */
    *nbytes = (size_t)bytes;    /* update nbytes making size available */

    return file_content;        /* return pointer to caller */
}
int main (int argc, char **argv) {

    size_t nbytes;
    char *content;

    if (argc < 2) {     /* validate required argument given */
        fprintf (stderr, "error: insufficient input. filename req'd.\n");
        return 1;
    }

    if ((content = read_file (argv[1], &nbytes))) { /* read/validate */
        char *sp = content,                 /* start pointer for token */
             *p = sp,                       /* pointer for parsing token */
             *ep = p;                       /* end pointer one past end of token */
        /* delim is an array, not a pointer, so sizeof gives the string
         * size (sizeof on a char* would give the size of the pointer).
         */
        const char delim[] = " really";         /* delimiter */
        const size_t dlen = sizeof delim - 1;   /* length without the nul */

        while ((ep = strstr (p, delim))) {  /* while delimiter found */
            /* character following the delimiter (cast for ctype.h use) */
            unsigned char next = (unsigned char)ep[dlen];

            if (!next ||                    /* if at end of buffer */
                isspace (next) ||           /* or next isspace */
                ispunct (next)) {           /* or next ispunct */
                /* delimiter found */
                size_t tlen = ep - sp;              /* get token length */
                char *token = malloc (tlen + 1),    /* allocate for token */
                     *tp = token;                   /* pointer to token */

                if (!token) {                       /* validate allocation */
                    perror ("malloc-token");
                    exit (EXIT_FAILURE);
                }

                memcpy (token, sp, tlen);           /* copy to token */
                *(token + tlen) = 0;                /* nul-terminate */

                while (*tp) {               /* replace '\n' with ' ' */
                    if (*tp == '\n')
                        *tp = ' ';
                    tp++;
                }

                printf ("\ntoken: %s\n", token);    /* output token */

                /* do stuff with token */

                free (token);               /* free token memory */
                sp = ep + 1;    /* advance start to beginning of next token */
            }
            p = ep + dlen;      /* advance pointer to one past "really" */
        }
        p = sp;         /* use p to change '\n' to ' ' in last token */
        while (*p) {    /* replacement loop */
            if (*p == '\n')
                *p = ' ';
            p++;
        }
        printf ("\ntoken: %s\n", sp);

        /* do stuff with last token */

        free (content);     /* free buffer holding file */
    }

    return 0;
}
Example Input File
$ cat dat/breakreally.txt
my data is here, and I think its really
cool. Somewhere, i want to break on some really
awesome data. Please let me really explain what is going
on. You are amazing.
Example Use/Output
$ ./bin/freadbreakreally dat/breakreally.txt
token: my data is here, and I think its
token: really cool. Somewhere, i want to break on some
token: really awesome data. Please let me
token: really explain what is going on. You are amazing.
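Since your question asks for an array of string pointers (a char **) rather than printed output, here is a rough sketch of how the same three-pointer scan could collect the tokens instead. The names split_on_word, copy_span and free_tokens are hypothetical helpers made up for this example, and it assumes, as above, that the delimiter begins with a single separator character (the space in " really") so that each later token starts at "really":

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Copy the n bytes at s into a new nul-terminated string (NULL on failure). */
static char *copy_span (const char *s, size_t n)
{
    char *t = malloc (n + 1);
    if (t) {
        memcpy (t, s, n);
        t[n] = '\0';
    }
    return t;
}

/* Free an array returned by split_on_word(). */
void free_tokens (char **tok, size_t n)
{
    while (n--)
        free (tok[n]);
    free (tok);
}

/* Split nul-terminated buf on whole-word matches of delim (e.g. " really").
 * Returns an allocated array of allocated strings and sets *ntok, or
 * returns NULL on allocation failure or an empty delim.
 */
char **split_on_word (const char *buf, const char *delim, size_t *ntok)
{
    size_t dlen = strlen (delim), used = 0, avail = 8;
    const char *sp = buf, *p = buf, *ep;
    char **tok = malloc (avail * sizeof *tok);

    if (!tok || !dlen) {
        free (tok);
        return NULL;
    }

    for (;;) {
        ep = strstr (p, delim);
        if (ep) {
            unsigned char next = (unsigned char)ep[dlen];
            if (next && !isspace (next) && !ispunct (next)) {
                p = ep + dlen;      /* part of a longer word, keep scanning */
                continue;
            }
        }
        if (used == avail) {        /* grow the pointer array when full */
            void *tmp = realloc (tok, 2 * avail * sizeof *tok);
            if (!tmp) {
                free_tokens (tok, used);
                return NULL;
            }
            tok = tmp;
            avail *= 2;
        }
        /* token runs up to the delimiter, or to the end of the buffer */
        size_t len = ep ? (size_t)(ep - sp) : strlen (sp);
        if (!(tok[used] = copy_span (sp, len))) {
            free_tokens (tok, used);
            return NULL;
        }
        used++;
        if (!ep)                    /* that was the final token */
            break;
        sp = ep + 1;                /* next token starts at "really" */
        p = ep + dlen;              /* resume search one past "really" */
    }
    *ntok = used;

    return tok;
}
```

In main() you could then call something like char **tok = split_on_word (content, " really", &n); use tok[0] through tok[n-1], and release everything with free_tokens (tok, n). The newlines inside each token are left untouched here, so handle them as you see fit.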
Look things over and let me know if you have any questions.