
Using regex for checking a .dat file


I am reading a file using fgets and need to check each line against a regex. If a line contains a non-alphanumeric character, the program should exit and display the line number and the "bad" character. Instead, it exits before it reaches the "bad" character. Here is my .dat file:

howard jim dave 
joe
(
Maggie

My output of the program is:

file opened
Digit: howard jim dave 
is not alphanumeric on line: 1
Exiting program!
File closed

The program should exit on line 3, but as you can see it exits on line 1.

Here is my regex which is in my main.h file:

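#ifndef MAIN_H
#define MAIN_H

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef  __cplusplus
extern "C" {
#endif

#define BUFF 1024
#define to_find "^[a-zA-Z0-9]+$"

#ifdef  __cplusplus
}
#endif
#endif /* MAIN_H */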

Here is my fileCheck.c

#include "main.h"

int fileCheck(FILE *fp)
{

    int ret_val;
    int line_count = 0;
    char file[BUFF];
    regex_t regex;

    if (regcomp(&regex, to_find, REG_EXTENDED) != 0)
    {
        fprintf(stderr, "Failed to compile regex '%s'\n", to_find);
        return EXIT_FAILURE;
    }

    if (fp != NULL)
    {
        while (fgets(file, BUFF, fp))
        {
            line_count++;

            if ((ret_val = regexec(&regex, file, 0, NULL, 0)) != 0)
            {
                printf("Digit: %s is not alphanumeric on line: %d\n", file, line_count);
                printf("Exiting program!\n");
                return EXIT_FAILURE;
            }
        }
    }

    regfree(&regex);      /* release the compiled pattern */
    return EXIT_SUCCESS;  /* every line matched */
}

I am not sure whether the "\n" character is the problem. I do not think it is. I am well aware of isalnum(), but I am tasked with using a regex. What would be a possible solution for this problem? Thank you for your suggestions.

EDIT: I wanted to mention that when I used fscanf instead of fgets, the above regex worked just fine. I changed to fgets because I need to count each line. If I am correct, fscanf with %s skips the newline character along with other whitespace, so I need some other way to count newlines. Is it possible to count newlines using fscanf? My original file read loop was:

while (fscanf(fp, "%1023s", file) != EOF)
{
    line_count++;  /* counts whitespace-separated words, not lines */
    if (regexec(&regex, file, 0, NULL, 0) != 0)
    {
        printf("%s%d wrong:\n", file, line_count);
        return EXIT_FAILURE;
    }
}

Solution

  • The line "howard jim dave " contains whitespace, and fgets() also keeps the trailing newline, so ^[a-zA-Z0-9]+$ cannot match line 1.

    Edit3:
    The reason I focused on a pattern that matches only valid lines was that you seemed to
    be using a simple test scenario that will later become more complex.
    However, if this is all you need it for, the real solution is to just look for
    a character that is neither alphanumeric nor whitespace.
    If the regex flavor you are using requires a match from beginning to end,
    this won't work. regexec() does not: it searches for a match anywhere in the string.

    Note that POSIX bracket expressions treat a backslash literally, so \s is not
    understood there. Use character classes or C string escapes instead:

      #define to_find "[^[:alnum:][:space:]]"
         or,
      #define to_find "[^a-zA-Z0-9 \t\f\r\n]"
    
       . . .
         Then, further down, a successful match means a non-alphanumeric character was found:
    
      if (regexec(&regex, file, 0, NULL, 0) == 0)
      {
          printf("Digit: %s is not alphanumeric on line: %d\n", file, line_count);
          printf("Exiting program!\n");
          return EXIT_FAILURE;
      }
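
To display the "bad" character as well as the line number, the match offset that regexec() reports can be used to index into the line. The following is only a minimal sketch, assuming POSIX <regex.h>. The helper name find_bad_char and its return convention are made up for illustration, and the POSIX classes [:alnum:] and [:space:] are used because \s is not portable inside a bracket expression.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): scans fp line by line and
 * returns the 1-based number of the first line that contains a character which
 * is neither alphanumeric nor whitespace, storing that character in *bad.
 * Returns 0 if every line is clean and -1 if the pattern fails to compile. */
int find_bad_char(FILE *fp, char *bad)
{
    regex_t regex;
    regmatch_t m[1];
    char line[1024];
    int line_count = 0;

    if (regcomp(&regex, "[^[:alnum:][:space:]]", REG_EXTENDED) != 0)
        return -1;

    while (fgets(line, sizeof line, fp)) {
        line_count++;
        /* A successful match means an offending character was found. */
        if (regexec(&regex, line, 1, m, 0) == 0) {
            *bad = line[m[0].rm_so];   /* rm_so is the offset of the match */
            regfree(&regex);
            return line_count;
        }
    }

    regfree(&regex);
    return 0;
}
```

A caller could print the returned line number and the character stored in *bad before exiting. One caveat: a line longer than the buffer is returned by fgets() in several pieces, which would throw the line count off.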
    

    Edit2:
    Is this a POSIX engine? What error code does regcomp() return? You should set REG_EXTENDED as one of the cflags parameters.
    Unfortunately, the (?: pattern ) construct is an extension that POSIX regcomp() is not required to support, and neither is \s.
    If the pattern fails to compile, use a plain group ( pattern ) and [[:space:]] instead.

    Might as well throw the kitchen sink at it
    REG_EXTENDED | REG_NEWLINE

    Try those flags and plop
    "^\\s*[a-zA-Z0-9]+(?:\\s+[a-zA-Z0-9]+)*\\s*$" directly into regcomp()

    This can help with the error code:

     int res_compile = 0;
     if ( (res_compile=regcomp(&regex, to_find, REG_EXTENDED) ) != 0)
     {
       fprintf(stderr, "Failed to compile regex '%s'\nError code:  %d\n", to_find, res_compile);
     }
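
The numeric code on its own is hard to interpret. POSIX also provides regerror(), which turns the code into a readable message. A small sketch, assuming POSIX <regex.h>; the wrapper name compile_or_report is hypothetical:

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical wrapper: compiles pattern with REG_EXTENDED and, on failure,
 * prints the message regerror() produces for the error code.
 * Returns whatever regcomp() returned (0 on success). */
int compile_or_report(regex_t *regex, const char *pattern)
{
    int rc = regcomp(regex, pattern, REG_EXTENDED);

    if (rc != 0) {
        char msg[256];
        regerror(rc, regex, msg, sizeof msg);
        fprintf(stderr, "Failed to compile regex '%s': %s\n", pattern, msg);
    }
    return rc;
}
```

On success the caller still owns the compiled pattern and should release it with regfree() when finished.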
    

    Original: Maybe you need

     # ^\s*[a-zA-Z0-9]+(?:\s+[a-zA-Z0-9]+)*\s*$
    
     ^ 
     \s* 
     [a-zA-Z0-9]+ 
     (?: \s+ [a-zA-Z0-9]+ )*
     \s* 
     $
    

    Or, in a flavor that supports \A and \z (PCRE, for example; POSIX does not):

     # \A[^\S\r\n]*[a-zA-Z0-9]+(?:[^\S\r\n]+[a-zA-Z0-9]+)*\s*\z
    
     \A 
     [^\S\r\n]* 
     [a-zA-Z0-9]+ 
     (?: [^\S\r\n]+ [a-zA-Z0-9]+ )*
     \s*
     \z