c++ regex c++11 standards-compliance autosar

AUTOSAR standard-compliant way to use regex


I need to parse a URI-like string. The URI format is specific to the project and has the form "scheme://path/to/file", where the path must be a syntactically correct file path from the filesystem's point of view. For this purpose I used std::regex with the pattern R"(^(r[o|w])\:\/\/(((?!\$|\~|\.{2,}|\/$).)+)$)".

It works fine, but the code analyzer complains that it is not compliant because the $ character does not belong to the C++ Language Standard basic source character set:

AUTOSAR C++14 A2-3-1 (Required) Only those characters specified in the C++ Language Standard basic source character set shall be used in the source code.

Exception to this rule (according to Autosar Guidelines):

It is permitted to use other characters inside the text of a wide string and a UTF-8 encoded string literal.

wchar_t is prohibited by another rule. It does work with a UTF-8 string literal, but that looks ugly and unreadable in the code, and I'm afraid it is not safe.

Could someone help me with a workaround? Or, if std::regex is not the best solution here, what would be better?

Are there any other drawbacks to using a UTF-8 string literal?

P.S. I need $ to make sure (during parsing) that the path is not a directory and that it does not contain any of /../, ~, $, so I can't just skip it.


Solution

  • I feel that making the code worse just to satisfy an analyser is counterproductive and most likely violates the spirit of the guidelines. I'm therefore intentionally ignoring workarounds that build the regex string in a convoluted manner: what you wrote is the best way to express such a regex.

    Could someone help me with a workaround? Or, if std::regex is not the best solution here, what would be better?

    Option A: Write a simple validation function:

    I'm actually surprised that such strict guidelines even allow regexes in the first place. They are notoriously hard to audit, debug, and maintain.

    You could easily express the same logic with actual code, which would not only satisfy the analyser, but be more aligned with the spirit of the guidelines. On top of that it'll compile faster and probably run faster as well.

    Something along these rough lines, based on a cursory reading of your regex (please don't use this without running it through a battery of tests; I sure didn't):

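    #include <array>
    #include <string_view>
    
    // Requires C++20 (std::string_view::starts_with).
    // Strips a leading "ro://" or "rw://" from path; returns false if neither is present.
    bool check_and_remove_path_prefix(std::string_view& path) {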
      constexpr std::array<std::string_view, 2> valid_prefixes = { 
        R"(ro://)", 
        R"(rw://)"
      };
    
      for(auto p: valid_prefixes) {
        if(path.starts_with(p)) {
          path.remove_prefix(p.size());
          return true;
        }
      }
      return false;
    }
    
    bool is_valid_path_elem_char(char c) {
      // This matches your regex, but is probably wrong, as it will accept a bunch of control characters.
      // N.B. \x24 is the dollar sign character
      return c != '~' && c != '\x24' && c != '\r' && c != '\n';
    }
     
    bool is_valid_path(std::string_view path) {
      if(!check_and_remove_path_prefix(path)) { return false; }
    
      char prev_c = '\0';
      bool current_segment_empty = true;
      for(char c : path) {
        // Disallow two or more consecutive periods
        if( c == '.' && prev_c == '.') { return false; }
    
        // Disallow empty segments
        if(c == '/') {
          if(current_segment_empty) { return false; }
          current_segment_empty = true;
        }
        else {
          if(!is_valid_path_elem_char(c)) { return false; }
          current_segment_empty = false;
        }
        
        prev_c = c;
      }
    
      return !current_segment_empty;
    }
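
    Since the question is tagged c++11 and the rule comes from the AUTOSAR C++14 guidelines, note that std::string_view needs C++17 and starts_with needs C++20. A minimal sketch of the same checks using only std::string might look like the following. The names has_valid_scheme and is_valid_uri_cxx14 are made up for illustration, and like the code above it has not been validated against the original regex beyond a few hand-picked cases.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `uri` starts with "ro://" or "rw://".
// On success, `offset` is set to the index of the first path character.
bool has_valid_scheme(const std::string& uri, std::size_t& offset) {
  const std::string read_only("ro://");
  const std::string read_write("rw://");
  // compare() yields non-zero when `uri` is shorter than the prefix.
  if (uri.compare(0U, read_only.size(), read_only) == 0 ||
      uri.compare(0U, read_write.size(), read_write) == 0) {
    offset = read_only.size();  // both prefixes have the same length
    return true;
  }
  return false;
}

// Hypothetical C++11/14 counterpart of is_valid_path above.
bool is_valid_uri_cxx14(const std::string& uri) {
  std::size_t pos = 0U;
  if (!has_valid_scheme(uri, pos)) { return false; }

  char prev_c = '\0';
  bool current_segment_empty = true;
  for (; pos < uri.size(); ++pos) {
    const char c = uri[pos];

    // Disallow two or more consecutive periods.
    if (c == '.' && prev_c == '.') { return false; }

    if (c == '/') {
      // Disallow empty segments such as "a//b" or a leading '/'.
      if (current_segment_empty) { return false; }
      current_segment_empty = true;
    } else {
      // '\x24' is the dollar sign, spelled as an escape to satisfy A2-3-1.
      if (c == '~' || c == '\x24' || c == '\r' || c == '\n') { return false; }
      current_segment_empty = false;
    }
    prev_c = c;
  }

  // A trailing '/' (directory) or an empty path leaves the last segment empty.
  return !current_segment_empty;
}
```

    With this sketch, "ro://path/to/file" is accepted, while "ro://path/to/dir/", "ro://a/../b" and "http://a" are rejected.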
    

    Option B: Don't bother with the check

    It's hard for us to tell whether this option is on the table for you, but for all intents and purposes, the distinction between a badly formed path and a well-formed path that does not point to a valid file is moot.

    So just use the path as if it were valid; you should be handling the errors that would result from a badly formed path anyway.
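
    As a rough illustration of that approach, the sketch below skips syntactic validation and reports failure when the file cannot be opened or read. The helper name try_read_file is hypothetical, and it assumes the scheme prefix has already been stripped from the string.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the whole file at `path` into `out`.
// Returns false if the file cannot be opened or a read error occurs,
// which also covers malformed paths that name no existing file.
bool try_read_file(const std::string& path, std::string& out) {
  std::ifstream in(path, std::ios::binary);
  if (!in.is_open()) { return false; }

  out.assign(std::istreambuf_iterator<char>(in),
             std::istreambuf_iterator<char>());
  return !in.bad();
}
```

    The caller then treats a false return the same way regardless of whether the path was malformed or simply pointed at nothing.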