Tags: c#, regex, numbers, words, roman-numerals

Regex to match numbers written as words, digits or roman numerals


I'm trying to match a number written as a word, as digits, or as a roman numeral. Here are some samples:

CHAPTER 1
CHAPTER 2
CHAPTER THREE
CHAPTER IV
CHAPTER TWENTY TWO

I'm pretty bad at regex; here's what I've got so far:

(CHAPTER (([0-9]+)|(/* words - see below */)|( /* roman - see below */)))

// words
(TWENTY|THIRTY|etc)?( |-)?(ONE|TWO|THREE|FOUR|FIVE|etc)?

// roman
(I|II|III|IV|V|etc)+

The pattern matches CHAPTER 1, CHAPTER 2 and CHAPTER THREE, but tries to match IV as a word (I'm guessing it's matching FIVE somehow?). TWENTY TWO doesn't match at all.

Can anyone help? Here's the full regex:

(CHAPTER (
([0-9]+)|
((TWENTY|THIRTY)?( |-)?(ONE|TWO|THREE|FOUR|FIVE)?)|
((I|II|III|IV|V)+)
))

NOTE:

The point of this is to convert these text representations to actual integers. I have methods to do this in each case, so I do need to distinguish between the various cases.


Solution

  • Since you already have parsers, which hopefully fail gracefully when given something that superficially looks like valid roman or text input but isn't, you could just call them all and see which one succeeds.
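
    A minimal sketch of that "call them all" idea follows. `TryParseRoman` and `TryParseWords` are hypothetical stand-ins for the parsers you already have (neither validates strictly, and the word list is deliberately tiny); only `int.TryParse` is a real framework call. The inline `out int` declarations assume C# 7 or later.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class ChapterNumber
    {
        static readonly Dictionary<char, int> RomanDigits = new Dictionary<char, int>
        {
            { 'I', 1 }, { 'V', 5 }, { 'X', 10 }, { 'L', 50 },
            { 'C', 100 }, { 'D', 500 }, { 'M', 1000 }
        };

        // Deliberately small vocabulary, for illustration only.
        static readonly Dictionary<string, int> Words = new Dictionary<string, int>
        {
            { "ONE", 1 }, { "TWO", 2 }, { "THREE", 3 }, { "FOUR", 4 }, { "FIVE", 5 },
            { "SIX", 6 }, { "SEVEN", 7 }, { "EIGHT", 8 }, { "NINE", 9 }, { "TEN", 10 },
            { "TWENTY", 20 }, { "THIRTY", 30 }
        };

        // Hypothetical stand-in for an existing roman numeral parser.
        // Lenient: accepts any sequence of roman digits, including malformed ones like "IIII".
        static bool TryParseRoman(string s, out int value)
        {
            value = 0;
            if (string.IsNullOrEmpty(s)) return false;
            for (int i = 0; i < s.Length; i++)
            {
                if (!RomanDigits.TryGetValue(s[i], out int digit))
                {
                    value = 0;
                    return false;
                }
                // A digit smaller than its right-hand neighbour is subtracted (IV = 4).
                bool smallerThanNext = i + 1 < s.Length
                    && RomanDigits.TryGetValue(s[i + 1], out int next)
                    && digit < next;
                value += smallerThanNext ? -digit : digit;
            }
            return true;
        }

        // Hypothetical stand-in for an existing word parser.
        // Naive: simply sums the known words, so it does not check word order.
        static bool TryParseWords(string s, out int value)
        {
            value = 0;
            string[] parts = s.Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0) return false;
            foreach (string word in parts)
            {
                if (!Words.TryGetValue(word, out int n))
                {
                    value = 0;
                    return false;
                }
                value += n;
            }
            return true;
        }

        // Tries each parser in turn; the first one that succeeds wins.
        public static bool TryParse(string s, out int value)
        {
            return int.TryParse(s, out value)
                || TryParseRoman(s, out value)
                || TryParseWords(s, out value);
        }
    }
    ```

    With these stand-ins, `ChapterNumber.TryParse("TWENTY TWO", out var n)` returns true with `n == 22`, while `"FOO"` is rejected by all three parsers. This approach does not tell you which parser succeeded unless you record it yourself.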

    If you don't want to call them all, this regex should identify which parser to pass each input to.

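    // Alternatives are tried left to right: digits first, then roman numerals,
    // then any run of capital letters and spaces as a spelled-out number.
    var re = new Regex(
        @"CHAPTER (?:(?<arabic>\d+)|(?<roman>[IVXLCDM]+)|(?<text>[A-Z ]+))");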
    

    called, for example, as

    var input = @"CHAPTER 1
    CHAPTER 2
    CHAPTER THREE
    CHAPTER IV
    CHAPTER TWENTY TWO";
    
    foreach (Match match in re.Matches(input))
    {
        if (match.Groups["arabic"].Success)
        {
            Console.WriteLine("Pass {0} to Arabic parser", match.Groups["arabic"].Value);
        }
        else if (match.Groups["roman"].Success)
        {
            Console.WriteLine("Pass {0} to Roman parser", match.Groups["roman"].Value);
        }
        else if (match.Groups["text"].Success)
        {
            Console.WriteLine("Pass {0} to Text parser", match.Groups["text"].Value);
        }
    }
    

    which results in

    Pass 1 to Arabic parser
    Pass 2 to Arabic parser
    Pass THREE to Text parser
    Pass IV to Roman parser
    Pass TWENTY TWO to Text parser
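
    One caveat with the pattern above: because alternatives are tried left to right, `[IVXLCDM]+` also grabs a roman-looking prefix of an ordinary word (for `CHAPTER MILLION` the roman group matches `MILLI`), and `[A-Z ]+` stops at a hyphen. A hedged variant, assuming hyphenated words such as `TWENTY-TWO` should be accepted, adds a word boundary after the roman group and allows a single space or hyphen between words. `ChapterClassifier` and `Classify` are names made up for this sketch.

    ```csharp
    using System.Text.RegularExpressions;

    public static class ChapterClassifier
    {
        // \b after the roman group forces it to end at a word boundary, so a word
        // that merely starts with roman letters falls through to the text group.
        static readonly Regex Re = new Regex(
            @"CHAPTER (?:(?<arabic>\d+)|(?<roman>[IVXLCDM]+)\b|(?<text>[A-Z]+(?:[ -][A-Z]+)*))");

        // Returns "arabic:...", "roman:..." or "text:...", or null when nothing matches.
        public static string Classify(string line)
        {
            Match m = Re.Match(line);
            if (!m.Success) return null;
            foreach (string name in new[] { "arabic", "roman", "text" })
            {
                if (m.Groups[name].Success) return name + ":" + m.Groups[name].Value;
            }
            return null;
        }
    }
    ```

    The same left-to-right rule explains the behaviour in the question: every group in the words alternative is optional, so it can succeed on an empty match, and the roman alternative is never tried for `CHAPTER IV`.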