
A regex for matching "I. TEXT" headings in C#


I'm trying to parse a PDF to XML in C#, and I want to extract headings such as I. INTRODUCTION and II. PAGE LAYOUT, which are numbered with Roman numerals, from my file. I would like to write a regex that matches strings like these. I tried a couple of things, but they don't work. Can anyone help?


Solution

  • This should do what you need:

    [IVXLCDM]+\. [A-Z ]+

    As stated here:

    \. will match a literal period. The backslash is needed because an unescaped period is a special character in regular expression syntax (meaning match any character).

    On the other hand, if you want to make sure that the whole string consists of only a Roman numeral, a period, and a heading name, you might want to use this:

    ^[IVXLCDM]+\. [A-Z ]+$
    

    The ^ and $ are called anchors. The ^ requires the match to begin at the very start of the string, while the $ requires it to end at the very end of the string, so nothing may come before the numeral or after the heading name. Note that [IVXLCDM]+ accepts any run of these letters, not only well-formed numerals. The complete list of Roman numerals can be obtained from Wikipedia.
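
To show how the anchored pattern could be applied to text pulled out of the PDF, here is a minimal sketch. The class name HeadingExtractor, the method ExtractHeadings, and the named groups numeral and title are my own additions for illustration and are not part of the original pattern. It assumes the extracted text has one heading per line, and it uses RegexOptions.Multiline so that ^ and $ match at the start and end of each line rather than of the whole input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class HeadingExtractor
{
    // The anchored pattern from the answer, with two named groups added.
    // RegexOptions.Multiline lets ^ and $ match at each line boundary.
    private static readonly Regex HeadingPattern = new Regex(
        @"^(?<numeral>[IVXLCDM]+)\. (?<title>[A-Z ]+)$",
        RegexOptions.Multiline);

    // Returns (numeral, title) pairs for every line that looks like a heading.
    public static List<KeyValuePair<string, string>> ExtractHeadings(string text)
    {
        // In .NET, $ matches before '\n' only, so a trailing '\r' from
        // Windows line endings would block the match. Normalize first.
        string normalized = text.Replace("\r\n", "\n");

        var headings = new List<KeyValuePair<string, string>>();
        foreach (Match m in HeadingPattern.Matches(normalized))
        {
            string numeral = m.Groups["numeral"].Value;
            // [A-Z ]+ may pick up trailing spaces, so trim the title.
            string title = m.Groups["title"].Value.Trim();
            headings.Add(new KeyValuePair<string, string>(numeral, title));
        }
        return headings;
    }
}
```

Calling HeadingExtractor.ExtractHeadings on the extracted text gives you the numeral in Key and the heading name in Value, which you can then write out as XML elements. Because the period is escaped, a line such as MIX AND MATCH is not reported as a heading; with an unescaped period it would be, since . would accept the X. Lines like C. APPENDIX still match, because C is a valid character in the numeral class, so letter-numbered subsections may need extra filtering.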