I'm trying to parse a PDF to XML in C#, and I want to extract headings such as I. INTRODUCTION and II. PAGE LAYOUT, which are numbered with Roman numerals, from my file. I would like to write a regex to match strings like these. I tried a couple of things, but they don't work. Can anyone help?
This should do what you need:
[IVXLCDM]+\. [A-Z ]+
As stated here:
\. will match a period since the period character is a special character (meaning match any character) in regular expression syntax.
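As a rough illustration, here is a minimal C# sketch that applies the unanchored pattern to a block of text. The class name HeadingFinder, the method Find, and the sample string are made up for this example; they assume the PDF text has already been extracted into a string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingFinder
{
    // Roman numeral, a literal period (escaped as \.), a space,
    // then one or more upper-case letters or spaces.
    private static readonly Regex Heading = new Regex(@"[IVXLCDM]+\. [A-Z ]+");

    // Returns every substring of the text that looks like a heading.
    public static string[] Find(string text)
    {
        MatchCollection matches = Heading.Matches(text);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            // The character class allows spaces, so trim any trailing ones.
            result[i] = matches[i].Value.TrimEnd();
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample text standing in for the extracted PDF content.
        string sample = "I. INTRODUCTION\nSome body text.\nII. PAGE LAYOUT\nMore text.";
        foreach (string heading in Find(sample))
        {
            Console.WriteLine(heading);
        }
    }
}
```

Because the pattern is not anchored, it can also match in the middle of a longer line, which may or may not be what you want.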
On the other hand, if you want to make sure that the string contains only Roman numerals and a heading name, you might want to use this:
^[IVXLCDM]+\. [A-Z ]+$
The ^ and $ are called anchors. The ^ requires the match to start at the very beginning of the string, while the $ requires the match to end at the very end of the string.
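To show the effect of the anchors, here is a small C# sketch that tests whole strings against the anchored pattern. The names HeadingCheck and IsHeading and the sample lines are only illustrative; the sketch assumes you have already split the extracted text into individual lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingCheck
{
    // Same pattern as above, anchored so the whole string must be a heading.
    private static readonly Regex WholeLineHeading =
        new Regex(@"^[IVXLCDM]+\. [A-Z ]+$");

    public static bool IsHeading(string line)
    {
        return WholeLineHeading.IsMatch(line);
    }

    public static void Main()
    {
        // Hypothetical lines, one per string.
        string[] lines =
        {
            "I. INTRODUCTION",           // heading
            "II. PAGE LAYOUT",           // heading
            "See II. PAGE LAYOUT below", // extra text around the heading
            "I. Introduction"            // lower-case letters are not allowed
        };

        foreach (string line in lines)
        {
            Console.WriteLine(line + " -> " + IsHeading(line));
        }
    }
}
```

Only the first two lines are reported as headings; the other two are rejected because the anchors force the entire string to fit the pattern.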
The complete list of Roman numerals can be obtained from Wikipedia.