regex, text, data-cleaning

What is a regular expression to find three asterisks with whitespace on either side: " *** "?


My goal is to use a regular expression in order to discard the header and footer information from a Project Gutenberg UTF-8 encoded text file.

Each book contains a 'start line' like so:

[...]
Character set encoding: UTF-8

Produced by: Emma Dudding, John Bickers, Dagny and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***




Grimms’ Fairy Tales

By Jacob Grimm and Wilhelm Grimm
[...]

The footers look pretty similar:

Taylor, who made the first English translation in 1823, selecting about
fifty stories ‘with the amusement of some young friends principally in
view.’ They have been an essential ingredient of children’s reading ever
since.




*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***

Updated editions will replace the previous one--the old editions will
be renamed.

My idea is to use these triple asterisk markers to discard headers and footers, since such an operation is useful for any Gutenberg release.

What is a good way to do this with regex?


Solution

  • What is a good way to do this with regex?

    To match whitespace you can use ' ' (a literal space) or '\s' (note: \s matches any whitespace character, including \n, \r and \t).

    To match a literal *, you have to escape it as \*, since an unescaped * in regex means zero or more repetitions of the preceding element.

    To require three * in a row, you can either write the escape three times (\*\*\*) or use a quantifier: \*{3}

    So your regex could look like: \*{3}. This matches every run of three consecutive *.
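
As a quick check of the pieces above, here is a minimal Python sketch (Python is an assumption; the question does not name a language). It compares the bare \*{3} pattern with a stricter \s\*{3}\s variant that demands whitespace on both sides, as in the question title. The variable names are only illustrative.

```python
import re

line = "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***"

# \*{3} matches three literal asterisks in a row.
three_stars = re.compile(r"\*{3}")
print(three_stars.findall(line))   # ['***', '***']

# Requiring whitespace on BOTH sides is stricter: the markers at the very
# start and very end of the line have whitespace on one side only.
spaced = re.compile(r"\s\*{3}\s")
print(spaced.findall(line))        # []
print(spaced.findall("a *** b"))   # [' *** ']
```

So for the Gutenberg marker lines, the literal " *** " pattern from the title would not match; anchoring on the line start and end is likely the more useful approach.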

    To match a whole marker line, from the opening *** to the closing ***, as in the header and footer markers, you can extend the regex to:

    ^\*{3}[\w\W]*?\*{3}$
    
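    This means (with the multiline flag enabled, so that ^ and $ match at line boundaries):
    ^         - beginning of the line
    \*{3}     - match three *
    [\w\W]*?  - match any characters (alphanumeric and non-alphanumeric, including newlines), as few as possible
    \*{3}     - match three *
    $         - end of the line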
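
To try the same pattern outside regex101, a small Python sketch might look like the following. The sample string is an abridged version of the excerpts in the question, and re.MULTILINE is assumed to play the role of the multiline flag.

```python
import re

# Abridged sample built from the excerpts in the question.
sample = (
    "Produced by: Emma Dudding, John Bickers, Dagny and David Widger\n"
    "\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "\n"
    "Grimms’ Fairy Tales\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "\n"
    "Updated editions will replace the previous one\n"
)

# re.MULTILINE lets ^ and $ match at line boundaries, not only at the
# start and end of the whole string.
marker = re.compile(r"^\*{3}[\w\W]*?\*{3}$", re.MULTILINE)

marker_lines = [m.group(0) for m in marker.finditer(sample)]
for found in marker_lines:
    print(found)
```

On this sample the pattern finds the two marker lines. Because [\w\W] also matches newlines, a line that starts with *** but does not end with *** could make the match run on into later lines, so .*? may be a safer choice when each marker sits on a single line.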
    

    Test here: https://regex101.com/r/d8dcHf/1

    PS: I think this regex can be optimized or maybe a better one can be created.
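
Putting it together for the original goal (discarding header and footer), one possible sketch in Python follows. The helper name strip_gutenberg_boilerplate and the START_RE / END_RE patterns are hypothetical; they assume the markers read "*** START OF ... ***" and "*** END OF ... ***" on a single line, as in the excerpts in the question. Other releases may word the markers differently.

```python
import re

# Assumed marker wording, taken from the excerpts in the question.
# "." does not match newlines here, so each pattern stays on one line.
START_RE = re.compile(r"^\*{3} START OF .*? \*{3}$", re.MULTILINE)
END_RE = re.compile(r"^\*{3} END OF .*? \*{3}$", re.MULTILINE)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    start = START_RE.search(text)
    if start is None:
        return text
    # Only look for the END marker after the START marker.
    end = END_RE.search(text, start.end())
    if end is None:
        return text
    return text[start.end():end.start()].strip()


# Abridged sample built from the excerpts in the question.
book = (
    "Character set encoding: UTF-8\n"
    "\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "\n"
    "Grimms’ Fairy Tales\n"
    "\n"
    "By Jacob Grimm and Wilhelm Grimm\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "\n"
    "Updated editions will replace the previous one\n"
)

body = strip_gutenberg_boilerplate(book)
print(body)
```

Returning the input unchanged when a marker is missing is a design choice for this sketch; raising an error instead would make files with unexpected markers easier to spot.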