regex haskell encoding pcre

Haskell RegEx Matching on UTF8 file


I wrote this function:

module PdfParser (parseOptions) where

import Text.Regex.PCRE
import Data.List.Split

parseOptions :: String -> [String]
parseOptions s = splitOn "\n" (s =~ regex :: String)
  where 
    regex = "(?<=OPTIONS\n)((.|\n)*?)(?=INTERIEUR|INTÉRIEUR|EQUIPEMENTS DE SERIE)"

And this test:

module PdfParserSpec (spec) where

import Test.Hspec
import Test.QuickCheck
import PdfParser(parseOptions)

spec :: Spec
spec =  do
  describe "PdfParser (parseOptions)" $ do
    it "return List of options" $ do
      referencialText <- readFile "test/assets/referential.txt"
      parseOptions referencialText `shouldBe` [
        "Peinture métallisée 550 €"
        ,"Jantes alliage 17\" Viva Stella [RDIF21] 300 €"
        ,"Chargeur sans fil 250 €"
        ,"Roue de secours tôle [RSEC01] 150 €"]

But when I read the text file, all my accented characters (é, è, etc.) are replaced by escapes such as \233, and then my regex doesn't work.

Test result:

 test/PdfParserSpec.hs:12:7: 
  1) PdfParser, PdfParser (parseOptions), return List of options
       expected: ["Peinture m\233tallis\233e 550 \8364","Jantes alliage 17\" Viva Stella [RDIF21] 300 \8364","Chargeur sans fil 250 \8364","Roue de secours t\244le [RSEC01] 150 \8364"]
        but got: ["s alliage 17\" Viva Stella [RDIF21] 300 \8364","Chargeur sans fil 250 \8364","Roue de secours t\244le [RSEC01] 150 \8364","INT\201RIEUR","Sellerie Zen (Au lieu de Selleri"]
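
A side note on the \233 and \8364 in this output: they are most likely not corruption. Haskell's `show` (which Hspec uses to print values) escapes every non-ASCII character as its decimal code point, so é is displayed as \233 and € as \8364 while the underlying String is unchanged. A minimal sketch using only base; the sample string is copied from the expected output, and the helper name `nonAsciiCodePoints` is made up for illustration:

```haskell
import Data.Char (ord)

-- Sample line copied from the expected test output (illustration only).
sample :: String
sample = "Peinture métallisée 550 €"

-- Hypothetical helper: the decimal code point of every non-ASCII character.
nonAsciiCodePoints :: String -> [Int]
nonAsciiCodePoints s = [ord c | c <- s, ord c > 127]

main :: IO ()
main = do
  -- print goes through show, so the accents appear as decimal escapes:
  -- "Peinture m\233tallis\233e 550 \8364"
  print sample
  -- The characters themselves are intact: é is U+00E9 (233), € is U+20AC (8364).
  print (nonAsciiCodePoints sample)  -- [233,233,8364]
```

So both the expected and the actual lists contain the correct characters; the real difference is where the match starts and ends.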

My regex works on my file: https://regex101.com/r/HYBmMh/1

How can I fix this?


Solution

  • I switched from the regex-pcre-builtin package on Hackage to pcre-light, and it works!

    I had to encode my strings as UTF-8 ByteStrings and then pass the utf8 compile-time flag:

    module PdfParser (parseOptions) where
    
    import Text.Regex.PCRE.Light(compile, utf8, match)
    import Data.ByteString.UTF8(toString, fromString)
    import Data.List.Split
    import Data.String.Utils(strip)
    
    parseOptions :: String -> Maybe [String]
    parseOptions s = (splitOn "\n" . strip . toString . (!!0)) <$> (match regex (fromString s) [])
      where 
        regex = compile (fromString "(?<=OPTIONS\n)([\\s\\S]*?)(?=INTÉRIEUR)") [utf8]
    

    Thank you for your comments :)
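
A possible reason the utf8 flag matters: without UTF-8 mode, PCRE treats every byte of the subject as one character, while a Haskell String is indexed by code point. In UTF-8, é takes two bytes and € takes three, so byte positions and character positions drift apart as soon as accented text appears. That would be consistent with the shifted match in the question's test output, although the post does not confirm the cause. A minimal sketch using only base; `utf8Width` and `utf8Length` are hypothetical helper names, not part of any library:

```haskell
import Data.Char (ord)

-- Hypothetical helper: number of bytes one character occupies in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Hypothetical helper: length of a string once encoded as UTF-8.
utf8Length :: String -> Int
utf8Length = sum . map utf8Width

main :: IO ()
main = do
  let line = "Peinture métallisée 550 €"
  print (length line)      -- 25 characters
  print (utf8Length line)  -- 29 bytes: two é (2 bytes each) and one € (3 bytes)
```

Here a byte offset of 29 would point four positions past the end of the 25-character String, which is the kind of mismatch that UTF-8 mode avoids.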