I'm very new to Haskell. I'd like to be able to find some color expressions in a string. So let's say I have this list of expressions:
colorWords = ["blue", "green", "blue green"]
And I want to be able to get the locations of all of those, anywhere in a string, even if it's broken up by a linebreak, or if a hyphen separates it instead. So given a string like:
First there was blue
and then there was Green,
and then blue
green all of a sudden, and not to mention blue-green
It should give the character offsets for "blue" (line one), "green" (line two), and "blue green" (lines 3-4) and "blue-green" (line 4), something like:
[("blue", [20]), ("green", [40]), ("blue green", [50, 65])]
I can do this with regexes, but I've been trying to do it with a parser just as an exercise. I'm guessing it's something like:
import Text.ParserCombinators.Parsec
separator = spaces <|> "-" <|> "\n"
colorExp colorString = if (length (words colorString))>1 then
multiWordColorExp colorString
else colorString
multiWordColorExp :: Parser -> String
multiWordColorExp colorString = do
intercalate separator (words colorString)
But I have no idea what I'm doing, and I'm not really getting anywhere with this.
We can find substring locations with a parser by using the sepCap
combinator from replace-megaparsec.
Here's a solution to your example problem. Requires packages megaparsec, replace-megaparsec, containers.
References: string', choice, getOffset, try from Megaparsec.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
import Data.Void
import qualified Data.Map.Strict as Map
let colorWords :: Parsec Void String (String, [Int])
    colorWords = do
        -- record the offset before trying to match a color
        i <- getOffset
        -- try the two-word color first so "blue" alone does not win
        c <- choice
            [ try $ string' "blue" >>
                    anySingle >>
                    string' "green" >>
                    pure "blue green"
            , try $ string' "blue" >> pure "blue"
            , try $ string' "green" >> pure "green"
            ]
        return (c,[i])
input = "First there was blue\nand then there was Green,\nand then blue\ngreen all of a sudden, and not to mention blue-green"
-- collect the offsets of every match, grouped by color
Map.toList $ Map.fromListWith mappend $ rights $ fromJust
    $ parseMaybe (sepCap colorWords) input
[("blue",[16]),("blue green",[103,56]),("green",[40])]
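Note that anySingle in the parser above accepts any character between "blue" and "green", not only a space, hyphen, or newline. If you want to restrict the separator and build the parsers from your list of expressions, something like the following sketch might work. The names separator, colorExp, colorMatch, and findColors are my own for illustration, and I'm assuming that exactly one separator character sits between words and that no expression in the list is empty. Like the version above, it also matches colors inside longer words.

```haskell
import Data.Either (rights)
import Data.Functor (void)
import Data.List (intersperse, sortOn)
import qualified Data.Map.Strict as Map
import Data.Maybe (fromMaybe)
import Data.Ord (Down (..))
import Data.Void (Void)
import Replace.Megaparsec (sepCap)
import Text.Megaparsec
import Text.Megaparsec.Char (string')

type Parser = Parsec Void String

-- Assumption: words of a multi-word color are separated by exactly
-- one space, hyphen, or newline.
separator :: Parser Char
separator = oneOf " -\n"

-- Case-insensitive parser for one color expression. It returns the
-- canonical expression from the list, not the matched text.
-- Assumption: the expression contains at least one word.
colorExp :: String -> Parser String
colorExp color =
  color <$ sequence_ (intersperse (void separator) (map (void . string') (words color)))

-- Match any of the colors and pair it with the offset where it starts.
-- Longer expressions are tried first so "blue green" wins over "blue".
colorMatch :: [String] -> Parser (String, [Int])
colorMatch colors = do
  offset <- getOffset
  color <- choice [try (colorExp c) | c <- sortOn (Down . length) colors]
  pure (color, [offset])

-- Scan the whole input and group the offsets by color, in ascending order.
findColors :: [String] -> String -> [(String, [Int])]
findColors colors input =
  Map.toList . Map.map reverse . Map.fromListWith (++) . rights . fromMaybe [] $
    parseMaybe (sepCap (colorMatch colors)) input
```

With the input from above, findColors ["blue", "green", "blue green"] input should give the same three groups, with the "blue green" offsets in ascending order: [56,103].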