Search code examples
haskellconduitattoparsec

Attoparsec: Skipping bracketed terms?


I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB. In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:

{-# LANGUAGE OverloadedStrings #-}

import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative

main = do 
        (inputFile : outputFile : _) <- getArgs
        runResourceT $ sourceFile inputFile  
                $= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify 
                $= CT.encode CT.utf8 $$ sinkFile outputFile

jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
        where 
        go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
                Right res -> res <> "\n"
                _ -> ""
        go _ = ""

keyTo_id :: Parser T.Text 
keyTo_id = skipWhile(/='{') >> T.snoc <$>
        (T.cons <$> (char '{') 
                <*> (T.concat <$> many1 ( bracket 
                    <|> (string "\"key\":" >> return "\"_id\":") 
                    <|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"') 
                    <|> T.singleton <$> satisfy (/= '}')
                    )))  
        <*> char '}'        

bracket :: Parser T.Text        
bracket = T.cons <$> char '{' 
        <*> scan 1 test
     where
        test :: Int -> Char -> Maybe Int
        test 0 _ = Nothing        
        test i  '}'= Just (i-1)
        test i '{' = Just (i+1)
        test i _ = Just i

According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.

Surely there's a better way to return bracketed terms unchanged if the brackets match up?

I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.

EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz


Solution

  • Use the scan function. It allows you to scan over a string maintaing a state. In your case the state will be a number — the difference of opening and closing braces that you've encountered so far. When your state is 0, that means that braces match inside the current substring.

    The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.

    Also, you could gain some performance even with your current algorithm by using lazy Text — the concat function would work more efficiently.