I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB. In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
(inputFile : outputFile : _) <- getArgs
runResourceT $ sourceFile inputFile
$= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
$= CT.encode CT.utf8 $$ sinkFile outputFile
jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
where
go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
Right res -> res <> "\n"
_ -> ""
go _ = ""
keyTo_id :: Parser T.Text
keyTo_id = skipWhile(/='{') >> T.snoc <$>
(T.cons <$> (char '{')
<*> (T.concat <$> many1 ( bracket
<|> (string "\"key\":" >> return "\"_id\":")
<|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"')
<|> T.singleton <$> satisfy (/= '}')
)))
<*> char '}'
bracket :: Parser T.Text
bracket = T.cons <$> char '{'
<*> scan 1 test
where
test :: Int -> Char -> Maybe Int
test 0 _ = Nothing
test i '}'= Just (i-1)
test i '{' = Just (i+1)
test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
Use the scan
function. It allows you to scan over a string maintaing a state. In your case the state will be a number — the difference of opening and closing braces that you've encountered so far.
When your state is 0, that means that braces match inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text — the concat
function would work more efficiently.