Search code examples
jsonsphinx

Sphinx 3 Search engine: Having problems reading JSON from CSV source


When I try to read JSON content from a field I get:

WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'

Here are the details:

This is the (super simplified) CSV file I'm trying to read:

1,hello world, document number one,a:foo
22,hello again, document number two,foo:bar
23,hello now, This is some stuff,foo:{bar:baz}
24,hello cow, more test stuff and things,{foo:bar}
55,hello suess, box and sox and goats and moats,[a]
56,hello raven, nevermore said the thing,foo:bar

When I run the indexer this is the result I get:


../bin/indexer --config /home/ec2-user/sphinx/etc/sphinx.conf --all --rotate


Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/home/ec2-user/sphinx/etc/sphinx.conf'...
indexing index 'csvtest'...
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
WARNING: document 22, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
WARNING: document 23, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:{bar:baz}'
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
WARNING: document 56, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 0.1 Kb
total 0.0 sec, 17.7 Kb/sec, 1709 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=14393).

This is the entire config file:

source csvsrc
{
    type = csvpipe
    csvpipe_delimiter = ,
    csvpipe_command = cat /home/ec2-user/sphinx/etc/example.csv
    csvpipe_field_string =t
    csvpipe_attr_string =c
    csvpipe_attr_json =assorted
}



index csvtest
{
    source          = csvsrc
    path            = /var/data/test7
    morphology      = stem_en
    rt_field = t
    rt_field = c
    rt_field = assorted

}


indexer
{
    mem_limit       = 128M
}

searchd
{
    listen          = 9312
    listen          = 9306:mysql41
    log             = /var/log/searchd.log
    query_log       = /var/log/query.log
    pid_file        = /var/log/searchd.pid
    binlog_path     = /var/data
}

And If I do log in and query, it's pretty obvious that the JSON was not, in fact, indexed (as expected from the warnings)

 select * from csvtest;
+------+-------------+----------------------------------+----------+
| id   | t           | c                                | assorted |
+------+-------------+----------------------------------+----------+
|    1 | hello world |  document number one             | NULL     |
|   22 | hello again |  document number two             | NULL     |
|   23 | hello now   |  This is some stuff              | NULL     |
|   24 | hello cow   |  more test stuff and things      | NULL     |
|   55 | hello suess |  box and sox and goats and moats | NULL     |
|   56 | hello raven |  nevermore said the thing        | NULL     |
+------+-------------+----------------------------------+----------+
6 rows in set (0.00 sec)

I have tried a few things, but I'm just groping in the dark. Some things I have tried:

  1. Alternate formats of JSON. I have tried using {foo:bar} and {[foo:bar]} and [{foo,bar}] based on some experiences with other JSON inputs where they want it to be either an array or dict at the top level. These actually generate slightly different errors:
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
  1. I have tried adding a trailing comma thinking that might be the $end token that the parser is looking for. This generates an actual error ERROR: index 'csvtest': source 'csvsrc': not all columns found (found=5, total=4, line=1). which prevents index generation. That makes sense to me

2a) I tried adding a whole other column after the JSON so I could have the ending comma but not get an error that would prevent the index from generating. This did generate the index, but did not provide the $end token that the JSON parser was looking for.

I'm totally stumped.


Solution

  • Well as such a:foo isnt a valid JSON value AFAIK. LOoks like it meant to be object? So would need {...} surrounding it.

    But even {foo:bar} is not valid either. At the very least the 'value' shoud be quoted {foo:"bar"}. But really the keys quoting too {"foo":"bar"}

    Javascript Objects technically allow unquoted key names, but JSON requires the quoting.

    ... but also remember it CSV. Quotes are typically used for quoting (eg when columns contain commas), so the quotes need double encoding! Ends up a bit messy...

    24,hello cow, more test stuff and things,"{""foo"":""bar""}"