Search code examples
searchsolrlucenepandasanalytics

Query time analytics with Solr faceting and pivoting and log file data


I am doing some analytics using Solr and specifically using the faceting and pivot functionality for a large set of log files. I have a large log file that I have indexed in Solr along the lines of.

    Keyword  Visits log_date_ISO
1    red      1,938  2013-01-01
2    blue     435    2013-02-01
3    green    318    2013-04-01
4    red blue 279    2013-01-01

I then run a query and facet by 'log_date_ISO' to give me keyword counts by date that contain the query term. Two questions:

(1) Is there a way to sum the visits per keyword for each date - because what I really want is to sum visits across keywords that contain the query:

-> e.g. if I ran query 'red' for the above - I would want date 2013-01-01 to have a count of 1938 + 279 = 2217 (i.e. the sum of the visits associated with the keywords that contain the query 'red') rather than '2' (i.e. the count of the keywords containing the query).

(2) Is there a way to normalise by monthly query volume?

-> e.g. if the query volume for '2013-01-01' was 10,000 then the normalised volume for the query 'red' would be 2217/10000 = 0.2217

LAST RESORT: If these are not possible, I will pre-process the log file using pandas/python to group by date, then by keyword then normalise - but was wondering if it was possible in Solr.

Thanks in advance.


Solution

  • One can use solr to group by one field in the records and sum another field in the records, by group, using

    (1) Facets/pivots (groups data by a specified field)

    (2) StatComponent (calculates field statistics for specified field - including the sum)

    The call I made is (differently from the names in the question, the 'Keyword' field is called 'q_string', 'Visits' above is called 'q_visits' and 'log_date_ISO' is called 'q_date' in the below):

    http://localhost:8983/solr/select?q=neuron&stats=true&stats.field=q_visits&rows=1&indent=true&stats.facet=q_date
    

    This provides basic statistics - including the sum - for the *q_visits* field by date - the specific value I was interested in was the sum:

    <double name="min">1.0</double>
    <double name="max">435.0</double>
    <long name="count">263</long>
    <long name="missing">0</long>
    <double name="sum">845.0</double>
    <double name="sumOfSquares">192917.0</double>
    <double name="mean">3.2129277566539924</double>
    <double name="stddev">26.94368427501248</double>
    

    The field for which the statics are gathered is declared as type float in schema.xml (if its declared as a string then sum, sd, mean will not be shown).