DSpace 5.1 Solr item count totals out of sync


I am helping to support a DSpace 5.1 installation. Our client has reported a long-standing issue whereby the All Items count (in /statistics) does not match the sum of the counts for the individual item types, and the two diverge over time.

I'm guessing that not all operations (e.g. withdrawing an item?) correctly update the cached values, which appear to come from the Solr 'statistics' core.

I think what I need to do is run [dspace]/bin/dspace solr-reindex-statistics (Reindex SOLR statistics, for upgrades or whenever the Solr schema for statistics is changed), but this results in a usage error. It appears that the solr-reindex-statistics command is not available in DSpace 5.1.

Given that we have apparently fixed this sort of issue before, I suspect it was fixed as a side effect of a reindex done during a significant upgrade.

I think the procedure I need to follow is:

  1. stop tomcat
  2. backup [dspace]/solr/statistics
  3. start tomcat
  4. as tomcat, run [dspace]/bin/dspace stats-util -b -r
  5. when done, restart tomcat
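
Step 2 of the list above could be sketched as a small shell function. This is only an illustration: the function name backup_stats_core, the archive naming scheme, and the backup directory are assumptions, not part of DSpace. It should be run while Tomcat is stopped so the Solr index files are not changing underneath it.

```shell
#!/usr/bin/env bash
# Sketch only: archive the Solr statistics core before touching it.
# The function name and archive naming are hypothetical; adjust the
# paths to match the real installation.

# backup_stats_core DSPACE_DIR BACKUP_DIR
# Writes a timestamped tar.gz of DSPACE_DIR/solr/statistics into
# BACKUP_DIR and prints the path of the archive on success.
backup_stats_core() {
  local dspace_dir="$1" backup_dir="$2"
  local core="$dspace_dir/solr/statistics"

  # Refuse to continue if the core directory is not where we expect it.
  if [ ! -d "$core" ]; then
    echo "statistics core not found: $core" >&2
    return 1
  fi

  mkdir -p "$backup_dir" || return 1

  local archive
  archive="$backup_dir/solr-statistics-$(date +%Y%m%d-%H%M%S).tar.gz"

  # -C keeps the paths inside the archive relative to the solr directory.
  tar -czf "$archive" -C "$dspace_dir/solr" statistics || return 1
  echo "$archive"
}
```

With the install path used elsewhere in this question, a call might look like backup_stats_core /usr/local/dspace /var/backups/dspace, where the second path is a made-up example. Restoring would then mean stopping Tomcat and unpacking the archive back into [dspace]/solr.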

Does this appear to be a sane thing to do? I really only want to update the item counts, and I don't want to lose anything that can't be rebuilt.

Looking at my previous upgrade notes from when we went to 5.1 (from either 5.0 or 4.x, I'm not sure which), we did the following:

su - tomcat -s /bin/bash
  /usr/local/dspace/bin/dspace index-db-browse -f -d
  /usr/local/dspace/bin/dspace index-discovery -bf   ### perhaps an hour
  /usr/local/dspace/bin/dspace oai import -c -o
  /usr/local/dspace/bin/dspace oai clean-cache
  logout

In a subsequent upgrade, when we moved to the Mirage2 interface, we also ran [dspace]/bin/dspace index-discovery -b, which took the better part of an hour.

Not sure if that's part of the solution, but it seems like a heavy sort of hammer.

I neither develop for this deployment nor drive its maintenance schedule; I just do the deployment and operations. Unfortunately the Dev side has had a number of staffing changes, so an upgrade is not feasible at present and we've lost some institutional knowledge about this platform.

Thank you very much, Cameron


Solution

  • There are 2 statistics mechanisms in DSpace 5.

    The SOLR-based statistics are available at the links named "Usage Statistics".

    If SOLR is running properly, those statistics should be collected. The "stats-util" cron tasks support collection of these statistics, but they should not be required for you to see reported numbers. Run "stats-util -h" for usage information about each option.

    The Solr Statistics are reported at each level of the hierarchy by clicking on the "Usage Statistics" links. Unfortunately, the usage numbers for a community or collection show only visits to that community or collection itself. They do not show cumulative counts for all items within that collection or community.

    The "legacy statistics" are pulled from log files. Those links are available under /statistics. These statistics are generated using the "stat-monthly" and "stat-general" tasks. I have disabled these reports in my instance because I have not found the numbers to be reliable.

    See https://wiki.duraspace.org/display/DSDOC7x/Command+Line+Operations#CommandLineOperations-Legacystatistics for more information. Note the recommendation to use Solr Statistics.

    Check out https://wiki.duraspace.org/display/DSPACE/Support if you need additional support.
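
To see what the Solr statistics core itself holds, as opposed to what the "Usage Statistics" pages display, one option is to query the core directly and read back only the match count. The sketch below builds such a query URL. It rests on several assumptions that should be checked against the actual installation: that Solr is reachable at http://localhost:8080/solr/statistics, that the statistics schema has type and owningColl fields, and that type 2 denotes an item. The helper name stats_query_url is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: build a Solr select URL that returns just a match count.
# The base URL and the field names used in the example query are
# assumptions about a DSpace 5 statistics core, not verified values.

# stats_query_url BASE_URL QUERY
# Prints a select URL with rows=0, so Solr returns the number of matching
# documents (numFound) without returning the documents themselves.
stats_query_url() {
  local base="$1" query="$2"
  local encoded

  # Minimal percent-encoding: only spaces and colons are handled, which
  # is enough for simple "field:value AND field:value" queries.
  encoded=$(printf '%s' "$query" | sed -e 's/ /%20/g' -e 's/:/%3A/g')

  printf '%s/select?q=%s&rows=0&wt=json\n' "$base" "$encoded"
}
```

For example, curl -s "$(stats_query_url http://localhost:8080/solr/statistics 'type:2 AND owningColl:12')" would, under the assumptions above, report in numFound how many item-level usage events are recorded for the collection with internal ID 12 (a made-up ID). That is the kind of cumulative figure the collection's own "Usage Statistics" page does not show.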