repository dspace data-management

"Data Repository" software solution


I am trying to find a software solution that will let our group easily upload datasets (scripted and/or through a UI), tag them, retrieve them, control access to them, and search by tag as well as by file name, attributes, and metadata (e.g. file creation date). The datasets can be anything: CSV files, image (binary) datasets, text, server logs, nested folders of images, zip files of CSV data. We will need to store GBs to potentially PBs of data, and a single file can range from a few KB to hundreds of GB. We also need a usable API to retrieve these datasets programmatically.

We just want a centralized place to find information, so that we can answer a question such as "Hey, do you know if we have any lightning strike datasets?" If a file, folder, or zip file is tagged with "lightning", a search for that tag should return that dataset.

A possible solution would be something like Dataverse, DSpace, Fedora Commons, or CKAN. However, those seem to be geared towards academia and publications, or towards small datasets. On top of that, they discard any complex folder structure that might exist (e.g. Folder1-->subFolder1-->subFolder2). I also question how well one of these systems would scale to 10 million 100 KB files.

A filesystem share would let us store whatever we want, but I don't know of a reasonable way to enable tagging of the data.

It is almost as if I am looking for a combination of the two. Does anyone know of a tool, preferably open source, that can do something like this?


Solution

  • From what you have described so far, DSpace does seem to be a good fit.

    With the following examples I want to address the concerns you raised:

    Scalability: Here's an example of a multi-terabyte item: https://ore.exeter.ac.uk/repository/handle/10871/14881

    Complex structure: Dryad is based on DSpace and uses a more complex data model, with data files, data packages, and the original publication each represented as separate objects: http://datadryad.org/resource/doi:10.5061/dryad.322vn

    If that's what you want, you can also start your project from the Dryad codebase, since it is open source as well: https://github.com/datadryad/dryad-repo
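
For comparison, the tag lookup described in the question (find everything tagged "lightning") can be illustrated independently of any repository product. What follows is only a minimal sketch, assuming a small SQLite index kept alongside a file share; the `tags` table and the `open_index`, `tag_path`, and `find_by_tag` helpers are hypothetical names made up for this example and are not part of DSpace or Dryad. It ignores access control, file metadata, and scale.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Create (or open) the tag index. The schema is illustrative only."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " path TEXT NOT NULL,"
        " tag TEXT NOT NULL,"
        " PRIMARY KEY (path, tag))"
    )
    return conn


def tag_path(conn, path, *tags):
    """Attach one or more tags to a file or folder path.

    Tags are stored lowercased so that searches are case-insensitive;
    re-adding an existing (path, tag) pair is silently ignored.
    """
    rows = [(path, t.lower()) for t in tags]
    conn.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)", rows)
    conn.commit()


def find_by_tag(conn, tag):
    """Return every path carrying the given tag, sorted by path."""
    cur = conn.execute(
        "SELECT path FROM tags WHERE tag = ? ORDER BY path",
        (tag.lower(),),
    )
    return [row[0] for row in cur]
```

With this in place, tagging a zip file and a log folder and then asking for lightning data might look like the following (the paths are made up):

```python
conn = open_index()
tag_path(conn, "weather/2014/strikes.zip", "lightning", "weather")
tag_path(conn, "logs/web01", "server-logs")
print(find_by_tag(conn, "Lightning"))
```

Because tags attach to paths rather than to flattened items, a nested folder can be tagged as a unit without changing its layout on disk.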