r database bioinformatics data-warehouse

Beginner guide to omics data warehousing

I work in the biosciences, mainly in data analysis. Lately, the volume of data is growing, and things are getting more complicated because multiple analysis techniques (most of them "omics" type) are used on various biological samples from the same set of individuals/patients/animals.

I would like to implement a better way of storing data and metadata locally (by metadata I mean both the general data about the individuals/patients/animals and the information about the instrument used in each assay), one that would also allow me to perform meta-analysis (mainly using R, but I would like a solution that can also work with SPSS). I am looking for guides to learn the basics of building, managing, and using databases, ideally tailored to biology and "omics" applications.

I could summarize my situation in the following image

In summary, the samples (individuals S1 to Sn) would be the main entries in the database. Over the same set of samples we could perform a series of experimental assays, each of which produces numeric data, generally organized in a CSV-like format keyed by the same sample ID and accompanied by some metadata about the assay (instrument used and similar). New entries in the database would usually be created by bulk upload of those CSV files.

Essentially, I would like to collect and connect everything in one place, instead of having one folder for every project with its related R scripts and raw data. From R, I would then retrieve from the general database the data relevant to a certain project and perform a set of analyses. For now, I am interested in a local solution, but I would like to leave the option of remote access open.

I have no background in databases, so I am open to any solution that fits my needs. For example, I have read that there are relational databases and graph databases (I do have some experience with ontologies), and I can't decide which would be better. Any "digested" source of general information from users who have handled similar issues, any beginner tips, or any suggestion on the best solution would help me get started.

Solution

Actually, I disagree with the commenters who are criticizing this question, though I agree it is not specific to R or R-related programming. Maybe I just sympathize because I have been in a similar position. A better venue to ask something like this might be BioStars.

That said, I also work in academia, and I also had a similar problem. No one in my circles had a great answer.

From your diagram, it seems like you know something about relational databases, which is good. If you aren't familiar with SQL syntax or relational database ideas, then definitely start there. I don't have a great suggestion on how to learn about these: I had a class on MySQL in college, and then started using SQLite and PostgreSQL on my own. I very much appreciated the MySQL class, so if you feel like you don't know SQL or relational database topics well, maybe you could find an online course (or take one at your university, if you are located at a school).
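To make the relational idea concrete, here is a minimal sketch of the layout described in the question: one table for samples, one for assays, and one for measurements keyed by sample ID, with a bulk CSV load and a join. It is written in Python with the standard-library `sqlite3` module only so that it runs without any setup. The table names, columns, example values, and the `load_measurements` helper are all assumptions for illustration, not a recommended production schema.

```python
import csv
import io
import sqlite3

# Hypothetical schema: one row per sample, one row per assay,
# and one row per (sample, assay, feature) measurement.
SCHEMA = """
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,
    species   TEXT,
    sex       TEXT
);
CREATE TABLE assay (
    assay_id   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    instrument TEXT
);
CREATE TABLE measurement (
    sample_id TEXT    NOT NULL REFERENCES sample(sample_id),
    assay_id  INTEGER NOT NULL REFERENCES assay(assay_id),
    feature   TEXT    NOT NULL,
    value     REAL,
    PRIMARY KEY (sample_id, assay_id, feature)
);
"""


def load_measurements(conn, assay_id, csv_text):
    """Bulk-load a long-format CSV (sample_id,feature,value) for one assay."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["sample_id"], assay_id, r["feature"], float(r["value"]))
        for r in reader
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
conn.executescript(SCHEMA)

# Made-up example data: two samples and one assay.
with conn:
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [("S1", "mouse", "F"), ("S2", "mouse", "M")],
    )
    conn.execute(
        "INSERT INTO assay VALUES (1, 'proteomics', 'made-up instrument')"
    )

csv_text = "sample_id,feature,value\nS1,geneA,1.5\nS2,geneA,2.5\n"
n_loaded = load_measurements(conn, 1, csv_text)

# Pull the data for one assay together with the sample metadata.
rows = conn.execute(
    """
    SELECT s.sample_id, s.sex, m.feature, m.value
    FROM measurement AS m
    JOIN sample AS s ON s.sample_id = m.sample_id
    JOIN assay  AS a ON a.assay_id  = m.assay_id
    WHERE a.name = 'proteomics'
    ORDER BY s.sample_id
    """
).fetchall()
```

Because the measurement table references the sample table, a CSV row with an unknown sample ID is rejected at load time instead of silently creating orphaned data, which is most of what a relational database gives you over a folder of CSV files.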

Specifically in R, I would start by reading about connecting to a database from R/RStudio:

https://db.rstudio.com/

I use this predominantly through the RPostgreSQL package, which is a PostgreSQL driver for the DBI package (DBI defines the common database interface; RPostgreSQL implements it for PostgreSQL).

Obviously get very comfortable with the tidyverse packages, if you aren't already. This is a great resource:

https://r4ds.had.co.nz/

Since we're on the topic of Hadley Wickham, and since the topic of Hadley Wickham is related to R and R programming, I don't feel bad saying you should read this, too:

https://vita.had.co.nz/papers/tidy-data.pdf
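The tidy-data paper is relevant here because assay output usually arrives "wide" (one column per gene, protein, or metabolite), while a database table is easier to query when each row is one observation. A minimal sketch of that reshaping, using only the Python standard library; the column names, the `wide_to_long` helper, and the example values are assumptions for illustration:

```python
import csv
import io


def wide_to_long(csv_text, id_column="sample_id"):
    """Turn a wide table (one column per feature) into one row per
    (sample, feature, value) observation."""
    reader = csv.DictReader(io.StringIO(csv_text))
    long_rows = []
    for record in reader:
        sample = record[id_column]
        for feature, value in record.items():
            if feature == id_column:
                continue  # the ID column identifies the row; it is not a measurement
            long_rows.append(
                {"sample_id": sample, "feature": feature, "value": float(value)}
            )
    return long_rows


# Made-up wide table: two samples, two measured features.
wide_csv = "sample_id,geneA,geneB\nS1,1.5,0.2\nS2,2.5,0.4\n"
long_rows = wide_to_long(wide_csv)
```

In R, `tidyr::pivot_longer()` does the same job; the point is only that the long form maps directly onto a single measurements table, whatever the number of features in a given assay.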

You'll need to learn some basics about servers. I understand that you're specifically interested in doing this locally, but I suspect that there will come a time when you'll need to be able to host both locally and remotely. At least, that has been my experience. In any case, you should be using Linux (I hope this isn't too contentious a statement) on your local computer, which means that dealing with a local database is more or less the same as dealing with one remotely (minus some security concerns).

I find Nginx easier than Apache, but that is likely a matter of taste. I use AWS when I need a public server, though if your university has hosting services, you could do a price comparison. AWS has been cheaper and easier in my experience. To manage a served database, I use Django, which is a Python package. If you choose to build a Django-managed database, I suggest using this cookiecutter (a Python project template):

https://github.com/agconti/cookiecutter-django-rest

Finally, posted below is a link to a Django database framework for a currently active project for which I'm managing the data. I'm also including a link to an R package that I'm messing around with, which is meant to absorb some of the database data, process it, and spit it out. The latter is very much under development. It isn't in a shareable state, but I think it would have helped me to see something like this when I started asking questions similar to yours, so I'm going to include it.

https://github.com/BrentLab/S288CR64_database

https://github.com/cmatKhan/brentlabRnaSeqTools

If you have questions specifically related to genomics data management, feel free to email me. You'll find my address on GitHub.