BioStudies - one package for all the data supporting a study


The BioStudies database holds descriptions of biological studies with links to the underlying data in EMBL‐EBI databases or elsewhere. This defines the scope of BioStudies: if the study is supported by data available at EMBL‐EBI or associated with full‐text articles in Europe PMC, then it can be included in BioStudies. When necessary, BioStudies can also host data that do not fit in the existing structured archives.

We define a study as a biological experiment or set of experiments that are usually, but not exclusively, linked to an article. Studies may be of different scales or granularities; for example, the 1,000 Genomes Project can be considered one large‐scale study, but may include or be linked to many other (smaller) studies; studies may therefore have a simple hierarchical structure. From the implementation perspective, BioStudies can be seen as a system for managing data files plus metadata associated with the entire study, or with its specific parts, such as individual data files. The accompanying metadata includes fields such as title, authors and submission date, making the BioStudy citable. Further structured metadata are optional and can describe how data files were generated, or give biological context (such as tissue sampled). These fields alongside a free‐text study description field provide the content basis for searching across datasets.
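The data model described above can be illustrated with a minimal sketch. It is not the BioStudies schema: the class names (`Study`, `DataFile`), the field names and the `search` helper are all assumptions made for illustration. It only shows how citation metadata, optional structured attributes, per-file metadata and a simple hierarchy of sub-studies might fit together and feed a basic text search.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Iterator, List


@dataclass
class DataFile:
    """A data file with optional metadata of its own (hypothetical type)."""
    path: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Study:
    """Hypothetical study record: citation fields plus optional metadata."""
    title: str
    authors: List[str]
    submission_date: date
    description: str = ""                                     # free-text description
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. tissue sampled
    files: List[DataFile] = field(default_factory=list)
    sub_studies: List["Study"] = field(default_factory=list)  # simple hierarchy

    def walk(self) -> Iterator["Study"]:
        """Yield this study, then all nested sub-studies, depth first."""
        yield self
        for child in self.sub_studies:
            yield from child.walk()

    def searchable_text(self) -> str:
        """Concatenate the fields that a search would be run against."""
        parts = [self.title, self.description, *self.authors,
                 *self.attributes.values()]
        for data_file in self.files:
            parts.extend(data_file.attributes.values())
        return " ".join(parts).lower()


def search(root: Study, term: str) -> List[str]:
    """Return titles of all studies under root whose metadata mention term."""
    needle = term.lower()
    return [s.title for s in root.walk() if needle in s.searchable_text()]


# Invented example data: one umbrella study containing one smaller study.
pilot = Study(
    title="Liver expression pilot",
    authors=["A. Author"],
    submission_date=date(2015, 1, 1),
    attributes={"tissue": "liver"},
    files=[DataFile("counts.tsv", {"type": "expression matrix"})],
)
umbrella = Study(
    title="Umbrella project",
    authors=["B. Author"],
    submission_date=date(2015, 1, 2),
    description="Large-scale study linking several smaller studies.",
    sub_studies=[pilot],
)
```

Calling `search(umbrella, "liver")` walks the hierarchy and matches only the nested pilot study, because the term appears in that study's title and tissue attribute but not in the umbrella study's own metadata.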

BioStudies will ingest data from Europe PMC as well as taking direct submissions. When an article is deposited in Europe PMC and mentions the persistent identifier of a dataset (accession number or DOI) in the text, or has supplemental data attached to the article, the BioStudies metadata as outlined in Box 1 will be generated automatically and deposited in BioStudies. This will provide a collated view of all the data associated with an article, supporting the Use Cases described above. The accession numbers and data DOIs are extracted by the Europe PMC text‐mining pipeline (Kafkas et al, 2013), which currently covers 20 major data resources in the life sciences (ENA, SNPs, PDBe, RefSeq, ClinicalTrials.gov, EudraCT, OMIM, GO, UniProt, Pfam, ArrayExpress, Ensembl, InterPro, BioProject, ProteomeXchange, BioSample, EMDB, TreeFam, EGA and data DOIs).
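The idea of extracting dataset identifiers from article text can be sketched with a few regular expressions. This is an illustration only and not the Europe PMC pipeline: the patterns below are simplified assumptions covering three identifier types, and the names `PATTERNS` and `extract_identifiers` are hypothetical. A production pipeline would also need contextual checks to rule out false positives.

```python
import re
from typing import Dict, List

# Simplified, assumed patterns for illustration; real accession formats
# have more variants than these expressions cover.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "ArrayExpress": re.compile(r"\bE-[A-Z]{4}-\d+\b"),
    "BioProject": re.compile(r"\bPRJ[EDN][A-Z]\d+\b"),
    "DOI": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
}


def extract_identifiers(text: str) -> Dict[str, List[str]]:
    """Return the unique identifiers found in text, grouped by resource."""
    found: Dict[str, List[str]] = {}
    for resource, pattern in PATTERNS.items():
        # Strip trailing sentence punctuation that the DOI pattern may capture.
        matches = sorted({m.rstrip(".,;)") for m in pattern.findall(text)})
        if matches:
            found[resource] = matches
    return found


# Invented sentence of the kind found in a data availability statement.
sentence = (
    "Data are available in ArrayExpress under E-MTAB-1234 "
    "and at doi 10.5061/dryad.abc12."
)
identifiers = extract_identifiers(sentence)
```

For the invented sentence above, `identifiers` holds one ArrayExpress accession and one DOI. Grouping the matches by resource mirrors the collated, per-article view of linked data that the text describes.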

BioStudies will include study data relating to any of the EMBL‐EBI archival databases. Currently, BioStudies accepts submissions from large projects and will accept individual submissions in late 2015. We expect that BioStudies will be capable of storing large files (such as image sets), in line with other big data community requirements met at the EMBL‐EBI. As an EMBL‐EBI archive, there is a long‐term commitment to supporting BioStudies, for as long as it is useful to the scientific community.

https://www.ebi.ac.uk/biostudies/