Data Containers and Interoperability



This work package implements common data representations for multi-omic datasets that facilitate interoperability between the algorithms and analyses developed in the other WPs (T8.1). We also aim to make these representations equally useful for multi-omics research in the wider community. The representations will follow established design principles that have already proven successful in enabling reproducible and integrative workflows with R / Bioconductor (Section 1.3). The data containers will be designed to manage and manipulate multiple omic data types, including all those used in SOUND (DNA variants, RNA expression and DNA methylation, possibly from multiple time points and/or body sites of multiple patients), together with the appropriately interlinked metadata. A recurrent challenge is that large data sets do not fit into memory all at once and must be held on mass storage, dynamically and transparently to the user, without compromising the vector-based efficiencies of the R programming model. In T8.2, we pursue a relatively immediate solution to this problem, while T9.2, carried out in collaboration, pursues a more fundamental approach.
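
To illustrate the kind of container described above, the following is a minimal sketch in Python rather than R / Bioconductor, and it is not the WP's actual design. The class name MultiOmicContainer, its methods, and the sample identifiers and values are hypothetical, chosen only to show how several assays can stay linked to one shared table of sample metadata so that subsetting by sample applies to all data types at once.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultiOmicContainer:
    """Hypothetical container: one sample table plus several assays keyed by sample ID."""

    # sample ID -> annotation (patient, time point, body site, ...)
    sample_metadata: Dict[str, Dict[str, str]]
    # assay name -> {sample ID -> feature values}
    assays: Dict[str, Dict[str, List[float]]] = field(default_factory=dict)

    def add_assay(self, name: str, measurements: Dict[str, List[float]]) -> None:
        """Register an assay; every measured sample must exist in the metadata."""
        unknown = set(measurements) - set(self.sample_metadata)
        if unknown:
            raise ValueError(f"samples without metadata: {sorted(unknown)}")
        self.assays[name] = dict(measurements)

    def subset(self, sample_ids: List[str]) -> "MultiOmicContainer":
        """Return a new container restricted to the given samples in all assays."""
        keep = [s for s in sample_ids if s in self.sample_metadata]
        return MultiOmicContainer(
            sample_metadata={s: self.sample_metadata[s] for s in keep},
            assays={
                name: {s: values for s, values in data.items() if s in keep}
                for name, data in self.assays.items()
            },
        )

    def complete_samples(self) -> List[str]:
        """Samples measured in every registered assay."""
        return sorted(
            s for s in self.sample_metadata
            if all(s in data for data in self.assays.values())
        )


# Illustrative (made-up) data: two patients, one of them sampled at two time points.
container = MultiOmicContainer(
    sample_metadata={
        "P1_t0": {"patient": "P1", "time_point": "0"},
        "P1_t1": {"patient": "P1", "time_point": "1"},
        "P2_t0": {"patient": "P2", "time_point": "0"},
    }
)
container.add_assay(
    "rna_expression",
    {"P1_t0": [5.1, 0.3], "P1_t1": [4.8, 0.9], "P2_t0": [6.0, 0.1]},
)
container.add_assay(
    "dna_methylation",
    {"P1_t0": [0.82, 0.10], "P2_t0": [0.79, 0.15]},
)

# Subsetting once keeps metadata and all assays consistent with each other.
baseline = container.subset(["P1_t0", "P2_t0"])
```

In this sketch the metadata table is the single point of reference: an assay cannot be added for a sample that lacks metadata, and a subset is always applied to every assay together. Holding the assay values on mass storage rather than in memory is deliberately left out of the sketch.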

The WP will also provide important public data resources (e.g., ENCODE, ICGC, GTEx and other consortium data summaries) in a format that allows ready use in R / Bioconductor and supports the integrative analyses outlined in WP1-7 (T8.3). Finally, the WP will capture data outputs from WP1-7 as instances of the well-defined data representations (T8.4). The purposes of T8.4 include documenting SOUND activities, realizing the principles of reproducible research, providing exemplars for training (used in WP12), and enabling integrative downstream analyses.

Participating partners

  • Roswell Park Cancer Institute, Buffalo (Lead Partner)
  • European Molecular Biology Laboratory, Heidelberg
  • ETH Zurich
  • University of Cambridge
  • Technische Universität München
  • Instituto de Engenharia Mecânica, Lisbon
  • BeDataDriven, The Hague