WP4 (Cross-cutting Methods for Oncology and Genetics) is the most mathematics- and theory-driven work package in SOUND. Its goal is to address overarching challenges posed by multi-omics patient data by developing novel mathematical and statistical methods for data analysis. The choice of Tasks T4.1 to T4.4 emerged from the preparation meetings for this consortium, where these topics were repeatedly identified as essential requirements by many partners.
A fundamental challenge is the N << p, or high-dimensionality, problem: the number of measured features or variables (p) is much larger than the sample size (N). This problem occurs frequently because ’omics data are very high-dimensional (many genes, mutations, etc.), while the number of observations (patients, tumours, etc.) is relatively small. Combined with the noise that is abundant in high-throughput genomic technologies, the N << p setting leads to ill-posed inverse problems. To address this statistical challenge, in T4.1 we will use regularization methods to enforce solutions that are sparse (many model parameters are zero), smooth (related model parameters are similar), or that fulfil structural constraints derived from prior knowledge, e.g., in the form of biological networks. Our goal is to enable simple, cost-effective, and communicable clinical models that can detect patient-specific disease markers (e.g., genes, mutations, mutation signatures) and support clinical decisions, patient stratification, and drug target identification.
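As an illustration of sparsity-enforcing regularization in the N << p setting, the following minimal sketch fits an L1-penalized (lasso) linear model by coordinate descent on simulated data. The lasso is used here only as one well-known example of a sparse penalty; the text above does not commit T4.1 to a specific method. The function names (`soft_threshold`, `lasso_coordinate_descent`, `simulate`), the simulated data, and all parameter values are assumptions made for this example.

```python
import random


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / (2N)) * ||y - X beta||^2 + lam * ||beta||_1.

    X is a list of N rows with p features each, y a list of N responses.
    Illustrative sketch only: fixed number of sweeps, no convergence check.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, kept up to date after each update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def simulate(n, p, seed=0):
    """Hypothetical data: p Gaussian features, only the first three are relevant."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    true_beta = [0.0] * p
    true_beta[0], true_beta[1], true_beta[2] = 2.0, -1.5, 1.0
    y = [sum(X[i][j] * true_beta[j] for j in range(p)) + rng.gauss(0.0, 0.1)
         for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = simulate(n=20, p=50)  # N = 20 samples, p = 50 features
    beta = lasso_coordinate_descent(X, y, lam=0.2)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print("features with non-zero coefficients:", selected)
```

The penalty weight `lam` controls the trade-off: larger values set more coefficients exactly to zero, and above a data-dependent threshold every coefficient is zero. In practice it would typically be chosen by cross-validation, which this sketch omits.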
In the context of high-dimensional ’omics data, three additional specific challenges arise. The first concerns automatic or semi-automatic outlier detection in heterogeneous data, an important task that supports the search both for technical errors and for exceptional medical cases (T4.2). The second is to establish causal relations among the features sampled in observational studies when little or no interventional data are available (T4.3). The third is to handle longitudinal (i.e., time-course) high-dimensional data, including data with features that are collinear, missing, or censored (T4.4). Genomic data on tumour progression are of this type, and we will develop statistical methods that can provide adequate patient stratification and further support diagnosis, improve prognosis, and guide treatment decisions.
- ETH Zurich (Lead Partner)
- European Molecular Biology Laboratory, Heidelberg
- Roswell Park Comprehensive Cancer Center, Buffalo
- University of Cambridge
- Technische Universität München
- Instituto de Engenharia Mecânica, Lisbon