Netherlands eScience Center co-develops portable workflow for structural variant detection

8 Jan 2020

Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. Despite advances in whole genome sequencing, however, comprehensive and accurate detection of SVs in short-read data still poses some practical and computational challenges. To address this problem, a team of scientists and research software engineers from the Netherlands eScience Center has developed a new portable open-source workflow called sv-callers that enables improved detection of SVs in cancer genomes using multiple tools. Their work was recently published in PeerJ – the Journal of Life and Environmental Sciences.

Structural variants, such as deletions, insertions and duplications, account for a large part of the genomic diversity among individuals and have been implicated in many diseases, including cancer. With the advent of novel DNA sequencing technologies, whole genome sequencing (WGS) is becoming an integral part of cancer diagnostics and can potentially enable tailored treatments of individual patients. However, despite advances in large-scale cancer genomics projects, the detection of SVs in genomes remains challenging due to computational and algorithmic limitations.

The ensemble approach

“Recent tools for somatic and germline SV detection exploit more than just one type of information present in WGS data”, says Dr Arnold Kuzniar, eScience Engineer and first author. “A promising way to obtain more accurate and comprehensive results is by using what is known as the ensemble approach, which has been shown to improve the detection of SVs. Nevertheless, running multiple SV tools efficiently on a user’s computational infrastructure or adding new SV callers as they become available has been difficult.”

According to Kuzniar, a common practice is to couple multiple tools, or “callers”, using monolithic wrapper scripts or, less often, a workflow system. A workflow system is the recommended way to improve the extensibility, portability and reproducibility of data-intensive analyses, but such workflows are usually developed to run on one computer system and are therefore not necessarily portable to, or reusable on, another system.
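To make the ensemble idea described above more concrete, the following is a minimal Python sketch, not the method used by sv-callers. It assumes that calls from different tools are considered the same event when they share chromosome and SV type and overlap reciprocally by a chosen fraction, and that an event is kept only if at least two callers report it. The names SVCall, reciprocal_overlap and ensemble_calls, the caller labels and both thresholds are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    """One structural variant call (hypothetical, simplified record)."""
    caller: str   # name of the tool that produced the call
    chrom: str    # chromosome
    start: int    # start coordinate
    end: int      # end coordinate
    svtype: str   # e.g. "DEL", "INS", "DUP"


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if unrelated."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def ensemble_calls(calls, min_overlap=0.8, min_callers=2):
    """Greedily cluster similar calls and keep clusters seen by enough callers.

    Each call is compared with the first member of every existing cluster;
    this is a simplification, assumed here for brevity.
    """
    clusters = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end))
    for call in ordered:
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    # Keep only events supported by at least `min_callers` distinct tools.
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


example_calls = [
    SVCall("caller_a", "chr1", 1000, 2000, "DEL"),
    SVCall("caller_b", "chr1", 1010, 1990, "DEL"),
    SVCall("caller_a", "chr2", 500, 900, "DUP"),
]
supported = ensemble_calls(example_calls)
```

In this example the two chr1 deletions are grouped into one event supported by two callers, while the chr2 duplication, reported by a single caller, is dropped. Requiring agreement between tools trades some sensitivity for fewer false positives; the thresholds would need tuning for real data.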

SV callers tied together

To address these problems, the team developed “sv-callers”, a user-friendly, portable and scalable workflow based on the Snakemake and Xenon (middleware) software. The workflow includes state-of-the-art somatic and germline SV callers, can easily be extended with new ones, and runs on high-performance computing clusters or clouds with minimal effort. It supports all the major SV types as detected by the individual callers.
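The reason such a workflow scales is that each combination of sample and caller is an independent job that can run in parallel. The Python sketch below illustrates only that idea; it does not use Snakemake or Xenon and does not reflect how sv-callers is implemented. The function names run_caller and run_all, the sample and caller labels, and the output file naming are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_caller(sample: str, caller: str) -> str:
    """Placeholder for one job; a real workflow would launch the tool here.

    Returns the (hypothetical) name of the output file for this job.
    """
    return f"{sample}.{caller}.vcf"


def run_all(samples, callers, max_workers=4):
    """Run every sample/caller combination as an independent parallel job."""
    jobs = list(product(samples, callers))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `jobs`, so zip pairs them correctly.
        outputs = pool.map(lambda job: run_caller(*job), jobs)
        return dict(zip(jobs, outputs))


results = run_all(["sample_1", "sample_2"], ["caller_a", "caller_b"])
```

Here two samples and two callers produce four independent jobs. A workflow system adds what this sketch leaves out, such as tracking dependencies between steps, submitting jobs to a cluster scheduler and re-running only what has changed.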

“The workflow was developed incrementally based on requirements in the context of the Googling the cancer genome project, which is led by Dr Jeroen de Ridder from the University Medical Center Utrecht and supported by the eScience Center”, says Kuzniar. “We have extensively tested the workflow with [human] WGS datasets on different HPC systems as well as performed a number of production runs on the genomes of cancer patients. The workflow readily automated parallel execution of the tools across compute nodes and enabled streamlined data analyses in a Jupyter Notebook.”

Kuzniar credits the workflow’s results to the wide-ranging expertise of the individual project partners. “Developing this workflow was truly a collective endeavor. Without the in-depth knowledge of and experience with short read sequencing data and SV detection in particular, the workflow would have been computationally efficient but the results incomplete or inaccurate from a biological point of view. I am really happy to be part of what we’ve achieved together.”

The team has already made the workflow freely available and, moving forward, intends to maintain the software.

Publication details

Kuzniar A, Maassen J, Verhoeven S, Santuari L, Shneider C, Kloosterman WP, de Ridder J. “SV-Callers: A Highly Portable Parallel Workflow for Structural Variant Detection on Whole-Genome Sequence Data” in PeerJ (6 January 2020).