Barcode correction for linked-read sequencing data
Linked reads are a new and cost-effective type of short-read sequencing data that provides long-range sequence information in barcodes that label the origin of a read pair from a longer DNA molecule. A common preprocessing step in linked-read data analysis is barcode correction, where the barcodes in the data are matched to those on a given whitelist.
Here, I will introduce bctools, a new toolbox for barcode correction. The toolbox can infer a whitelist of barcodes from the data, correct barcodes that have a Hamming distance of 1 to a whitelisted barcode using a succinct index data structure, and compute basic barcode statistics. Compared with the longranger basic pipeline, bctools is several times faster, outputs more corrections that can be validated with a read cloud, and reduces false corrections by using an inferred whitelist of barcodes. It can also output a list of alternative corrections in a versatile output format. Thus, it promises to become a valuable asset for the analysis of linked-read sequencing data.
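To make the idea of Hamming-distance-1 correction concrete, here is a minimal Python sketch. It is not the bctools implementation: a plain Python set stands in for the succinct index, and the function names (`neighbors_at_distance_one`, `correct_barcode`, `infer_whitelist`) and the count threshold `min_count` are hypothetical, chosen only for illustration. How bctools actually infers its whitelist is not described here, so the frequency cutoff below is an assumption.

```python
from collections import Counter

ALPHABET = "ACGT"


def neighbors_at_distance_one(barcode):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in ALPHABET:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def correct_barcode(barcode, whitelist):
    """Return the sorted list of whitelisted candidates for `barcode`.

    A barcode already on the whitelist is returned unchanged. Otherwise all
    whitelisted barcodes at Hamming distance 1 are returned; more than one
    entry means the correction is ambiguous, and an empty list means no
    correction was found.
    """
    if barcode in whitelist:
        return [barcode]
    return sorted({c for c in neighbors_at_distance_one(barcode) if c in whitelist})


def infer_whitelist(barcodes, min_count=2):
    """Assumed heuristic: keep barcodes observed at least `min_count` times."""
    counts = Counter(barcodes)
    return {bc for bc, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    observed = ["ACGT", "ACGT", "ACGA", "ACGA", "ACGC", "GGGG"]
    whitelist = infer_whitelist(observed, min_count=2)
    for bc in ("ACGT", "ACGC", "GGGG"):
        print(bc, correct_barcode(bc, whitelist))
```

Returning all candidates rather than a single best match mirrors the idea of reporting alternative corrections; a downstream step could then pick among them, for example by checking which candidate is supported by a read cloud.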
Birte Kehr has led the Junior Research Group Genome Informatics at the Berlin Institute of Health (BIH) since November 2016. Prior to joining the BIH, she worked as a Research Scientist at deCODE genetics in Iceland. There she gained experience in working with large-scale genomic data and developed a particular interest in structural variation discovery for understanding human disease. She received her PhD from the Freie Universität Berlin within the International Max Planck Research School for Computational Biology and Scientific Computing in 2014. Her thesis addressed algorithms and data structures for multiple whole-genome alignment.