When it comes to sequencing DNA, how do we know the information we get out of our sequencing machines is accurate, and how can we ensure that our final genome assembly is as error-free as possible?
KAT, the K-mer Analysis Toolkit, developed at the Earlham Institute, represents a necessary step to ensure that this is the case - allowing bioinformaticians to work more efficiently and ultimately improve the quality of genome projects.
When we explain how a genome is sequenced and then put together, we often use simple terms.
We imagine DNA as a long set of instructions - written in a molecular alphabet of As, Ts, Gs and Cs, which form three-letter words we call codons. These in turn form precise instructions, the genes, which together - along with lots of information in between - make up the blueprint for how an organism is formed.
Current genome sequencing technologies, however, involve chopping all of these instructions up into little pieces, which we sequence, and then we have to stitch these pieces back together in a computer, like a giant jigsaw puzzle.
Indeed, this is the way that genome sequencing has been tackled, with assemblers looking for the parts that match up together, then joining them up in the most likely order.
Imagine it like Lego. Instead of trying to piece together long 8x2-stud pieces with 6x2-stud and 5x2-stud pieces, it’s more like making a staircase pattern out of the smaller 2x2-stud pieces, overlapping one stud at a time.
However, sequencing is not error-free, and this can lead to mistakes in the assembly process. Genome assembly is itself computationally demanding and error-prone.
It’s possible to go through the entire process of sequencing DNA, putting it all together using various computer algorithms and coming out with an assembly at the end, without really knowing whether that assembly is a true reflection of the original DNA.
Using K-mers is one way to analyse the quality of the huge amount of data coming out of sequencing machines and to flag problems in our genome assemblies before we start more expensive downstream analysis.
We chop up the sequence data that we have generated into overlapping fragments of length K. These fragments are called K-mers, and we count how many times each K-mer occurs in our dataset, be it data directly from our sequencers or from our assemblies.
The distribution of these counts - how many distinct K-mers occur once, twice, three times and so on - is what we call the ‘K-mer spectrum’. It has lots of really useful properties we can use to validate our datasets - revealing information about error levels, biases between different sequence runs, how complete sequence coverage is and whether there is any contamination from unwanted sources.
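To make the idea concrete, here is a minimal Python sketch of counting K-mers and building a K-mer spectrum. It is an illustration only, not how KAT itself is implemented: the function names `count_kmers` and `kmer_spectrum` and the toy reads are made up for this example, and it ignores details a real tool would handle, such as treating a K-mer and its reverse complement as the same thing.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip K-mers containing an uncalled base
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Return {frequency: number of distinct K-mers seen that many times}."""
    return Counter(counts.values())


# Toy reads (made up for illustration); the last one carries a single-base error.
reads = ["ACGTAC", "CGTACG", "GTACGT", "ACGTTC"]
counts = count_kmers(reads, k=3)
spectrum = kmer_spectrum(counts)

for freq in sorted(spectrum):
    print(freq, spectrum[freq])
```

In this toy dataset the two K-mers created by the sequencing error (GTT and TTC) each occur only once, while the genuine K-mers occur three or four times. The same pattern, at much larger scale, is why a pile of very low-frequency K-mers in a spectrum is usually a sign of sequencing errors.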
What is KAT?
KAT is a novel bioinformatics tool that can be used to compare two different K-mer datasets in a number of different ways, including quality control of sequence data, contamination detection and assembly validation. In other words, you can use the tool to check your data is good enough to start with, to validate your assembly at every stage, and then to find out whether the final product is of a decent enough quality to publish.
According to Bernardo Clavijo at EI, “the first thing you should do after sequencing a genome is to use KAT, as it is way faster than assembling a genome, it’s more accurate and it will tell you up-front the quality of your assembly.”
Passing KAT's checks is a necessary condition for an assembly to be correct (though not a sufficient one). KAT provides a hard check to tell you whether your sequence data is wrong, which is useful: in the long run, it is much cheaper and more energy- and time-efficient to resequence and get better data than to waste weeks assembling a genome that will be completely wrong.
By eliminating bad data, including data biases and contaminants, before an assembly has even begun, we can be more certain that the final product is as good as it can be.
Better Genome Assemblies
In terms of assembly validation, the tool is particularly useful. Often, with diploid genomes that can carry more than one copy of a gene, certain regions can be falsely duplicated or deleted during assembly. KAT can help to detect these artefacts by tracking both the data generated from the sequencer and data from the assembler.
This sort of analysis is particularly useful at EI, where we sequence a diverse range of organisms - some of which are not only diploid, but tetraploid (pasta wheat), hexaploid (bread wheat) or even octoploid (in the case of strawberries).
The nice trick of KAT is that it carries out an internal back-check of your own assembly, including the completeness and accuracy of the data, using just the input and the output.
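As a rough sketch of what comparing input and output can look like, the Python below counts K-mers in a set of reads and in an assembly, then tabulates how often each K-mer occurs in each. This is a simplified illustration under stated assumptions, not KAT's actual implementation: the helper names `count_kmers` and `compare_kmers`, the toy sequences and the choice of K = 3 are all made up for this example.

```python
from collections import Counter

K = 3  # toy value; real analyses use much longer K-mers


def count_kmers(sequences, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_kmers(read_counts, assembly_counts):
    """Tabulate (read frequency, assembly copy number) over all distinct K-mers."""
    matrix = Counter()
    for kmer in set(read_counts) | set(assembly_counts):
        # A Counter returns 0 for K-mers it has never seen.
        matrix[(read_counts[kmer], assembly_counts[kmer])] += 1
    return matrix


# Toy data (made up): the assembly ends in TAA where the reads say TAC.
reads = ["ACGTAC", "ACGTAC", "CGTACG"]
assembly = ["ACGTAA"]

read_counts = count_kmers(reads, K)
assembly_counts = count_kmers(assembly, K)
matrix = compare_kmers(read_counts, assembly_counts)

# Well supported by the reads but absent from the assembly: possible missing content.
missing = sorted(kmer for kmer in read_counts if kmer not in assembly_counts)
# Present in the assembly but never seen in the reads: possible assembly errors.
unsupported = sorted(kmer for kmer in assembly_counts if kmer not in read_counts)

print("missing from assembly:", missing)
print("unsupported by reads:", unsupported)
```

Here TAC appears three times in the reads but never in the assembly, while TAA appears in the assembly with no read support, so both are flagged. In the same spirit, a K-mer whose copy number in the assembly is higher or lower than its read frequency would suggest could point to a falsely duplicated or collapsed region.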
Bernardo added: “For the wheat genome, we checked the K-mer spectra all the way through using KAT, which means we could run the whole thing once, rather than running 20 different parameters and searching for the best one. With wheat, this would have been ridiculous - in terms of both computational power and cost.”
Before KAT, a lot of money and effort could be put into a sequencing project, only to find out it’s wrong at the end. With KAT, you know that your data is good, and you can validate your results at every stage.
The development of KAT was led by Bernardo Clavijo and Dan Mapleson, with George Kettleborough, Gonzalo Garcia and Jon Wright.