Genome assembly validation
Despite continued advances in assembly algorithms, few tools are available to evaluate the correctness of the assemblies they generate. With the exception of the few genomes that are manually curated by experts during an expensive process called finishing, most genome data is published as "draft" assemblies whose quality is uncertain. The correctness of the long-range connectivity of an assembly is an essential prerequisite for any comparative genomic study, as mis-assemblies can lead to incorrect conclusions.
Our group has been developing assembly validation tools that make use of all available information about the assembly. We are developing both visual interfaces that enable the manual inspection of assemblies and automated tools for detecting and correcting mis-assemblies. We are exploring the use of varied sources of information that provide clues regarding the correctness of assemblies. Examples of such data are:
- Mate-pair information - In most cases, shotgun reads are obtained by sequencing both ends of DNA fragments whose approximate size is known. This information constrains the placement of the reads within the assembly. In an ideal assembly, all read pairs are placed in such a manner as to satisfy the orientation and distance constraints imposed by the sequencing library. Most types of mis-assemblies lead to violations of these constraints. Our software tools identify such constraint violations and attempt to characterize the specific type of mis-assembly.
- Unused read information - Not all reads provided as input to an assembler are used in the final assembly. The unused reads, also called singletons, are often contaminants or insufficiently trimmed reads from the genome. Mis-assemblies, however, also lead to the presence of unused reads, as such reads are inconsistent with the chosen reconstruction of the genome. As an example, the reads spanning the join point of two copies of a tandem repeat are listed as singletons when the assembler incorrectly collapses this repeat. By aligning the singletons to the contigs produced by the assembler we can identify such mis-assemblies.
- Correlated polymorphisms - Mis-assemblies are characterized by the incorrect placement of reads within the assembly. Reads generated from different copies of the same repeat are assembled together if the repeat copies are sufficiently similar. Such situations can be identified by examining differences between the reads that cover the mis-assembled region. While differences between reads are expected due to sequencing errors, such differences are usually uncorrelated, leading to a very low probability that two overlapping reads have the same sequencing error at exactly the same location. In the case of mis-assemblies, however, the differences reflect true differences between the repeat copies and are therefore correlated across reads, providing a recognizable signature.
- Consistency of assembly with the parameters of the sequencing process - By modeling the characteristics of the sequencing process we can estimate the likelihood that a set of reads could have been generated from the assembled data. This likelihood provides a global assembly quality measure that can be compared across different assemblies of the same data, whether produced by multiple assemblers or by the same assembler run with different sets of parameters.
- Optical mapping data - For some genomes, scientists perform optical mapping experiments that identify the locations along the chromosomes of a set of restriction sites. By comparing these experimental maps with the in silico placement of the restriction sites along contigs, we can identify assembly errors highlighted by differences between these maps.
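
To make the mate-pair check described above concrete, the following is a minimal sketch, not the implementation used by our tools. It assumes a library whose reads face inward (the left read on the forward strand, the right read on the reverse strand) and a roughly normal insert-size distribution. The names `MatePair` and `classify_mate_pair`, the category labels, and the three-standard-deviation cutoff are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Placement of the two reads of one mate pair (hypothetical record layout)."""
    contig_a: str
    start_a: int      # leftmost coordinate of read A on its contig
    end_a: int        # rightmost coordinate of read A
    forward_a: bool   # True if read A lies on the forward strand
    contig_b: str
    start_b: int
    end_b: int
    forward_b: bool


def classify_mate_pair(pair, mean_insert, std_insert, n_std=3.0):
    """Return a label describing how the pair relates to the library constraints.

    Assumes an inward-facing library: the left read is forward, the right
    read is reverse. n_std is an illustrative tolerance, not a recommended value.
    """
    if pair.contig_a != pair.contig_b:
        return "different_contigs"

    # Order the two reads by their leftmost coordinate on the contig.
    if pair.start_a <= pair.start_b:
        left_forward, right_forward = pair.forward_a, pair.forward_b
    else:
        left_forward, right_forward = pair.forward_b, pair.forward_a

    # Orientation constraint: both reads on the same strand, or facing outward.
    if left_forward == right_forward:
        return "misoriented_same_strand"
    if not left_forward and right_forward:
        return "misoriented_outward"

    # Distance constraint: outer span of the pair versus the expected insert size.
    span = max(pair.end_a, pair.end_b) - min(pair.start_a, pair.start_b)
    if span < mean_insert - n_std * std_insert:
        return "compressed"
    if span > mean_insert + n_std * std_insert:
        return "stretched"
    return "satisfied"


if __name__ == "__main__":
    # Two reads 1,000 bp apart from a library with 3,000 +/- 300 bp inserts.
    example = MatePair("contig1", 100, 600, True, "contig1", 600, 1100, False)
    print(classify_mate_pair(example, mean_insert=3000, std_insert=300))
```

A single violated pair is weak evidence on its own, since libraries contain chimeric fragments and mis-sized inserts. In practice one would look for clusters of pairs with the same label in the same region: for instance, many compressed pairs over one region may point to a collapsed repeat, while many stretched pairs may point to extra sequence inserted between the reads.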