The New Frontier of Microbiome Science: Computational Challenges and Solutions
The microbiome refers to the full community of microorganisms in a particular environment, such as the bacteria living in the human gut. Microbiome research is a new frontier of scientific exploration. Studies that use big-data technology to examine the genomes of hundreds of organisms simultaneously belong to a field called metagenomics. As this field matures, scientists increasingly recognize the need for sophisticated tools and technologies to decipher the complexities hidden within these microbial ecosystems.
To that end, on April 2, Mihai Pop, a professor in the Department of Computer Science and the director of the Institute for Advanced Computer Studies at the University of Maryland, gave a talk on the analytical challenges of microbiome science and how they can be combated by computational methods. The talk focused on the pivotal role of computational tools in unraveling the secrets of microbiomes and addressing the challenges associated with analyzing the vast datasets generated by these studies.
A key focus of metagenomics is the taxonomic classification of different microbes. The primary method for organizing and classifying microbes is comparing them to a database of known organism sequences. These similarity-based techniques are especially effective when the organisms in the sample are well represented in the database. Pop mentioned one of the most common similarity search methods used to classify microorganisms, the Basic Local Alignment Search Tool (BLAST). However, BLAST often misidentifies the closest neighbor to the microorganism of interest; the “most similar” organism according to BLAST may not actually be the most closely related.
“How can we find what’s the real [closest hit] if there is a hit? The E-value is misleading,” Pop explained during the talk, suggesting that BLAST may not always accurately identify the organism most similar to the microorganism of interest.
The E-value Pop mentioned is a statistic reported by BLAST that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. Pop also emphasized that many of these problems were discovered only years after BLAST came into common use.
“These are things we found out 20, 30, 40 years after [the computational tool] was written ... even though something has been used for many, many years, there [are] still things to learn about it,” Pop explained.
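To make the E-value discussion concrete, the short Python sketch below uses the standard textbook approximation that relates an alignment's bit score to the number of chance hits: E = m × n × 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. The function name and the example numbers are illustrative assumptions, and real BLAST applies corrections (such as effective lengths) that are left out here.

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified sketch: BLAST itself uses "effective" query and database
    lengths, which this illustration ignores.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment (same bit score) searched against two hypothetical
# databases that differ 100-fold in size.
small_db = expected_chance_hits(bit_score=40.0, query_length=250, database_length=10**8)
large_db = expected_chance_hits(bit_score=40.0, query_length=250, database_length=10**10)

print(f"E-value against the smaller database: {small_db:.3g}")
print(f"E-value against the larger database:  {large_db:.3g}")
```

The alignment is identical in both searches, yet the E-value grows in proportion to the size of the database. The number describes how surprising a hit is by chance, not how closely related the two organisms are, which is one way to read Pop's warning.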
One of the other main challenges Pop highlighted is how the structure of biological databases affects scientists’ ability to reliably reveal insights on the microbiome. Reference databases are not all-encompassing. Many microorganisms cannot be cultured in labs, and a large proportion of those that can have not been sequenced or added to these reference databases. Consequently, not all environmental organisms are included in the sequence database, which limits the accuracy of similarity-based methods.
These problems are further compounded by the lack of contiguous information available in most sequencing datasets. Many sequencing analyses have to begin by stitching many short sequence fragments together into longer, contiguous sequences. Assembling the sequencing data is also an unstandardized process, as new technologies for assembling genomes are constantly being developed. These limitations can impede researchers’ ability to derive meaningful insights and connections from microbiome datasets, because they substantially limit precision and decrease the accuracy of reference databases.
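As a rough illustration of what stitching fragments together involves, the Python sketch below repeatedly merges the two fragments that share the longest overlap. It is a toy greedy approach written for this article, not the method of any particular assembler; the function names, the minimum-overlap threshold and the example fragments are all assumptions, and real assemblers must also handle sequencing errors, repeats and reverse-complement strands.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap: int = 3):
    """Merge the best-overlapping pair until no pair overlaps enough."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; the remaining pieces stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up fragments that overlap end to end.
assembled = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(assembled)
```

When fragments do not overlap, the sketch simply leaves them as separate pieces, which mirrors the lack of contiguous information described above.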
Pop then transitioned to discussing algorithms and software approaches to sequence similarity. Many current classification tools employ the most recent common ancestor (MRCA) method. MRCA provides an annotation (marking of a specific feature of the DNA sequence) at the most specific taxonomic level that still encompasses all of the possible matches for a sequence. However, this means that software using MRCA makes only a few classifications at the genus or species level. When a sequence matches several related organisms, it is labeled only at the family, class or phylum level, so closer relationships between microbes cannot be resolved.
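The MRCA idea can be sketched in a few lines of Python. The example below assumes each database hit comes with a lineage listed from the broadest rank to the most specific one, and it returns the part of the lineage that all hits share. The function name and the simplified lineages are illustrative assumptions, not code from any specific classifier.

```python
def most_recent_common_ancestor(lineages):
    """Return the shared leading ranks of several root-to-leaf lineages."""
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break  # the lineages diverge below this point
    return shared


# Simplified lineages for three hits a single sequence might match.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Shigella", "Shigella flexneri"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
]

annotation = most_recent_common_ancestor(hits)
print(annotation[-1])
```

Because the three hits disagree at the genus level, the sequence can only be labeled at the family level, which is the loss of resolution described above.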
To address this challenge, Pop shared efforts from his own lab to develop advanced computational tools tailored to microbiome analyses. He focused on the Ambiguous Taxonomy eLucidation by Apportionment of Sequences (ATLAS). ATLAS is a data-driven database partitioning method, which aims to divide a large dataset into smaller, more easily analyzable datasets. ATLAS groups sequences into biologically meaningful partitions by querying each sequence against a reference database and then identifying and clustering the hits that are considered significant. ATLAS also represents a shift away from the MRCA method.
As the talk concluded, Pop emphasized the critical need for interdisciplinary collaboration to advance microbiome research. Integrating expertise from fields such as biology, computer science and statistics is essential for developing innovative solutions to microbiome-related challenges. This interdisciplinary approach allows researchers to harness the power of computational tools to extract meaningful patterns and associations from microbiome datasets.
This article by Shreya Tiwari was originally published in The News-Letter, a student publication at Johns Hopkins University.