If your research relies upon PacBio assemblies, you must know that those may not be the genomes you are looking for 🪄 The most recent work from our group led by Florian Trigodet in collaboration with Jill Banfield's group explains why that might be the case for applications of long-read sequencing to metagenomics.
Long-read assemblers play a significant role in turning individual long reads into long genomic segments, and they have tremendous implications for our downstream work. Last year we applied some of the new assemblers to our PacBio datasets from marine samples. Results were extremely exciting at first -- many circular elements, near-complete genomes, etc. But a closer inspection made us realize that the results didn't make much sense in some cases.
Some 'circular' contigs were just fragments of larger genomes, and there were way too many short circular contigs that didn't resemble plasmids or viruses. Instead, most of these short circular sequences also looked like pieces of larger genomes.
As is often the case when faced with surprising results, we thought that we were probably wrong in raising red flags, and that those weird contigs were not weird at all. Perhaps this was a new perspective coming to light thanks to the availability of long reads. But then during a conversation with Jill Banfield I learned that her group was making similar puzzling observations, and these results didn't make sense to them either. Then we knew it was the assemblers that were likely in the wrong, not us. So we decided to dig deeper. This study is the result of that.
Long reads, especially those from PacBio, are great. And long-read assemblers generally do a great job. We can't thank the developers of these algorithms enough for that: they are the heroes here, taking on the notoriously difficult technical task of metagenomic assembly to help us push the boundaries of what we can learn from complex environments.
BUT we also know now that it is not uncommon to find in long-read assembly results multi-domain chimeras, prematurely circularized sequences, haplotyping errors, excessive repeats, and even phantom sequences that don't match any of the input long reads.
If you think you are in the same boat as us, please read our study here,
https://lnkd.in/efgTSM2z
find our bioinformatics workflow that supports the findings in this study here,
https://lnkd.in/e358_AfK
or skim through the Bluesky thread by Florian Trigodet that shares some pretty figures here:
https://lnkd.in/eu2YUfyW
Thank you! 😇