Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson, Scott Shenker, Ion Stoica, Richard M. Karp, and Taylor Sittler
1 November 2011
We present the Scalable Nucleotide Alignment Program (SNAP), a new short and long read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10–100× faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST’s. However, SNAP greatly reduces the number and cost of local alignment checks performed through several measures: it uses longer seeds to reduce the false positive locations considered, leverages larger memory capacities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today’s fast aligners ignore. We calculate that SNAP can align a dataset with 30× coverage of a human genome in less than an hour for a cost of $2 on Amazon EC2, with higher accuracy than BWA. Finally, we describe ongoing work to further improve SNAP.
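To make the two central ideas concrete, the following is a minimal Python sketch of seed-based alignment with a hash index and an edit-distance check that gives up early. It is an illustration under stated assumptions, not SNAP's implementation: the function names (`build_index`, `bounded_edit_distance`, `align`), the parameters `seed_len` and `max_dist`, and the choice of non-overlapping seeds are all hypothetical. For simplicity, each candidate is compared against a genome window of the same length as the read, which only approximates the handling of indels.

```python
from collections import defaultdict


def build_index(genome, seed_len):
    """Map every seed_len-mer in the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Levenshtein distance between a and b, or None if it exceeds limit.

    The minimum of each DP row never decreases from one row to the next,
    so once a whole row is above the limit the candidate can be rejected
    without finishing the computation.
    """
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        if min(cur) > limit:
            return None  # early rejection
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def align(read, genome, index, seed_len, max_dist):
    """Return (start, distance) for the best candidate location, or None."""
    # Collect candidate start positions from non-overlapping seeds of the read.
    candidates = set()
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, ()):
            start = hit - offset
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)

    best = None
    limit = max_dist
    for start in sorted(candidates):
        window = genome[start:start + len(read)]
        dist = bounded_edit_distance(read, window, limit)
        if dist is not None and (best is None or dist < best[1]):
            best = (start, dist)
            limit = dist  # later candidates must match at least this well
    return best
```

A caller would build the index once with `build_index(genome, seed_len)` and then call `align` for each read. In this sketch, a longer `seed_len` yields fewer index hits per seed and therefore fewer candidates to verify, and tightening `limit` after each successful match lets later candidates be rejected sooner.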