Christos Kozanitis, Vineet Bafna, Ravi Pandya, and George Varghese
Genome sequence data is now “Big Data” in both volume and velocity. Joined with medical records, genome data can be mined for insights into treating disease. Genomics today is dominated by batch processing: even simple analytical questions take days to answer. We propose instead that genomics be made interactive, so that queries on a large genome database in the cloud are answered across the network in seconds. Towards this vision, we introduce a query language, Genome Query Language (GQL), in which intervals are first class and joins are based on intersection, not equality. GQL can be used to query for large structural variations on the TCGA cancer archive using only 5-10 lines of high-level code that take around 60 seconds to execute in the Azure cloud on an input BAM file of 83 GB. GQL can be deployed incrementally: its results can be displayed on the UCSC browser, and an existing variant caller can be refactored to use GQL for a 6x speedup. Our paper focuses on the system design and five key optimizations, including cached parsing, lazy joins, materialized views, and chromosomal parallelism, that speed up query processing by 100x. We also reflect on 3 years of experience designing and using GQL.
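To illustrate what a join based on intersection rather than equality means, the following is a minimal Python sketch, not GQL syntax and not the GQL implementation. The name `intersect_join` and the half-open `(start, end)` interval representation are assumptions made for this example only. Two intervals are paired whenever they overlap, which is the role key equality plays in a conventional relational join.

```python
from typing import List, Tuple

# Hypothetical representation: a half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def intersect_join(left: List[Interval],
                   right: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every pair (a, b), a from left and b from right, that overlaps."""
    left = sorted(left)
    right = sorted(right)
    pairs: List[Tuple[Interval, Interval]] = []
    active: List[Interval] = []  # right intervals that may still overlap
    j = 0
    for a in left:
        # Admit right intervals that start before the current left interval ends.
        while j < len(right) and right[j][0] < a[1]:
            active.append(right[j])
            j += 1
        # Left intervals are sorted by start, so a right interval that ends at
        # or before a's start cannot overlap any later left interval either.
        active = [b for b in active if b[1] > a[0]]
        # An earlier, longer left interval may have admitted right intervals
        # that start after a ends, so re-check the start bound when emitting.
        pairs.extend((a, b) for b in active if b[0] < a[1])
    return pairs
```

For example, joining the left intervals `(0, 10)` and `(20, 30)` with the right intervals `(5, 25)`, `(10, 20)`, and `(40, 50)` pairs both left intervals with `(5, 25)` only, since `(10, 20)` merely touches its neighbors under the half-open convention.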