Creation of a Relational Database for Identifying Potential Functional DNA Sequence Motifs in A. thaliana and Other Genomes

Undergraduate Honors thesis, Dept. of Biology, Dartmouth College

Dartmouth Kemeny Computing prize and Reed Biology Award

Understanding the molecular basis of gene expression regulation is an important problem in biology. A fundamental sub-problem is understanding how DNA sequence information directs that regulation. With the increasing availability of fully sequenced genomes, we can begin to look directly at the DNA sequence for answers. We have created a relational database that catalogs the position of every 9-mer in close proximity to every gene in the Arabidopsis thaliana genome. This allows us to search for motifs that are non-randomly distributed throughout the genome and so may serve some biological function. We have also created a website that serves as an interface to the database. Here we discuss the structure and design of the database and website and how they can enable a biologist to identify putative cis-acting regulatory motifs. We show specific methods for identifying non-random motifs at the genomic level that are likely to be involved in basal regulation or in the regulation of large sets of genes. We also describe methods that are specific to sets of co-regulated genes. Using our database and website, we readily detected a number of known cis-acting regulatory motifs, as well as a number of motifs that may represent novel elements. Introns were also analyzed, and known splicing elements were readily found. Finally, a group of co-regulated phase-0 clock genes was analyzed: known regulatory motifs were examined from a genomic perspective, and a potentially novel motif was identified.
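
To make the central idea concrete, the following is a minimal sketch of how 9-mer positions near genes might be cataloged in a relational table and then queried. It is not the schema or code used in the thesis: the table name `kmer_hit`, the functions `kmer_positions`, `build_db`, and `genes_with_motif`, the use of SQLite, and the toy sequences are all assumptions made for illustration.

```python
import sqlite3

K = 9  # motif length named in the abstract


def kmer_positions(seq, k=K):
    """Yield (kmer, offset) for every k-mer in seq, skipping any k-mer
    that contains a character other than A, C, G, or T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer, i


def build_db(regions):
    """Build an in-memory table of k-mer hits.

    `regions` is a hypothetical input: a dict mapping a gene identifier
    to the sequence in close proximity to that gene.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kmer_hit (gene_id TEXT, kmer TEXT, pos INTEGER)")
    conn.execute("CREATE INDEX idx_kmer ON kmer_hit (kmer)")
    for gene_id, seq in regions.items():
        conn.executemany(
            "INSERT INTO kmer_hit VALUES (?, ?, ?)",
            ((gene_id, kmer, pos) for kmer, pos in kmer_positions(seq)),
        )
    conn.commit()
    return conn


def genes_with_motif(conn, kmer):
    """Return the sorted list of genes whose nearby sequence contains kmer."""
    rows = conn.execute(
        "SELECT DISTINCT gene_id FROM kmer_hit WHERE kmer = ? ORDER BY gene_id",
        (kmer,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    toy_regions = {
        "geneA": "AAAATATCTGG",
        "geneB": "CCAAAATATCT",
        "geneC": "GGGGGGGGGGGG",
    }
    conn = build_db(toy_regions)
    print(genes_with_motif(conn, "AAAATATCT"))
```

With a table of this shape, asking whether a motif is non-randomly distributed reduces to comparing how many genes (or which positions) contain it against what a background model would predict; the choice of background model is left open here.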