Rapid advances in sequencing technology have led to the accumulation of millions of DNA sequences. These sequences are collected from thousands of genomes that (ideally) sample the ‘tree of life’. I will briefly discuss the ‘minimal set of instructions’ by which a linear sequence is transformed into a functional protein. What happens when the statistical noise is so high that classical procedures for predicting protein sequences fail? I will focus on the challenge of identifying short proteins that remain buried in the genomic data. For illustration, I will take you on a ‘treasure hunt’ for short proteins.
Many short proteins share fuzzy features that are common to most animal venoms. I will discuss the limitations of classical tools based on string comparison or pattern finding for identifying short proteins. For this task, statistical machine learning methods have proved useful in identifying hidden bioactive sequences in several genomes. Such sequences are attractive candidates for novel therapies. The test case of short proteins illustrates the importance of a cycle that starts with a biological hypothesis, proceeds to a computational formulation, and ends with experimental validation. Finally, I will discuss our genomes with respect to our ‘partners’ (viruses, bacteria). Once the interaction of these genomes is considered, the source of the dynamic nature of human evolution becomes evident.