While GPUs provide low-cost and efficient platforms for accelerating massively parallel applications, tedious tuning is required to maximize performance. In addition to a complex programming model, GPUs lack performance portability across systems with different runtime properties. Programmers usually make assumptions about runtime properties in order to optimize their code; however, if any of these properties change during execution, the optimized code performs poorly. To alleviate this limitation, several implementations of the same application are needed, each optimized for a different set of runtime properties. It is not practical, however, for the programmer to write and tune a separate version of the same code for every runtime condition. In this talk, I will show how several runtime properties, such as device configuration, input size, data dependencies, and data values, affect the performance of a fixed implementation. Next, I will present a static and dynamic compiler framework that relieves the programmer of the burden of fine-tuning different implementations of the same code. With this framework, the programmer writes a program once; a static compiler generates several versions of the data-parallel application, each exposing tuning parameters, and a runtime system selects the best version and fine-tunes its parameters based on the observed runtime properties. Finally, I will discuss open challenges and my future plans for providing performance portability across different sets of accelerators.
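The select-and-tune idea can be sketched in plain Python. This is a minimal illustration, not the framework's actual design: the two "versions" stand in for differently tuned GPU kernels, and the size-bucket heuristic and all names are my own assumptions for the example.

```python
import time

# Two hypothetical versions of the same reduction, each tuned for a
# different runtime condition (illustrative stand-ins for GPU kernels).
def reduce_small(data):
    # Tuned for small inputs: a simple flat sum.
    return sum(data)

def reduce_blocked(data, block_size=1024):
    # Tuned for large inputs: block-wise partial sums, mimicking a tiled
    # kernel whose block_size is a tunable parameter.
    partials = [sum(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return sum(partials)

class RuntimeSelector:
    """Benchmark the available versions for an observed input-size bucket,
    cache the winner, and dispatch to it on later calls."""
    def __init__(self, versions):
        self.versions = versions
        self.best = {}  # size bucket -> name of fastest version

    def __call__(self, data):
        bucket = len(data).bit_length()  # coarse input-size bucket
        if bucket not in self.best:
            timings = {}
            for name, fn in self.versions.items():
                start = time.perf_counter()
                fn(data)
                timings[name] = time.perf_counter() - start
            self.best[bucket] = min(timings, key=timings.get)
        return self.versions[self.best[bucket]](data)

reduce_auto = RuntimeSelector({"small": reduce_small,
                               "blocked": reduce_blocked})
```

Calling `reduce_auto(list(range(10000)))` benchmarks both versions once for that size bucket, then reuses the cached choice for subsequent inputs of similar size; a real framework would additionally sweep the tuning parameters (here, `block_size`) of the chosen version.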