Generating Fast String Manipulating Code Through Transducer Exploration and SIMD Integration

MSR-TR-2011-124 |

Security sanitizers have long been known to be very difficult to implement correctly. Moreover, with the rise of the web, developers need string manipulating functions in both “server” and “client” languages. Hand-writing these functions separately is an open invitation to bugs. At the same time, auto-generated code will not be accepted unless it is significantly faster than previous hand-written code. We address this problem with two complementary approaches centered around bek, a domain-specific language for writing complex string manipulation routines. First, bek compiles the input domain-specific program into an intermediate format consisting of symbolic finite state transducers, which extend classical transducers with symbolic predicates. In this paper, we present a novel algorithm that we call “exploration” which performs a symbolic partial evaluation of these transducers to obtain simplified, stateless versions of the original program. These simplified versions can be lifted back to bek, and from there compiled to C#, C, or JavaScript. Second, we explore how SIMD instructions can be combined with bek compilation to C and C++, enabling developers to access parallel features of modern architectures without needing to tweak the C compiler or hand-write assembly. We have implemented our code generation pipeline for bek code corresponding to several real string sanitizers. We use an automatic testing approach to compare our generated code to the original C# implementations and found no semantic deviations. Our generated C# code outperforms the previous hand-tuned code by a factor of up to 2.5. For C code with SIMD, we see speedups of 2.5 times compared to native C code for the same sanitizer.