Reverse Engineering the Twelve Days of Christmas

Thomas Ball
December 23, 1998
Having received the  "Twelve Days of Christmas" obfuscated C program as a Christmas present once too many times with out "opening" it, the author decides to take on Jim Coplien's challenge to reverse engineer it.  The use of pretty printing and path profiling technology was found to be a very helpful aide in understanding the structure and the behavior of this program.


[ The Present Challenge | A Formal Model of The Twelve Days of Christmas | Pretty Printing the Program | The Structure of the Program | Path Profiling the Program | The Reverse Engineered Program | Summary ]

The Present Challenge

A recent e-mail from our colleague Jim Coplien (aka "Cope") reads:
> From: "Jim Coplien" <cope@research.bell-labs.com>
> Date: Tue, 22 Dec 1998 13:03:56 -0600
> To: Lalita Jagadeesan <lalita@research.bell-labs.com>, god, tball
> Subject: a program for your flow and testing tools
>
> /*
> * seriously -- run it :-)
> */
> #include <stdio.h>
> main(t,_,a)
> char *a;
> {
> return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):
> 1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
> main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
>"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\
> ;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
> q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;#\
> ){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
> iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
> ;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
> }'+}##(!!/")
> :t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
> :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
> "!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);
> }
Here is the output of the program, which is written in the C language. We know what the program does. The question we will address is how it accomplishes the printing of the poem.  Of course, we easily could write our own program to print the "Twelve Days of Christmas", but that's sort of like going out to buy yourself a gift you know someone has already purchased for you.  Also, in reverse engineering the program, we wish to maintain as much of the original program's computational signature as possible. Whenever possible, we will rewrite the program in the spirit of the original program, rather than substituting a radically different piece of code in place of one we don't happen to like.

For those of you who can't wait until Christmas day to open your presents, here is the reverse engineered program.


A Formal Model of The Twelve Days of Christmas 

Before taking on a reverse engineering task, it is important to have some model in mind to help guide the process. The "Twelve Days of Christmas" is all about gifts and counting (so appropriate for the 90's!), so we approach the poem by identifying various quantities that arise from the poem's natural structure.

As is clear from a cursory reading, the poem takes O(N*N) time to read and O(N*N) space to write, where N is the number of gifts (12, to be precise).  We prefer to derive an exact count of the number of times gifts are mentioned in the poem, which is quite straightforward. A gift with ordinal value t is mentioned 13-t times in the poem. For example, "five gold rings" occurs 13-5=8 times.  Summing over all gifts yields 1+2+...11+12 = 13*6 = 78 mentions of gifts in total.  Of course, the partridge is in all the verses. Subtracting its count (12) from the total yields 66 mentions of non-partridge gifts.

A commonality and variability analysis (ala Weiss) of the poem indicates that each verse has common beginning and ending strings with variabilities in the day and list of gifts on a given day. A template for the poem that captures this follows:

On the <ordinal> day of Christmas my true love gave to me
 <list of gift phrases, from the ordinal day down to the second day>
 and a partridge in a pear tree.
We note one negative variability (ala Cope) in the poem. The first verse is:
On the first day of Christmas my true love gave to me
 a partridge in a pear tree.
which does not fit the template exactly, for the last line does not begin with the word "and", as specified in the template.

Nonetheless, using the above template, we derive a count of the number of unique strings that a program to print the poem may contain. This would be 3 strings for the common structure ("On the", "day of Christmas...", "and a partridge ..."), 12 strings for the ordinals, and 11 strings for the second through twelfth gifts.  This means that the program might contain approximately 3+12+11 = 26 unique strings and print approximately 3*12 + 12 + 66 = 114  strings.

Finally, a character count of the output of the program reveals that it is comprised of 2358 characters.

To recap, the important numbers we'll use to guide our search are the:


Pretty Printing the Program

To understand a program, it is helpful to be able to read it.  The given program is barely readable, even for those very familiar with the C language. Our first task was to pass the code through a C compiler (in this case, the author) and pretty print it, using indentation and explicit parenthesization to make the code more readable. Here is the result.  It's still not very pretty, as the original author used conditional expressions (e1 ? e2 : e3) and list expressions (e1, e2, e3) rather than if-then-else statements and statement blocks.  So, we took the program and rewrote it without the use of conditional expressions or list expressions to get the following pretty program.  In addition, we eliminated a few pieces of dead code. For example, in the translation of the list expression:
((t < 3
  ? main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a))
  : 1),

 (t < _
  ? main(t+1,_,a)
  : 3),

 (main(-94,-27+t,a)
          && (t==2
              ? ( _ < 13
                 ? main(2,_+1,"%s %d %d\n")
                 : 9)
              : 16)))

the emboldened constants 1 and 3 are dead code, since they are not consumed by any other expression. Thus the first and second expressions in the expression list are translated as:
  if (t < 3)
    main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));

  if (t < _ )
    main(t+1,_,a);


The Structure of the Program

As we can see, the pretty program consists of one function main which has three arguments and calls itself repeatedly.  It truly is a function, as it does not assign to any variables.  The function main achieves its purpose based solely on the values passed to it.  The first argument passed to main is the count of the number of arguments on the command line (including the name of the program itself). Thus, when the program is run with no arguments, the value of the parameter t will be 1.  There are no explicit loops; looping is achieved through recursion. The selection of different legs of the function seems to be driven by the parameter t. The program contains two large strings, which appear to encode the text of the poem.


Path Profiling the Pretty Program

Recall that we are going to use various counts derived from the output of the program to try and understand the program.  So, we use a profiling tool to capture execution counts of the pretty program. The idea is that these execution counts will help us identify which parts of the program are responsible for which parts of the poem. For example, a program element with an execution count of 11 or 12 may indicate an entity involved in the control of the number of verses, while an element with an execution count of 2358 is most likely involved in printing characters.

In general, a profiling tool adds snippets of code to a program that count the number of times entities in a program (such as statements, expressions, etc) execution in a given.  In particular,  we applied the Hot Path Browser (HPB) tool of Ball, Larus and Rosay to the transformed program. This tool instruments programs to record and display Ball/Larus path profiles. A Ball/Larus path profile counts how many times each acyclic intraprocedural path executes.

The following screen shapshot shows the HPB tool applied to the pretty program. The upper left pane of the tool shows the statistics about each path that executed. By coincidence, exactly 12 paths (out of a total of 24 possible paths) executed!  The paths are listed in ascending order of frequency.  The path with id 13 has been selected (red line). This path is highlighted in the source code view on the right side of the GUI snapshot.

Note how the paths are clustered by frequency. With a manual examination of the code, we identified the computational signature of each cluster of paths.

This analysis revealed 6 major clusters of paths.

 


The Reverse Engineered Program 

Using the knowledge gained from the path profiles and manual examination of the program, we reverse engineered the program.  We strove to keep the recursive structure of the program intact, but to use different functions to represent the different tasks of the original program, as captured by the clustering of the paths.  We did not change the values of the two relevant text strings (the list of strings and the translation mapping for the encoding of the characters) at all.  The original program used the value 2 to represent the first day of Christmas. We shifted down to 1 to match the poem.

There are 7 functions in the new program, corresponding almost exactly to the six clusters of paths identified above:

The new program has the exact same output as the old, and all of the performance disadvantages as well. To show that we have (in some sense) captured the essence of the original program, we path profiled the new program to see if the path frequencies in the new match up with those in the original.  The screen snapshot of the new program shows that its path spectra closely matches that of the original program, although there are 15 paths executed of a total of 17 paths. For most paths, there is a one-to-one correspondence with a path in the original code. For example, the selected path in new screen snapshot (path 1 of procedure inner_loop) has frequency 55. It represents the same computation as path 13 in the old screen snapshot.


Summary 

The obfuscated "12 Days of Christmas" program uses three simple tricks to hide its purpose:

However, what makes the program most difficult to understand is the encoding of multiple "functions" into the single function main. A well known folk theorem in computer science is that any program can be transformed into a semantically equivalent program of one procedure containing one switch statement inside a while loop. The transformation involves rewriting the program so that each case of the switch is an elementary operation of the original program followed by an assignment to a program counter variable that communicates to the switch statement the next operation to execute. The while loop terminates when the program counter reaches the last operation in the program.

A corollary to this theorem is that any program can be rewritten into a program consisting of a single recursive function containing only conditional statements, as is the case with the obfuscated program. The first parameter to the function main takes on the role of the program counter (for certain value ranges, it also specifies which string to print). We used path profiling technology to help separate out the "true" functions that this single function implements, aided by the fact the poem really is nothing more than a profile of gifts received.

Happy Holidays!