Thomas Ball
December 23, 1998
Having received the "Twelve Days of Christmas" obfuscated C program as a Christmas present once too many times with out "opening" it, the author decides to take on Jim Coplien's challenge to reverse engineer it. The use of pretty printing and path profiling technology was found to be a very helpful aide in understanding the structure and the behavior of this program.
> From: "Jim Coplien" <cope@research.bell-labs.com>Here is the output of the program, which is written in the C language. We know what the program does. The question we will address is how it accomplishes the printing of the poem. Of course, we easily could write our own program to print the "Twelve Days of Christmas", but that's sort of like going out to buy yourself a gift you know someone has already purchased for you. Also, in reverse engineering the program, we wish to maintain as much of the original program's computational signature as possible. Whenever possible, we will rewrite the program in the spirit of the original program, rather than substituting a radically different piece of code in place of one we don't happen to like.
> Date: Tue, 22 Dec 1998 13:03:56 -0600
> To: Lalita Jagadeesan <lalita@research.bell-labs.com>, god, tball
> Subject: a program for your flow and testing tools
>
> /*
> * seriously -- run it :-)
> */
> #include <stdio.h>
> main(t,_,a)
> char *a;
> {
> return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):
> 1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
> main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
>"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\
> ;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
> q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;#\
> ){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
> iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
> ;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
> }'+}##(!!/")
> :t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
> :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
> "!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);
> }
For those of you who can't wait until Christmas day to open your presents, here is the reverse engineered program.
Before taking on a reverse engineering task, it is important to have some model in mind to help guide the process. The "Twelve Days of Christmas" is all about gifts and counting (so appropriate for the 90's!), so we approach the poem by identifying various quantities that arise from the poem's natural structure.
As is clear from a cursory reading, the poem takes O(N*N) time to read and O(N*N) space to write, where N is the number of gifts (12, to be precise). We prefer to derive an exact count of the number of times gifts are mentioned in the poem, which is quite straightforward. A gift with ordinal value t is mentioned 13-t times in the poem. For example, "five gold rings" occurs 13-5=8 times. Summing over all gifts yields 1+2+...11+12 = 13*6 = 78 mentions of gifts in total. Of course, the partridge is in all the verses. Subtracting its count (12) from the total yields 66 mentions of non-partridge gifts.
A commonality and variability analysis (ala Weiss) of the poem indicates that each verse has common beginning and ending strings with variabilities in the day and list of gifts on a given day. A template for the poem that captures this follows:
On the <ordinal> day of Christmas my true love gave to meWe note one negative variability (ala Cope) in the poem. The first verse is:
<list of gift phrases, from the ordinal day down to the second day>
and a partridge in a pear tree.
On the first day of Christmas my true love gave to mewhich does not fit the template exactly, for the last line does not begin with the word "and", as specified in the template.
a partridge in a pear tree.
Nonetheless, using the above template, we derive a count of the number of unique strings that a program to print the poem may contain. This would be 3 strings for the common structure ("On the", "day of Christmas...", "and a partridge ..."), 12 strings for the ordinals, and 11 strings for the second through twelfth gifts. This means that the program might contain approximately 3+12+11 = 26 unique strings and print approximately 3*12 + 12 + 66 = 114 strings.
Finally, a character count of the output of the program reveals that it is comprised of 2358 characters.
To recap, the important numbers we'll use to guide our search are the:
((t < 3the emboldened constants 1 and 3 are dead code, since they are not consumed by any other expression. Thus the first and second expressions in the expression list are translated as:
? main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a))
: 1),(t < _
? main(t+1,_,a)
: 3),(main(-94,-27+t,a)
&& (t==2
? ( _ < 13
? main(2,_+1,"%s %d %d\n")
: 9)
: 16)))
if (t < 3)
main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));if (t < _ )
main(t+1,_,a);
In general, a profiling tool adds snippets of code to a program that count the number of times entities in a program (such as statements, expressions, etc) execution in a given. In particular, we applied the Hot Path Browser (HPB) tool of Ball, Larus and Rosay to the transformed program. This tool instruments programs to record and display Ball/Larus path profiles. A Ball/Larus path profile counts how many times each acyclic intraprocedural path executes.
The following screen shapshot shows the HPB tool applied to the pretty program. The upper left pane of the tool shows the statistics about each path that executed. By coincidence, exactly 12 paths (out of a total of 24 possible paths) executed! The paths are listed in ascending order of frequency. The path with id 13 has been selected (red line). This path is highlighted in the source code view on the right side of the GUI snapshot.
Note how the paths are clustered by frequency. With a manual examination of the code, we identified the computational signature of each cluster of paths.
Using the knowledge gained from the path profiles and manual examination of the program, we reverse engineered the program. We strove to keep the recursive structure of the program intact, but to use different functions to represent the different tasks of the original program, as captured by the clustering of the paths. We did not change the values of the two relevant text strings (the list of strings and the translation mapping for the encoding of the characters) at all. The original program used the value 2 to represent the first day of Christmas. We shifted down to 1 to match the poem.
There are 7 functions in the new program, corresponding almost exactly to the six clusters of paths identified above:
The obfuscated "12 Days of Christmas" program uses three simple tricks to hide its purpose:
A corollary to this theorem is that any program can be rewritten into a program consisting of a single recursive function containing only conditional statements, as is the case with the obfuscated program. The first parameter to the function main takes on the role of the program counter (for certain value ranges, it also specifies which string to print). We used path profiling technology to help separate out the "true" functions that this single function implements, aided by the fact the poem really is nothing more than a profile of gifts received.
Happy Holidays!