Finding Binary Clones with Opstrings & Function Digests: Part I
By Andrew Schulman
Dr. Dobb’s Journal, July 2005
Stop me if you’ve heard this one. It’s his first night, and the new guy hears the other inmates shouting out numbers. “1375” shouts one; the others explode with laughter. Another yells “3811,” and there are chuckles of recognition. The new guy asks his cellmate what’s up, and learns that everyone else has been there so long that they know all the jokes by heart; rather than repeat them, they have assigned them numbers, and now just tell the number instead of the joke. The next night, trying to fit in, our hero pulls a number off the top of his head, shouts out “2342” – and it falls flat. Nothing; it’s dead out there tonight. His cellmate informs him that he told it wrong, or that some people don’t know how to tell a joke, or that no one gets it, or…, well, you get the idea. Same joke really, regardless of the punchline.
In this joke, a number arbitrarily assigned to a joke is somehow as expressive as the joke itself, like humming along to the hexdump of an MP3 file. The joke also suggests that a joke (or a story or article) is probably not as unique as it seems; it reduces, after all, to a number. Folklorists have standard indices of motifs and tale types in which, for example, “Cruel Stepmother” is S31, “magic object received from fairy” is D813, and “identification by fitting of shoes” is H36.1. Though not by design, traditional stories are put together out of such motifs, which are, in a sense, low-level design patterns.
And so, to software: It would be useful, when looking at a program, to immediately know which parts of it are truly unique to that program, and which are “boilerplate”: stereotyped material that appears, more or less verbatim, time and again. Much software is cloned from other software. One study found even expert programmers producing an average of four code “clones” per hour (http://tlau.org/research/papers/M.Kim-EthnographicStudyofCopyPaste-ISESE.pdf). If each common low-level design pattern had an assigned number, we could answer questions about the code’s uniqueness or lack thereof, about how self-similar it is (that is, how much of the code appears more than once in the entire system), about how one part of it relates to another, or about how one version relates to the previous version.
It should be possible to identify software clones and boilerplate, even without the source code. Take a Microsoft Windows XP CD. Microsoft says that XP consists of 45 million lines of code. You probably don’t have the source code, but this CD contains about 310 MB of binary code files. (The binary code on the CD relates to the source code, perhaps as the number of a joke relates to the joke itself.)
It sounds like questions about a program should be unanswerable without the source code. Because the source code is boiled away, as it were, in the process of compilation, trying to recover the source code from the binary program would be like trying to reconstruct a cow out of a hamburger. One of the premises of the Open Source movement is, of course, precisely that something is missing (“Where’s the beef?”) when programs are distributed without their source code.
But given that the processor actually has to run the binary code, this code can’t — at least in a practical, good-enough sense — be all that much of a closed book, any more than source code often is (on the false dichotomy between source and object code, see http://www-2.cs.cmu.edu/~dst/DeCSS/object-code.txt). In fact, examining binary object code may sometimes have advantages over examining source code.
The goal of this and subsequent articles is to construct a tool for building a database of function signatures or fingerprints as found in Windows XP and in major Windows applications such as Microsoft Office. A similar database has been proposed for Java code (http://citeseer.ist.psu.edu/baker98deducing.html). “Signature” here refers not in a C++ sense to the specification of a function, but rather to some characterization of the implementation of a function. With such a database, we should be able to:
- Automatically produce names for boilerplate functions (startup code, runtime library functions, MFC code, and so on) found in disassembly listings.
- Focus copyright-infringement disputes involving binary code by filtering out expected similarities in boilerplate code.
- Pinpoint the migration or “commingling” of code from one file to another, and help trace the evolution of a file from one version or patch to the next.
- Find candidates for placement in a DLL or in the OS by finding functions that are duplicated in many different programs.
- Find useful functions in DLLs that ought to be documented by locating functions in EXE files that duplicate those in DLLs.
- Create graphs of the system by grouping together files that share a large amount of nonboilerplate code.
- Measure the self-similarity of the system (amount of code that appears more than once in the system).
- Measure the uniqueness of the system (amount of non-boilerplate code that doesn’t appear elsewhere).
- By identifying duplicated functions, find candidates for refactoring (and do housecleaning after a cut-and-paste binge), and locate possible errors (it has been found that even harmless redundancies “strongly correlate with the presence of traditional hard errors”; see http://www.stanford.edu/~engler/tse-redundant.pdf).
- And, for better or worse, given knowledge of a security vulnerability, locate functions in other files that would contain the same vulnerability. (For a superb example, see http://www.sabre-security.com/files/dimva_paper2.pdf.)
These articles focus largely on creating a tool that lets you do these things; the actual “doing” appears in a subsequent article. However, a few glimpses of function database applications are in order.
Figure 1 is an excerpt from a function database built from all of the Win32 code files on the XP CD. Using Microsoft’s symchk utility, PDB files were downloaded from the Microsoft Symbol Server for only a few of the files; a PDB provides names of internal functions in its corresponding Win32 code file. Here, there’s a routine named _SHATransformNS@8 in dlimport.exe, dxmasf.dll, and (toward the bottom of Figure 1) wmvcore2.dll. The second column shows function signatures, and the many identical entries in that column show that over 30 files in the system contain this same piece of code. For example, the function at address 58F4C6BD in this version of wmvdmod.dll is _SHATransformNS@8. (Many of these files have been identified as comprising Windows Media Player; see http://www.microsoft.com/mscorp/legal/eudecision/faq.asp.) The entire database of over half-a-million signatures was sorted (in reverse numeric order), so that matches like this jump out from even a glance at the database file. The first column shows the size of the code, not in bytes but in terms of those elements of the code (the choice of which is the main subject of these articles) used to construct the signature; the binary code happens to be 2882 bytes and disassembles to over 1000 lines of assembler code.
Clearly, this code is part of the Secure Hash Algorithm (SHA). (It is probably fairly similar to the code at http://www.cc.utah.edu/~nahaj/c/sha/sha.c.html, except that a brief inspection of the Microsoft code shows that the numbers 0x5a827999, 0x6ed9eba1, and 0x8f1bbcdc appear over and over – a sure sign that the loops have been unrolled, which explains the very large size of this function.) Apart from showing that this code is duplicated 30 times within XP, Figure 1 also illustrates how, with a function database, debug symbols for one file can provide a name for a function in an entirely different file.
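To make this kind of query concrete, here is a minimal sketch of scanning such a database for widely shared code. It assumes a hypothetical whitespace-separated line format (size, signature, file, address, and an optional symbol name); the real layout behind Figure 1 may well differ, and the database file name is only an example.

```python
# Minimal sketch of querying a function-signature database for code that is
# duplicated across many files, in the spirit of Figure 1. The line format
# assumed here (size, signature, file, address, optional name) is hypothetical.
from collections import defaultdict

def widely_shared(db_path, min_files=10):
    files_by_sig = defaultdict(set)   # signature -> files containing that code
    name_by_sig = {}                  # signature -> a symbol name, if a PDB supplied one
    with open(db_path) as db:
        for line in db:
            fields = line.split()
            if len(fields) < 4:
                continue              # skip malformed lines
            sig, fname = fields[1], fields[2]
            files_by_sig[sig].add(fname)
            if len(fields) >= 5:
                name_by_sig[sig] = fields[4]
    for sig, files in sorted(files_by_sig.items(), key=lambda kv: -len(kv[1])):
        if len(files) >= min_files:
            print(len(files), sig, name_by_sig.get(sig, "?"), sorted(files))

# widely_shared("xp_function_db.txt")   # hypothetical database file
```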
As another application, I queried for signatures representing large functions residing in both EXE and DLL files. This reveals functions in DLLs whose general usefulness is indicated by the same code’s appearance in EXE files; but the EXE has its own complete copy of the code instead of linking to the DLL. Figure 2 shows one example.
There are routines inside untfs.dll, identical copies of which appear in the three EXEs; this suggests that these routines probably ought to be exported from the DLL, documented, and/or linked to by any application files that need this functionality. (Untfs.dll contains functions related to the NTFS filesystem; the three EXEs are filesystem-conversion utilities intended to be run early in the OS boot sequence, possibly before DLLs can be loaded.)
In my work, I’ve used the techniques described here to help uncover possible copyright infringement by first filtering out noninfringing similarities. If you compare every line in every file of even the most unrelated and noninfringing C/C++ source code projects, you will find matching lines such as return 0; or int i;. In Windows source code, you would find identical boilerplate references to wndclass.lpszMenuName = szAppName;. Such boilerplate code must be ignored when comparing source code. There’s an analogous situation when comparing binary code: You need to filter out startup code, runtime library (RTL) code such as printf, wizard code, and so on.
Having an extensive function database makes it easy to create this exclusion list. This database can be created in part simply by compiling one’s own programs with different compilers and linkers, different optimization switches, and so on, and with debugging symbols enabled, so that the function database contains meaningful names. Figure 3 is an example of some Microsoft C RTL function signatures.
Were functions with these signatures encountered in the course of a binary copyright-infringement examination, you would know to exclude them from consideration. They are boilerplate code.
Signatures like those in Figure 3 can also be used when disassembling or debugging. The bold line means, for instance, that any function whose signature (computed as described in these articles) is 8D71CE1651AD88B045B36FB9EBE0109F is in fact a copy of a particular Microsoft implementation of _strlen. All the functions in Figure 4, otherwise nameless, are now known to be _strlen.
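As a minimal sketch of how such a table might be used, the following assumes a hypothetical in-memory map from signature to runtime-library name, populated (as described above) from one’s own programs compiled with symbols. The one signature value shown is the _strlen digest quoted in the text; everything else is illustrative.

```python
# Sketch: a table of known boilerplate signatures (in the spirit of Figure 3)
# used both to name otherwise-anonymous functions and to exclude boilerplate
# from a copyright-infringement comparison. The table contents and the
# (address, signature) input format are hypothetical.
KNOWN = {
    "8D71CE1651AD88B045B36FB9EBE0109F": "_strlen",   # value quoted in the text
    # ... more entries gathered by compiling one's own code with symbols ...
}

def label_or_exclude(functions):
    """functions: iterable of (address, signature) pairs from a target file."""
    for addr, sig in functions:
        name = KNOWN.get(sig)
        if name:
            print(f"{addr}: {name} (boilerplate; ignore when comparing)")
        else:
            print(f"{addr}: unknown; worth a closer look")
```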
This same idea of using function signatures to provide names for disassembly and debugging is used in IDA Pro’s “FLIRT” (Fast Library Identification and Recognition Technology; http://www.datarescue.com/idabase/flirt.htm), and has been discussed in the context of code decompilation (http://www.itee.uq.edu.au/%7Ecristina/tr-2-94-qut.ps). When reverse engineering a program, knowing that some piece of code is actually _strlen or _printf will usually mean that the code need not be disassembled. Reverse engineering is at least as much about knowing what to ignore as it is about taking stuff apart.
Assisting with code disassembly is only a minor side benefit of the database discussed here, however. Our real goal is to create high-level models of systems such as Windows — models that describe what parts of the system most closely belong together, for example — based on nothing more than its binaries. A disassembler is run in the course of building the database, but its output is used by another program and then discarded, not viewed by a human.
File Fingerprints
Where will the signatures in a function database come from? Of course, the string of bytes that comprises a file can easily be used to generate a hash or digest for that file, using an algorithm such as MD5. This digest is typically much smaller than the file itself (hardly surprising in a digestive process), yet generally works as a fingerprint or signature uniquely representing the much larger file. “Generally,” because MD5 collisions have been found such that two different pieces of data can produce the same digest (http://www.doxpara.com/md5_someday.pdf).
Any program that generates MD5 hashes will report that C3C3864DA698F0CC1BE56F9695534DD8 is the digest for a 421-KB Windows file, \windows\system32\LegitCheckControl.dll, Version 1.0.0132.4. Sure enough, typing “C3C3864DA698F0CC1BE56F9695534DD8” into Google results in several hits to pages mentioning LegitCheckControl.dll. (In the absence of a central Microsoft server of file signatures, Google acts as a decent one.) The 32-character MD5 string is a boiled-down representative of the 421-KB file.
Canonically, any MD5 program reports that D41D8CD98F00B204E9800998ECF8427E is the digest for any 0-byte file. Typing “D41D8CD98F00B204E9800998ECF8427E” into Google results in thousands of hits to web pages that note this is the digest of the zero-length string. (Though the Solaris Fingerprint Database reports that 10,247 different files match this fingerprint.) Empty files, like happy families, are all alike.
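Both digests just mentioned can be reproduced with a few lines of Python; this is only a sketch, and the file path and digest shown apply to the particular DLL version cited above.

```python
# Sketch: reproduce the canonical MD5 digests mentioned above with hashlib.
import hashlib

# Any conforming MD5 implementation reports this digest for empty input.
assert hashlib.md5(b"").hexdigest().upper() == "D41D8CD98F00B204E9800998ECF8427E"

def md5_of_file(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):   # hash in chunks
            h.update(chunk)
    return h.hexdigest().upper()

# md5_of_file(r"C:\windows\system32\LegitCheckControl.dll")
# yields "C3C3864DA698F0CC1BE56F9695534DD8" for the version cited above.
```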
If you run a utility such as Microsoft’s File Checksum Integrity Verifier (fciv) on the \windows\system32 directory, it spits out an MD5 for every file in the directory. The MD5 appears on the line before the filename, so the resulting output can be sorted by MD5 rather than by filename. Glancing at the sorted output reveals any matching MD5s, which represent duplicate files appearing under different names, as in Figure 5.
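In the same spirit as sorting fciv’s output, a short sketch can digest every file in a directory and group the results by MD5, so that duplicate files surface immediately (the directory name is only an example).

```python
# Sketch: find identically-contented files stored under different names
# (as in Figure 5) by grouping a directory's files by their MD5 digests.
import hashlib
import os
from collections import defaultdict

def _md5(path):
    with open(path, "rb") as f:     # fine for a sketch; large files could be chunked
        return hashlib.md5(f.read()).hexdigest().upper()

def duplicate_files(directory):
    by_md5 = defaultdict(list)
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            by_md5[_md5(path)].append(name)
    for digest, names in sorted(by_md5.items()):
        if len(names) > 1:          # same digest, different file names
            print(digest, *sorted(names))

# duplicate_files(r"C:\windows\system32")
```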
This is somewhat useful, but the slightest difference results in a nonmatch. Comparing MD5 file digests is like a straightforward binary comparison with the cmp utility, except that MD5s are more easily transported than the files themselves.
As a workaround for this over-exactness, Microsoft has suggested running its DumpBin utility with command-line options that ignore date/time stamps, and then comparing the resulting text output with a diff utility (http://support.microsoft.com/kb/q164151/). Microsoft also has a BinDiff (one of many programs with this same name) for this same purpose. However, these methods still treat the file as one big clump.
Any binary file can be turned into some sort of a text file (even if only a hexdump), then examined with a text utility. You may therefore want to know why, if we want to compare DLLs or EXEs, we don’t use diff or some other text-based tool. One reason is that diff compares entire lines, and (for reasons that will soon become clear) when comparing textual representations (such as disassemblies or hexdumps) of code files, we would only want to compare parts of a line, ignoring the rest of the line or treating it as a wildcard. A selective diff tool is easily built with languages such as Perl or Awk. A tool to slice a text file (extracting, say, only fields 3 and 4 from every line) is also easily built, and would help construct another tool that, indirectly via disassembly listings, allows two DLL or EXE files to be compared. Also, there are already binary diff utilities that will try to find the smallest possible set of edits that would turn one binary file into another (presumably a later version of the first); this idea is used, for example, in PocketSoft’s RTPatch (http://www.pocketsoft.com/whitepapers/rtpwhite.pdf).
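For example, here is a sketch of the field-slicing helper just described: it keeps only selected whitespace-separated fields from each line, so that the rest of each line is effectively treated as a wildcard. (The same job is a one-liner in Awk: awk '{ print $3, $4 }' listing.txt.)

```python
# Sketch: print only selected whitespace-separated fields (counted from 1)
# of every line, discarding the rest of each line.
import sys

def slice_fields(path, wanted=(3, 4)):
    with open(path) as f:
        for line in f:
            fields = line.split()
            picked = [fields[i - 1] for i in wanted if i <= len(fields)]
            if picked:
                print(" ".join(picked))

if __name__ == "__main__":
    slice_fields(sys.argv[1])    # e.g., slice a disassembly listing, then diff
```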
However, the goal here is not to find either the differences or the similarities between two files. In essence, the goal is to know, for a given file, which parts of it also appear in any of thousands of other files. This is not a job for diff or cmp.
Function Fingerprints
If digests can be created for entire files, then why not for the parts of a file? The file’s digest, then, would represent (though isn’t computed from) the set of digests of its component parts.
Nearly everyone with a computer, if only because they have either felt or been compelled to turn over their credit-card number to receive an antivirus update, is familiar with the idea of virus signatures (http://ftp.cerias.purdue.edu/pub/papers/sandeep-kumar/kumar-spaf-scanner.pdf and http://www.cs.wisc.edu/wisa/papers/issta04/issta.pdf). Often these signatures are verbatim transcriptions of a portion of a virus, where that portion (which may be data, such as a malicious text message, rather than code) hopefully belongs uniquely to the virus and nothing else. e7iqom5JE4z is part of most signatures for the AnnaKournikova virus, for instance, because that is the name of a function in the virus’s VBScript code. Increasingly, virus signatures are less literal, allowing for wildcards in the match.
Rather than locate an infection in a file, we will want to break a file down into its component parts and identify each of these parts. Still, there is some similarity between function signatures and virus signatures, particularly those that focus more on a virus’s structure than on its literal bytes. It is sad that, of the little empirical study of software that takes place (our field is much more focused on adding to the pile than on understanding what we already have), so much of it is focused on the viral work product of underemployed 17-year-old boys. One of the goals of the function database is to show that antimalware techniques can be extended to examine plain commercial code.
There are numerous ways that we might split a binary file into component parts. As one example, an article by Kris Coppieters describes a BinDiff utility (Dr. Dobb’s Journal, May 1995) that compares two binary files based on a dynamically selected file delimiter. This idea might be adapted so that, in a Windows executable, bytes corresponding to common instructions, such as MOV or PUSH, might make useful delimiters, marking off sections of more-interesting code in the way that ‘\n’ marks off lines in a text file; the byte representing a function return (RET) might also be used as a delimiter.
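A naive sketch of this delimiter idea appears below: it splits a byte string on the one-byte near-return opcode 0xC3 and digests each chunk. It is purely illustrative, since 0xC3 can also occur inside operands and other instructions, and real function boundaries need a disassembler (or the executable’s own structure) to find.

```python
# Naive sketch: split a code section on the one-byte RET opcode (0xC3) and
# digest each resulting chunk. Illustrative only; 0xC3 also appears inside
# operands, so this is not a reliable way to find function boundaries.
import hashlib

def chunk_digests(code_bytes):
    chunks = code_bytes.split(b"\xC3")
    return [hashlib.md5(chunk).hexdigest().upper() for chunk in chunks if chunk]

# with open("some.dll", "rb") as f:       # hypothetical file name
#     print(chunk_digests(f.read())[:5])
```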
However, when dealing with Windows software, the structure of the file is already known. Windows binary software resides in Portable Executable (PE) files. The file format is documented (http://www.microsoft.com/whdc/system/platform/firmware/PECOFF.mspx), and is supported by numerous tools, including the one I’ll rely on here, Clive Turvey’s DumpPE (http://www.tbcnet.com/~clive/dumppe.htm). DumpPE includes a disassembler that locates function starts (and, to a lesser extent, function ends) in a Windows program, and outputs the code for each function.
What we want, it seems, is simple – an MD5 digest for each function in a file. (I ignore the file’s data, such as strings, resources, and so on.)
As stated, though, this would be of little use, because the code in a binary program is usually position dependent. For example, a compiler and linker may transform the simple C source code for func in Figure 6 into the Intel x86 code represented in either Figure 7 or Figure 8, depending on the location of func within the entire program; this location depends on factors unrelated to the contents of func.
The second column of these disassembly listings (produced with DumpPE) shows the literal bytes of code that will be executed by the microprocessor. While a brief glance shows that Figures 7 and 8 are, as you would hope, darn similar, the bold areas show that the same source code in Figure 6, compiled in the same way with the same settings, results in slightly different bytes of code, depending on the function’s location in a file.
Forget that you’ve seen the source code; assume the binary code is all you have. And try to ignore the fact that it’s easy, when you’re looking right at them, to see the similarities between Figures 7 and 8. Think, instead, of the code in Figures 7 and 8 appearing as only two out of, say, three-million pieces of code. How then would you match them up?
Were a string to be constructed from the literal bytes in Figure 7 (55,8BEC,51,6A01,6830504000,…), it would differ from one constructed from the literal bytes in Figure 8 (55,8BEC,51,6A01,6838894000,…). MD5 digests generated from these strings would, of course, also differ. Digesting such strings, therefore, would not be helpful in attempting to ID either function against a database of known functions, or to match the two functions with each other.
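The point is easy to demonstrate on just the opening bytes quoted above (prefixes only; the full functions are longer). The two prefixes differ solely in the four-byte operand of the final PUSH, which embeds a location-dependent address; zero out that operand (done by hand here for this one instruction) and the two prefixes digest identically. A general tool would, of course, need a disassembler to know where operands sit.

```python
# Sketch: MD5 over literal bytes distinguishes two copies of the same code that
# differ only in an embedded address; masking the operand removes the
# difference. Only the byte prefixes quoted in the text are used here.
import hashlib

fig7 = bytes.fromhex("558BEC516A016830504000")   # PUSH EBP; MOV EBP,ESP; ...
fig8 = bytes.fromhex("558BEC516A016838894000")   # same code, different address pushed

print(hashlib.md5(fig7).hexdigest() == hashlib.md5(fig8).hexdigest())   # False

masked7 = fig7[:7] + b"\x00" * 4    # zero the 4-byte operand of the last PUSH
masked8 = fig8[:7] + b"\x00" * 4
print(hashlib.md5(masked7).hexdigest() == hashlib.md5(masked8).hexdigest())  # True
```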
What is wanted, then, is a way to characterize a function, without using the exact bytes that make up the function. Ideally, you would like the smallest possible representation of the function that uniquely distinguishes it from all other functions, but that at the same time encompasses minor variations of itself.
Calling an MD5 digest a “signature” or “fingerprint” is a misnomer in a way because, while smaller than the original, it is boiled down or digested from the entire original, all of its parts; whereas we normally think of a signature or fingerprint (except perhaps a so-called “DNA fingerprint”) not as some sort of homunculus of its owner, but as a small subset that stands in for the whole.
As one example of a subset that represents the whole, in a copyright-infringement case years ago involving software for Windows 3.x, I used the sequence of Windows API calls made by a function. Windows API usage is heavily stereotyped, with a small subset of the entire API set employed over and over in expected ways; in this case, though, one program used a sequence of obscure and/or undocumented Windows API calls, and it could be seen from the import table of the other file, even without disassembling, that it too used this same highly unusual sequence of API calls (http://www.sonic.net/~undoc/apple_ms.txt). In other words, the sequence of API calls issued by a function worked as a signature or fingerprint for the function itself. It would be too limiting to restrict ourselves to Windows API calls as the basis for function signatures, but this shows that some subset of a function can be extracted to characterize the function itself.
Besides sequences of unusual API calls, a function might be characterized using sequences of unusual assembly language instructions, or typical instructions used in unusual ways, or simply by ignoring the handful of instructions such as MOV that make up the bulk of code. Other techniques focus on the branch targets in a function, or the branches and calls that it makes. The general point is to move as far away from the literal binary code as possible, to avoid false negatives (missing cases where the overlying source code is the same or a trivial cut-and-paste modification), but not so far that there are unacceptably many false positives (ostensible matches found even when the overlying source code does not match).
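As one illustrative possibility (not necessarily the approach developed in the next installment), a signature could be digested from the sequence of instruction mnemonics alone, skipping the most common ones, so that operand bytes and addresses never enter into it. The sketch below assumes whitespace-separated disassembly lines of the form address, code bytes, mnemonic, operands; DumpPE’s actual output format may differ.

```python
# Sketch: characterize a function by the sequence of its less-common
# instruction mnemonics, ignoring operands (and hence embedded addresses).
# The assumed line format (address, bytes, mnemonic, operands) is an
# assumption for illustration, not DumpPE's documented output.
import hashlib

COMMON = {"mov", "push", "pop", "nop"}    # illustrative choice of what to skip

def mnemonic_signature(disasm_lines):
    mnemonics = []
    for line in disasm_lines:
        fields = line.split()
        if len(fields) >= 3:
            op = fields[2].lower()
            if op not in COMMON:
                mnemonics.append(op)
    return hashlib.md5(" ".join(mnemonics).encode()).hexdigest().upper()
```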
We want the function’s shape, its edges, not the function itself. As a hint of how I do this in the next installment of this article, look back at Figures 7 and 8, but try looking at them from a different perspective. Think about slicing a section through the code.