Finding Binary Clones with Opstrings & Function Digests: Part III
by Andrew Schulman
Dr. Dobb’s Journal, Sept. 2005
Can a few extracts from a piece of binary code be sufficient to identify that piece of code in a large database? As discussed in the two previous parts of this article, a piece of binary code’s noteworthy features can be extracted and placed into an opstring, an ordered set of instruction mnemonics or operations (“ops”). Noteworthy features include atypical instructions, system or API calls, branch-target locations (places the code may jump to), and the use of unlikely “magic” numbers.
In x86 Win32 code, one such opstring would be: “test, jz, call, test, jz, call, ret, loc, [HeapFree], loc, ret”. Using an algorithm such as MD5, this can be turned into a function digest such as 65E482BF6A2F391D84D243D9E02244D7. If the function’s name is known, either from debug symbols or from a Win32 DLL export table, the function digest and name can be added to a function database. The “test,jz,...” example and its MD5 signature happen to correspond to one implementation of the free C runtime library (RTL) function. If, in some other program, a function were encountered with a digest of 65E482BF6A2F391D84D243D9E02244D7, it could with a fair degree of certainty be identified as free. How much certainty is one of the subjects of this third and final article on binary code lookup and comparison.
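For illustration, here is a minimal sketch of the opstring-to-digest step, written (like the tools described later in this article) in AWK. It assumes a Unix-like environment where an external md5sum utility is available, and because the digest depends on the exact byte form of the opstring (separators, trailing newline, and so on), it will not necessarily reproduce the value quoted above:

# digest.awk -- hypothetical sketch: hash one opstring with an external md5sum
BEGIN {
    ops = "test,jz,call,test,jz,call,ret,loc,[HeapFree],loc,ret"
    cmd = "printf '%s' '" ops "' | md5sum"
    if ((cmd | getline line) > 0) {
        split(line, f, " ")        # md5sum prints "<hash>  -"
        print toupper(f[1])        # uppercase hex digest for this opstring
    }
    close(cmd)
}

Run as awk -f digest.awk; the real tools described below do the same thing in bulk, one digest per function.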
Generating additional symbolic information for otherwise anonymous code — using symbols from one file to ID code in a different file — can tell you, when staring at code in a debugger, that some function is the free memory-deallocation function; this beats trying to read and understand the code.
However, that is a fairly limited use. Given a large database with MD5 signatures for every function in every Win32 file preinstalled on a typical PC or on the Windows XP CD, you can measure the amount of code duplication in the system, classify modules by sorting them into sets based on how much code they share in common, identify code that should perhaps be moved into a shared library, or find multiple instances of flawed code.
Excluding Boilerplate
I had a different purpose when I started this project: Not to use or study the contents of the function database, but instead to filter it out as noise. The function database was intended merely as a list of “boilerplate” code — stereotyped verbiage — that could be ignored when comparing binary modules in copyright infringement litigation.
Almost all litigation happens outside a courtroom. At least in the U.S., the majority of legal cases never reach trial (the system would collapse if more than a handful did). Cases are settled based on each side’s estimation of the “value” of the case, determined in part by what each side learns about both sides to the case. Modern rules of “discovery” require each side to turn over to the other side (“produce”) certain types of information, such as potentially relevant internal corporate e-mails and other documents (for typical examples, see http://www.usdoj.gov/atr/cases/ms_exhibits.htm).
In software copyright litigation, this includes each side’s source code. Each side’s expert compares the two sets of source code, looking for literal similarities that shouldn’t be there, and for signs of nonliteral copying. (Each source tree may consist of millions of lines of code, so this process must be automated. It’s a bit harder than it sounds. Hints: diff is not the right tool; associative arrays are nearly essential.)
Even open-source software is faced with issues like these: What percentage does GNU represent of a typical Linux distribution? Is a commercial vendor reusing open-source code without abiding by the GPL (see gpl-violations.org)? Has a contributor used copyrighted code?
Binary comparison is useful in part because source is sometimes unavailable. Not surprisingly, litigants sometimes resist turning over their source code (“the crown jewels”) to the other side, even with court protective orders limiting who may see the source code. Also, a litigant only sees the other side’s source code after it becomes an actual litigant. Thus, a software copyright complaint will generally be issued before the plaintiff has seen the defendant’s source code. In the U.S., complaints can be filed “upon information and belief,” and can later be amended. Still, a plaintiff should try confirming their suspicions. Meanwhile, a potential defendant might also have concerns (“I wonder if my programmers have been ‘borrowing’ code from their previous jobs”). If and when both sides’ source code becomes available, the binary comparison is a double check, and may pinpoint specific source-code areas to investigate.
When detecting similarities in code — be it source or binary — there must be a baseline for comparison. Comparing any two otherwise-unrelated source-code trees, written in the same programming language and/or targeting the same operating system, yields some exact matches that nonetheless don’t reflect copying, such as return 0; and for (int i=0; i<count; i++) in C code. Such lines need to be excluded from any source-code comparison. They are, in the context of such a comparison, noise, mere boilerplate: Everyone uses these phrases, so their presence is uninformative. (However, while boilerplate code is generally excluded from such percentages, boilerplate code that is merely a small part of a larger block of nonboilerplate code is generally included in the overall matching percentage.)
You generally “normalize” each line of source code by ignoring all white space (though in one case, a developer had used combinations of spaces and tabs as Morse-coded copyright notices). If you suspect juvenile forms of plagiarism, you might normalize all variable names; one of the benefits of comparing binary code is that the compiler eliminates names for you. You probably would not exclude comments because shared misspellings and errors may indicate copying (see Bob Zeidman’s “Detecting Source-Code Plagiarism,” DDJ, July 2004).
Lines can be normalized as each line of code is inspected, but filtering out boilerplate is different: You can’t tell, just by looking at it, that something is standard verbiage. You need a boilerplate database, constructed at some earlier point by running through lots of other code — the baseline code you’re not going to compare — and pulling out the most frequently occurring lines, like int i; and return 0.
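As a rough sketch of how such a boilerplate database might be built (the field layout and the frequency threshold here are illustrative, not the actual tool used in these comparisons), an AWK one-pass counter over the baseline tree suffices:

# boilerplate.awk -- hypothetical sketch: collect frequently occurring,
# whitespace-normalized source lines from a baseline source tree
{
    line = $0
    gsub(/[ \t]+/, "", line)           # normalize: drop all white space
    if (line != "")
        count[line]++
}
END {
    for (line in count)
        if (count[line] >= 100)        # arbitrary threshold for "boilerplate"
            print count[line], line
}

Run, for example, as awk -f boilerplate.awk baseline/*.c > boilerplate.txt; the normalized lines in the output (return0;, inti;, and the like) would then be ignored when the two source trees under suspicion are compared.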
Similarly, binary-code comparison needs a baseline — a database of boilerplate functions that, if found in the programs under comparison, should be ignored. The original purpose of Opstring was to build this database, by generating MD5 digests for every function included with Windows XP. This database contains MD5 signatures for C RTL functions, compiler startup code, standard sample code from Microsoft’s software development kits, Visual Studio “wizard” code, and so on. Microsoft’s debug symbol packages (see http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx) provide names for the functions in nearly every XP DLL. Even without these names, while I don’t necessarily know what each function is, my binary-code copying-detection program knows to ignore any copies of that function encountered in the two sets of binaries under comparison. (Additional code may need to be excluded from consideration, even if it’s not found in the standard function database; see Computer Associates v. Altai.)
Does “Structurally Similar” Really Mean “Clone”?
As noted, the function database contains MD5 digests of opstrings, each of which is a string of the noteworthy features of a function. For example, the C source code in Figure 1(a), compiled to produce something like the code in Figure 1(b), yields the opstring in Figure 1(c) and the MD5 digest in Figure 1(d).
The instruction mnemonics have been extracted from the disassembly, the common mov, push, and pop instructions discarded, and the remaining mnemonics placed into the string, along with “loc” pseudo-ops indicating the targets of the jz and jnz instructions. The operands have been ignored.
While Figure 1(c) plainly abbreviates the code in Figure 1(b), it is less clear that Figure 1(c) truly represents Figure 1(b), much less its source in Figure 1(a). After all, wouldn’t a lot of other code, looking nothing like Figure 1(a) or Figure 1(b), also abbreviate to Figure 1(c)?
It turns out that the answer is no in this particular case. A database of about 775,000 function digests, representing most of the binary code in Windows XP SP2, did not contain a match for the function digest in Figure 1(d). Even having discarded so much of the function, what’s left are reasonably reliable indicators of the function’s identity.
Knowing that the opstring algorithm ignores mov, push, and pop, you could easily construct a function that differed from Figure 1(a) but that yielded the same opstring as in Figure 1(c). For example, Figure 1(e) appears to be a false positive, but only under a literal definition of “clone,” in which even obvious cut-and-paste variations are treated as nonmatches. As noted in Part II of this series, Opstring was written to ignore minor differences. In other words, this is a feature, not a bug. It is important to have a uniform method for generating the function database, but a testing version of Opstring (available electronically; see “Resource Center,” page 3) does provide command-line options to control what is included or excluded in an opstring, allowing for different definitions of “clone.”
If two opstrings match, then — given a minimum opstring length — the binary code from which they were generated (and, to a lesser extent, the overlying source code) will be sufficiently similar that someone eyeballing them would regard them, perhaps not as identical, but as a good match, such that one could be a cut-and-paste variation of the other. In other words, we are backing into a working definition of “clone,” based on what this tool can find. A binary-code “clone” can be defined as a match found by Opstring.
Obviously, not all tools are worth listening to. A “tool” that hashed any piece of code, of whatever length, to a single byte, would “tell” us which of the only 256 types of code in existence a given function fell into. Like classifying people by their astrological signs, it would provide only the illusion of usefulness. There’s a similar danger here, because we’re usually looking at compiler output (rather than hand-written assembler), and one can expect great regularity, even given infinite possible source-code input, because the compiler acts as a funnel, narrowing infinite varieties of source into a relatively small number of microprocessable structures.
Testing for False Positives
Where symbolic names are available, one test for false positives checks for ostensibly matching functions that have significantly different names. Multiple function names mapped to the same function digest (MD5 signature) may indicate that the opstrings from which the signatures are created are not good indicators of the function’s identity. In my first test, nearly half of the named signatures appearing more than once had more than one name. However, browsing through the test program’s output showed that nearly all the mismatches were C++ mangled/decorated names that differed from each other in only minor respects, such as ?Unlock@Perimeter@@QAEXXZ versus ?Unlock@RWLock@@QBEXXZ, or ?Remove@foo versus ?Update@foo versus ?Set@foo. These functions tended to be shorter than the system average. In many cases, the seemingly mismatched names were all located in a single file, indicating that indeed the functions were related, despite the different names. I also found many cases where, despite having used Microsoft’s symchk utility, PDB symbols and code were misaligned.
When I reran the test, this time looking only for cases where code appears in more than one file and where a given symbol is used in more than one file (to eliminate PDB misalignments), ignoring all nonalpha characters, and looking only at the first five characters of the function name, the result was still that almost 3 percent of the signatures had one or more names that apparently mismatched the others in their first five characters.
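A stripped-down sketch of this test follows; it assumes the function database has been flattened into lines of the form "digest name file" (an assumed layout, not the actual format of the electronic files), and it omits the multiple-file conditions just described:

# namecheck.awk -- hypothetical sketch: flag digests whose associated names
# disagree in their first five alphabetic characters
{
    digest = $1
    name = $2
    gsub(/[^A-Za-z]/, "", name)             # ignore all nonalpha characters
    prefix = substr(name, 1, 5)
    if (!((digest, prefix) in seen)) {
        seen[digest, prefix] = 1
        variants[digest]++                   # count distinct five-char prefixes
    }
}
END {
    for (digest in variants)
        if (variants[digest] > 1)
            print digest, variants[digest], "apparently mismatched names"
}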
Of the apparent mismatches, many were C RTL functions, such as isdigit, islower, and ispunct mapping to the same signature. That Opstring currently treats, say, islower and isupper as the same function may reduce its ability to supply exact symbolic names to disassembly listings, but does not undercut its ability to find code clones. Figure 2 shows what some of the remaining apparent mismatches look like (the first column is the opstring length).
Figure 2: Some function signatures with different names in different files
I did find one genuine mismatch, so surely there must be others, especially in signatures based on shorter opstrings. Figure 3(a) shows the offending opstring, Figure 3(b) lists a few of the 30 functions whose code abbreviates to this opstring; Figures 3(c) and 3(d) are two examples of the actual code. Even here, given that the code in Figure 3(c) is called “Cleanup” and calls CoTaskMemFree twice, and that the code in Figure 3(d) calls C++ operator delete (“YAXPAX” in Microsoft C++) twice, the functions perhaps aren’t totally unrelated. But clones, probably not.
Even though many of the mismatches are only apparent, some clearly are real; the 3 percent given earlier is probably a reasonable upper bound on the false-positive rate. This was based on a minimum opstring length of 12, and the false-positive rate can be reduced simply by requiring longer opstrings. A minimum length of 40 brings the false-positive rate down to about 1 percent.
This raises the question of the ideal minimum length for an opstring. The Opstring program discards any function whose opstring is shorter than the desired minimum. The average opstring length in XP is about 55. As seen in Figure 1, even a fairly small C function yielded an opstring whose length was 15. As a rough back-of-the-envelope figure, say there are about four “ops” per line of nonblank, noncomment C code. A minimum opstring length of 20 or 25 then represents a function definition, curly braces, some variable definitions, and five or six lines of actual code.
False Negatives
Different compilers can take the same source code and produce fairly different binary code. The same compiler, with different optimization settings, can do the same thing. Likewise, a single piece of source code that uses preprocessor macros or C++ templates can generate disparate binary code. Given a template<class T> class Stack, with Push and Pop methods, consider the difference between Stack<int>, Stack<float>, and Stack<class Foo>. The developer wrote Push and Pop once, but the compiler generates three different versions of each. To detect the underlying similarity among these three binary versions of Stack::Push, based on the instruction mnemonics, would likely require boiling away so many differences that unacceptably high false positives would result.
One approach to reducing false negatives is to match on something other than function boundaries. In addition to marking branch targets with the “loc” pseudo-op, Opstring could use them as delimiters for the blocks of code from which opstrings are generated. With smaller blocks of code to consider, naturally more matches would be found. But treating each branch-target block as a separate unit will produce many blocks that fall below the minimum length needed to avoid false positives. One solution is to consider both the functions as a whole and the branch-target blocks within them.
Another approach is to ignore boundaries entirely and look instead at the self-similarity within a file or group of files, using the DotPlot method (see “Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code,” by Kenneth Church and Jonathan Helfman; http://imagebeat.com/dotplot/rp.jcgs.pdf). DotPlot is typically a visual technique in which duplication appears as diagonal lines, but the diagonals can be produced nonvisually (and thus selected by another program) by comparing every element of a string with every other element, using nested for loops. Church and Helfman describe several easy-to-implement optimizations that allow most comparisons to be skipped, by ignoring high-frequency tokens (much as Opstring ignores high-frequency instruction mnemonics). The frequency with which tokens appear can also be used to do weighted matching.
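The unoptimized core of this nonvisual DotPlot is just a pair of nested loops over the ops; the sketch below (with an arbitrary minimum run length, and none of the Church/Helfman optimizations) reports each maximal diagonal run of matching ops within a single opstring:

# dotplot.awk -- hypothetical sketch: report duplicated runs of ops
# Usage: awk -v opstring="test,jz,call,..." -f dotplot.awk
BEGIN {
    n = split(opstring, op, ",")
    minrun = 12                            # arbitrary minimum run length
    for (i = 1; i <= n; i++)
        for (j = i + 1; j <= n; j++) {
            run = 0
            while (i + run <= n && j + run <= n && op[i + run] == op[j + run])
                run++
            # report only the start of a maximal run, not its sub-runs
            if (run >= minrun && (i == 1 || op[i - 1] != op[j - 1]))
                print "duplicate block of", run, "ops at positions", i, "and", j
        }
}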
All of this can be applied to binary code, once this code has been turned into some form of text, like an opstring. One of the benefits of opstrings, apart from generating MD5 digests, is that they allow textual methods such as DotPlot to be applied, in effect, to binaries. Using the DotPlot method and a hacked version of Opstring, I located large tracts of duplicated code, crossing function boundaries, in Windows programs. However, even with the Church/Helfman optimizations, the program was significantly slower than comparisons with MD5 digests.
Apart from redefining what gets matched with what, we can also allow nonexact matching. By modifying the Levenshtein edit-distance algorithm (see “Finding String Distances,” by Ray Valdes; DDJ, April 1992) to work with ops rather than characters, we could find identical functions by finding those whose opstrings have an edit distance of zero, then locate slight variations by selecting those pairs with small edit distances, for instance, only a small number of insertions, deletions, or substitutions needed to turn one opstring into the other. As with the DotPlot method, this too is significantly slower than comparing MD5 digests.
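A minimal sketch of that op-level edit distance, again in AWK (the two opstrings in the demonstration call are made up for illustration):

# editdist.awk -- hypothetical sketch: Levenshtein distance between two
# opstrings, counting insertions, deletions, and substitutions of whole ops
function opdist(s1, s2,    a, b, m, n, i, j, d, cost) {
    m = split(s1, a, ",")
    n = split(s2, b, ",")
    for (i = 0; i <= m; i++) d[i, 0] = i
    for (j = 0; j <= n; j++) d[0, j] = j
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            cost = (a[i] == b[j]) ? 0 : 1
            d[i, j] = d[i - 1, j] + 1                           # deletion
            if (d[i, j - 1] + 1 < d[i, j])
                d[i, j] = d[i, j - 1] + 1                       # insertion
            if (d[i - 1, j - 1] + cost < d[i, j])
                d[i, j] = d[i - 1, j - 1] + cost                # substitution
        }
    return d[m, n]
}
BEGIN {
    print opdist("test,jz,call,ret", "test,jnz,call,call,ret")  # prints 2
}

A distance of zero means the opstrings are identical; a small nonzero distance flags the near misses that exact digest matching would ignore.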
Implementation
The Opstring program (see opstring) is implemented in the AWK programming language, which simplifies text manipulation (if desired, the AWK code can be translated into Perl with a2p). Even though Opstring digests binary files, it’s a text-manipulation tool that depends on an external disassembler, DumpPE. AWK splits each line of text input into space-delimited fields, designated as $1, $2, and so on. The entire input line is $0, and the number of fields is NF. In DumpPE disassembly output such as in Figure 1(b), opcode bytes appear in $2, the instruction mnemonic in $3, and operands in $4. A function label or branch target appears in $2. To aid reading the code listing, Figure 4 presents pseudocode.
When Opstring sees a function boundary, it outputs the previously collected filename, function name, and opstring, then sets up for the next opstring. When Opstring sees an instruction, it adds it to the opstring. All else is commentary.
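The sketch below gives the flavor of that loop. It assumes the DumpPE field layout just described, assumes (for illustration only) that label lines have exactly two fields and that branch-target labels begin with "loc_", and leaves out API-call names, data detection, and the length checks discussed below; the real opstring.awk is available electronically.

# opstring core -- hypothetical, much-reduced sketch of the main loop
BEGIN {
    # mov/push/pop are too common to be informative; nop and int (INT 3)
    # are padding (a real version would check the int operand)
    skip["mov"] = skip["push"] = skip["pop"] = skip["nop"] = skip["int"] = 1
}

NF == 2 && $2 ~ /:$/ {                     # label line: function or branch target
    if ($2 ~ /^loc_/) { ops = ops "loc,"; next }    # branch target pseudo-op
    flush()                                # new function starts: emit the old one
    fname = $2
    sub(/:$/, "", fname)
    next
}
NF >= 3 && !($3 in skip) { ops = ops $3 "," }       # keep noteworthy mnemonics

END { flush() }

function flush() {
    if (ops != "" && fname != "") {
        # emit in the record format MkMD5Db expects (exact '!' layout assumed);
        # FILENAME is "-" or empty when the listing arrives on a pipe
        print "! " FILENAME " " fname
        print ops
        print ""
    }
    ops = ""
}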
Opstring could work directly on PE files, but it makes more sense, and not only for ease of implementation, to simply create a wrapper around an existing disassembler, DumpPE (http://www.tbcnet.com/~clive/dumppe.htm). While Opstring is currently overly tied to the DumpPE output format in particular, and Wintel code in general, there is little reason it couldn’t be generalized for other disassemblers and other instruction sets. (Java and .NET IL come to mind, though there are decent decompilers for both of these, allowing somewhat higher level comparison methods.)
Microsoft’s dumpbin is an obvious choice. Its -disasm option produces reasonable disassemblies of Win32 executable files, and newer versions use symbols from a PDB file. However, its output format is inconvenient to work with (each opcode byte is followed by a space, so the field position of the instruction mnemonic varies from line to line), and, more importantly, it does not identify branch targets. To use Opstring with dumpbin output, which is similar to that of the Linux objdump utility, I’ve written a converter (available electronically) that makes a first pass over a listing, looking for calls and jumps, to construct a pseudosymbol table that is then used to output function and branch-target labels on the second pass.
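A skeletal version of such a converter might look like the following; the exact dumpbin line format, address widths, and label spellings are assumptions here, and the converter available electronically is more careful. It is run with the listing named twice, so that AWK makes two passes over it:

# dumpbin2labels.awk -- hypothetical two-pass sketch
# Usage: awk -f dumpbin2labels.awk listing.txt listing.txt
FNR == NR && /(call|jmp|j[a-z]+)[ \t]+[0-9A-Fa-f]+[ \t]*$/ {
    if ($0 ~ /[ \t]call[ \t]/) functarget[$NF] = 1    # call target: function
    else loctarget[$NF] = 1                           # jump target: branch label
    next
}
FNR == NR { next }                                    # rest of pass 1: ignore
{
    addr = $1
    sub(/:$/, "", addr)                               # dumpbin addresses end in ':'
    if (addr in functarget) print "func_" addr ":"    # emit pseudosymbol labels
    else if (addr in loctarget) print "loc_" addr ":"
    print                                             # then the original line
}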
IDA Pro from DataRescue (http://www.datarescue.com/) is the gold standard in disassemblers. Its numerous features already include one of the side benefits that I’ve listed for opstrings and function digests — identification of runtime library functions. IDA Pro’s FLIRT (“Fast Library Identification and Recognition Technology”) identifies functions using the first 32 bytes of a function, marking all variant bytes, and a CRC of the remaining bytes. Provision is made for the fact that functions such as _strupr and _strlwr share identical code. An IDA Pro external plug-in could presumably employ a database of opstring-style function signatures.
However, IDA Pro is designed largely for interactive use (it even includes a runtime debugger to help with encrypted or obfuscated code that is difficult to untangle in a static listing). For building the database itself, which is the purpose of Opstring, interactivity is less useful: Thousands of disassemblies are generated, briefly used by another program, and then discarded. The end result is the function signature.
Clive Turvey’s DumpPE is a good choice as the supplier of function names, branch targets, and instruction mnemonics to Opstring. It makes two passes over the code to find branch targets, uses symbols from a PDB file, identifies functions and branch targets that are indirectly accessed via pointer tables, and identifies the names of Windows API calls.
A key problem is finding the start and end of each block to be compared, which generally will be a function. Opstring relies on DumpPE to find function starts; DumpPE gets these from entry points (especially from the PE file’s export table), from any debug symbols, and from its first pass through the code.
Finding function ends is more difficult; in the general case, it amounts to the halting problem. As seen in the pseudocode, the program doesn’t find function ends at all. It simply relies on seeing the next function, or the end of the disassembly. Function returns or unconditional jumps do not necessarily signal the end of a function, because code often contains multiple returns or jumps per function.
Because Opstring continues to add to the current opstring as long as it thinks it is inside the same function, it is crucial to filter out as much junk as possible. As seen in the pseudocode, Opstring ignores NOPs, debugger invocations (INT 3), data that looks like code, and anything that would push the opstring past a reasonable maximum length. Any opstring that doesn’t contain at least one jmp or ret is discarded, and any opstring that hits the maximum length is chopped back to the location of its first jmp or ret.
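Those last two rules reduce to something like the following helper (a sketch only; it assumes comma-separated opstrings and a plain "ret" mnemonic, where a real disassembler might emit "retn"):

# cleanup sketch: apply the jmp/ret rules to a finished opstring;
# maxlen is the cap at which collection stopped
function cleanup(ops, maxlen,    n, parts, i, out) {
    sub(/,$/, "", ops)
    if (ops !~ /(^|,)(jmp|ret)(,|$)/)       # no jmp or ret anywhere: discard
        return ""
    n = split(ops, parts, ",")
    if (n < maxlen)                         # did not hit the cap: keep as is
        return ops
    out = ""                                # hit the cap: chop back to the
    for (i = 1; i <= n; i++) {              # first jmp or ret
        out = out (i > 1 ? "," : "") parts[i]
        if (parts[i] == "jmp" || parts[i] == "ret") break
    }
    return out
}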
When Opstring has generated the opstrings for a file, they can be turned into MD5 digests with the MkMD5Db program (available electronically). This is little more than a wrapper around the standard MD5Init, MD5Update, and MD5Final functions originally written by Colin Plumb in 1993. MkMD5Db expects the name for an item specified on a line beginning with ‘!’, followed by one or more lines to be digested, and a blank line. Opstring, of course, outputs this format.
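A record for the free example from the start of this article might therefore look something like the two lines below, followed by a blank line (the exact form of the “!” line, including whatever identifying name Opstring attaches, is illustrative only):

! msvcrt.dll free
test, jz, call, test, jz, call, ret, loc, [HeapFree], loc, ret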
Preliminary Results
I noted in Part I that while this technique can be used for all sorts of things, these articles only construct the technique itself, deferring its actual use to some later time. While it is in the true spirit of software to construct tools and then move on to something else, without putting the tool to much use, it would be good to glimpse this hammer actually encountering some nails.
I took a Windows XP CD, expanded all files in the \I386 directory, and then discarded those that were not Win32 PE files. The result was 11,913 files (primarily DLLs), totaling 304 MB. I ran the following command on this directory:
for %f in (\pefiles\*.*) do dumppe -disasm %f | awk -f opstring.awk | mkmd5db >> xp_i386.db
This ran in about half an hour on a cheap laptop computer. The resulting file contained just over 500,000 function signatures, of which 44,000 distinct signatures appeared more than once (each was used 3.8 times on average; thus, about 123,000 function instances were duplicates). The small program funcdupe.awk (available electronically) pulls an entire function database into an associative array and then generates these numbers. It also factors in opstring lengths to calculate a duplication percentage. For Windows XP, the result was 17 percent.
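In outline, the calculation looks something like this; the input layout (one line per function, with the digest in the first field and the opstring length in the second) and the weighting formula are assumptions for illustration, and the real funcdupe.awk is available electronically:

# funcdupe sketch -- count duplicated signatures and a length-weighted
# duplication percentage
{
    count[$1]++                 # occurrences of each MD5 signature
    len[$1] = $2                # opstring length for that signature
    total += $2                 # total ops across all function instances
}
END {
    for (sig in count)
        if (count[sig] > 1) {
            dupsigs++
            dupops += (count[sig] - 1) * len[sig]   # the extra copies
        }
    printf "%d signatures appear more than once\n", dupsigs
    if (total > 0)
        printf "duplication: %.1f%%\n", 100 * dupops / total
}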
Considering that this includes startup and runtime library code that is included in many programs, 17 percent doesn’t sound very high. On the other hand, considering that Windows is built around shared libraries (DLLs), and is supposed to comprise the very operating system itself, and not merely a loose collection of applications, this duplication rate could also be viewed as high. One study of source-code duplication in large systems reports a range (excluding comments and blank lines) from 8.7 percent in GCC, to 29 percent in a web-based message board written in Python, to 36.4 percent for a Smalltalk database server, and 59 percent for a COBOL payroll system (“A Language Independent Approach for Detecting Duplicated Code,” by Stephane Ducasse et al., http://www.bauhaus-stuttgart.de/clones/Duploc_ICSM99_IEEECopyright.pdf).
Previous Work
Given that cut and paste comprises a lot of what developers do, it is not surprising that clone detection is a rich field of research, and even has its own annual “Detection of Software Clones” workshop, held in conjunction with the IEEE’s annual Working Conference on Reverse Engineering (WCRE). For an excellent overview, see “Detecting Duplicated Code” in Object-Oriented Reengineering Patterns, by Serge Demeyer et al. (Morgan Kaufmann, 2002).
Over 10 years ago, Brenda Baker described “parameterized string matching” for the purpose of finding sections of code that are identical except for a systematic change of parameters. Initially this involved source rather than binary code, but has been extended to Java bytecodes (see http://cm.bell-labs.com/who/bsb/). Other key contributions are discussed in Parts I and II of this article.
Inspecting binary executable files has been particularly important to the antivirus industry. Binary signatures, with provision made for wildcards, are used to scan for known viruses (see The Art of Computer Virus Research and Defense, by Peter Szor; Addison-Wesley Professional, 2005). It seems odd that the software used by millions every day isn’t analyzed with the same rigor as malware. In large part, Opstring takes malicious-code analysis, as practiced by antivirus researchers, and tries to appropriate it for the analysis of commercial software.
[Added Feb. 2017: see also work being done in the area of “Big Code,” which, while aimed mostly at open source, includes bulk mining of binaries; see in particular the BinSim study of the “statistical similarity of binaries.”]