Online searching of Apple OSX and iOS binaries

An earlier post notes some examples of “deep indexing” of the textual contents of commercial software products:

Such deep indexing of binary code files has been done in some limited areas, such as the superb PDP-10 software archive at in which files have been extracted from tape images, each file given its own web page, and contents of executable files included on the page, enabling a Google search for strings. See also sites such as which, for a variety of reasons, dump strings from Windows executable files (EXEs, DLLs, etc.) onto web pages, which are then indexed by Google (see e.g. Google search for “CEventManagerHelper::UnregisterSubscriber()  : m_piEventManager->UnregisterSubscriber()”).

Another example is the online posting of Objective-C header files, extracted from Apple OSX (Mac) and iOS (iPhone, etc.) binary/object files.

Read More »

Posted in Uncategorized | Comments closed

Patent examiners on software prior art, at crowdsourcing site

The White House recently announced the US PTO’s launch of “Ask Patents” (a forum at the “Stack Exchange”) as a crowdsourcing platform to identify prior art.

Right now, the forum seems to mostly have general questions and answers. There are several interesting Q&As, in which patent examiners explain that they do not consider software itself (e.g. open source) when searching prior art. See and

One examiner explains, “It’s very tough to map a plain-english statement to a block of code in a way that will convince the attorney/applicant that it’s truly invalidating.”

That is what source-code examiners and experts do every day in software patent litigation. But with current systems, there isn’t time for examiners at the PTO to do this type of search. This is in part because, “First, the search tools that we examiners have are tuned for searching natural language, not source code, so it’s far easier to find natural-language prior art than source code prior art.”

That a rigorous code/claim comparison would take too much time during patent examination is consistent with Lemley’s theory of “rational ignorance at the patent office”: most patents will not be exercised, so defer the tough validity examination until litigation.

But one of these posts at the crowdsourcing site also states that many PTO software-patent examiners lack the skills or training to do this sort of code vs. claims comparison: “it’s far easier to find natural-language prior art than source code prior art. And your question assumes that most patent examiners who handle software-related applications are proficient at reading source code. Most of us are not….”

Further, “Even if I am absolutely sure that a certain program has implemented a procedure that’s being claimed, and even if I have access to the source code of that program, and even if I am able to establish a clear prior art date of that source code, and even if that source code is written in a programming language I am comfortable reading, I still am very unlikely to cite that source code as prior art. The people that we write for (attorneys and other patent examiners) rarely have experience reading source code, so it takes even longer to explain the code than just cite a source that explains it in natural language; a better document is something like an API reference or software documentation.”

There are several possible answers. One would be to use an auto-documentation system such as Doxygen to create more readily citable references from open source.

Posted in Uncategorized | Comments closed

US National Software Reference Library (NSRL) and Software Patent Prior Art, Part 2

The previous post discussed the US government’s National Software Reference Library (NSRL), a collection of 15,000 commercial products, currently indexed by files (hash, filename.ext) comprising each product.

The post posed the question whether the NSRL could be used as the basis for a library of software prior art, usable by examiners at the US Patent & Trademark Office (PTO), and by software-patent litigants.

Such a database is needed. A major complaint regarding software patents is that the currently-searched sources of prior art are inadequate: “prior art in this particular industry may simply be difficult or, in some cases, impossible to find because of the nature of the software business. Unlike inventions in more established engineering fields, most software inventions are not described in published journals. Software innovations exist in the source code of commercial products and services that are available to customers. This source code is hard to catalog or search for ideas.” (Julie E. Cohen and Mark A. Lemley, Patent Scope and Innovation in the Software Industry, 89 Cal. L. Rev. 1, 13 (2001)). Query: has the situation improved since 2001? Is a larger percentage of software innovation now published in journals and patents?

Open source is an obvious software prior-art resource, and good full-text indexes of open source already exist. On the other hand, proprietary source code may not even constitute prior art, which must by definition have been publicly accessible at the relevant time. Thus, to meet the need for a database of software prior art based on something other than descriptions in published journals or previously-issued patents, it is appropriate to consider a library based on whatever text can be gleaned from publicly-accessible commercial software.

To be useful as patent “prior art,” such a library would require indexing the contents of the files. The previous post indicated that commercial software products often contain useful-looking text. Such text includes source-code fragments from “assert” statements, debug statements left in the product, error messages, dynamic-linking information, C++ function signatures, and so on. Some further examples, beyond those shown in the previous post, include the following, found inside a small sample of Windows dynamic-link libraries (DLLs) known to be part of NSRL:

  • “TCP and UDP traffic directed to any cluster IP address that arrives on ports %1!d! through %2!d! is balanced across multiple members of the cluster according to the load weight of each member.” [found in netcfgx.dll]
  • “This DSA was unable to write the script required to register the service principal names for the following account which are needed for mutual authentication to succeed on inbound” [adammsg.dll]
  • “Consider either replacing these auditing entries with fewer, more inclusive auditing entries, not applying the auditing entries to child objects, or not proceeding with this change.” [aclui.dll]
  • “Transformed vertex declaration (containing usage+index D3DDECLUSAGE_POSITIONT,0) must only use stream 0 for all elements. i.e. When vertex processing is disabled, only stream 0 can be used.” [d3d9.dll]
  • “On receiving BuildContext from Primary: WaitForSingleObject Timed out. This indicates the rpc binding is taking longer than the default value of 2 minutes.” [msdtcprx.dll]

These strings appear to contain patent-relevant information: terminology such as cluster, balanced, load weight, DSA (likely Directory System Agent), script, mutual authentication, inbound, auditing entries, child objects, transformed vertex declaration, RPC (remote procedure call) binding, and so on.

But do the binary/object files comprising commercial products contain enough such material to be worth indexing? And is such material of the type that would be helpful to someone searching for patent prior art?

(Whether commercial software uniquely contains such information, i.e., whether this is a good source to supplement existing sources such as previously-issued patents and published patent applications, academic literature, and so on, or whether such publications already contain what we would find inside code files, is a different question which will be addressed in a later post. Another question to be addressed later is whether additional text, while not verbatim present in binary/object code files, can be safely imputed to such files, based for example on the presence of “magic” numbers, such as GUIDs or UUIDs, module ordinal numbers, and the like.)

To test whether such files contain readily-available text, of the type useful to examiners at the PTO or to patent litigants seeking out prior art, one needs to know what sorts of searches they would be doing. These searches are typically based on patent claim language, and contain the “limitations” (elements of a device or system, or steps of a method) of the claim, together with likely synonyms or examples of each limitation.

Taking as an example an unusually short software patent claim:

  • “16. A method for processing metadata of a media signal comprising: embedding metadata steganographically in the media signal, wherein the metadata in the media signal includes a location stamp including: marking an event of processing the media signal with the location stamp.” [US 7,209,571 claim 16]

One looking for prior art to this claim would likely search for sources containing all of the following terms and/or synonyms or examples for each term (another complaint about software patents is that the industry lacks standardized terminology):

  • metadata
  • steganographic
  • media signal
  • location stamp
  • processing
  • marking

The search would likely give more weight to less-common terms (here, “steganography”). The search would be carried out across previously-issued patents and patent applications, and printed publications.
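The idea of giving more weight to less-common terms can be sketched with a toy inverse-document-frequency (IDF) calculation. The document-frequency numbers below are hypothetical, invented purely for illustration, not drawn from any real corpus:

```python
import math

# Hypothetical document frequencies: how many documents, out of a
# notional 1,000,000-document prior-art corpus, contain each claim term.
DOC_FREQ = {
    "metadata": 120_000,
    "steganographic": 900,
    "media signal": 15_000,
    "location stamp": 4_000,
    "processing": 400_000,
    "marking": 60_000,
}
N_DOCS = 1_000_000

def idf(term):
    """Inverse document frequency: rarer terms get higher weight."""
    return math.log(N_DOCS / DOC_FREQ[term])

weights = {term: idf(term) for term in DOC_FREQ}
rarest = max(weights, key=weights.get)
print(rarest)  # the rarest term dominates the ranking
```

With these made-up counts, “steganographic” receives roughly seven times the weight of “processing,” which is the effect a prior-art search engine would want.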

(As a quick test, one might try a Google search for “metadata steganography media signal location stamp process marking”. Google however does not currently index code files found on the web — not even readily-readable ones, such as *.js files, much less binary files — though there are web sites which do extract strings from some binary code files, and these sites are, in turn, indexed by Google.)

So, how to systematically test whether commercial code files (Windows exe/dlls, Unix .so files, iPhone .ipas, etc.) contain this sort of information, matching the sorts of terminology found in software patent claims?

As a preliminary test, one can take a large number of software patents, extract the claims, find out what words appear in these claims, and then see if these words also appear in commercial code files.

While there is no universally-agreed-upon standard definition of what constitutes a software patent, I randomly selected 2,000 patents from US patent classes 703, 705, 707, 709, 712, 717, 719, and 726. (I will later do a similar test using published patent applications, rather than granted patents, as a fairer test of what examiners at the US PTO would be working with.) The median patent number was 7,058,725, from 2006. I extracted independent claims from these 2,000 patents (using software which will be described in a later blog post). Individual words were then counted; a better test would break the claims into multi-word terms, by parsing along punctuation and along the most-common short words. A better test would also count the number of patents in which a word appears, rather than total occurrences of each word. Results of the quick test done here:

  1. Of course, the most-frequent words include those common to any English text: the, of, a, to, and, in, for, etc.
  2. Next are words common to any patent claim: said, wherein, claim, method, comprising, plurality, device, etc.
  3. Next are generic words common to any software patent: system, information, computer, network, program, memory, storage, server, processing, request, object, database, message, application, address, instruction, etc.
  4. After quite a bit of these generic terms, we finally get to more specific terms: cache, command, bus, receive, agent, security, link, vector, threshold, encrypted, tree, domain, channel, thread, token, browser, stack, etc.

It is the words in the 4th group which can now be sought out in strings of text extracted from binary/object code files. This test will not include the synonyms or examples for which a search would likely also be performed, nor will it consider translation between software patent terms (e.g. “socket address”) and programming terms (e.g. “sockaddr”).
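The word-counting step can be sketched as follows; the two claim texts below are invented stand-ins for the independent claims actually extracted:

```python
import re
from collections import Counter

# Invented stand-ins for independent claim texts; the real test ran
# over claims extracted from 2,000 randomly selected patents.
claims = [
    "A method comprising: receiving a request at a server; storing "
    "the request in a cache; and balancing load across a cluster.",
    "A system comprising a processor and a memory, wherein the memory "
    "stores an encrypted token associated with a browser session.",
]

word_counts = Counter()
for claim in claims:
    # Lowercase and split on anything that is not a letter.
    word_counts.update(re.findall(r"[a-z]+", claim.lower()))

# Frequent words are English and claim boilerplate ("a", "the",
# "comprising"); specific terms like "cache", "token", and "cluster"
# sit in the long tail.
print(word_counts.most_common(3))
```

Even this toy run reproduces the pattern described above: boilerplate dominates the head of the frequency list, and the searchable technical vocabulary appears only once or twice per claim.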

Because individual patent-claim words often appear within a single code word (e.g., “accounts” and “file” appear within the single identifier “CreateAccountsFromFile”), each patent-claim word was matched against the entire line of code text. Large regular expressions were created from blocks of the words appearing in the selected patent claims (e.g., “voicemail|voicexml|volatile|…”). Each regular expression was then run against each unique line of text extracted from the sample of 9,900 Windows DLL files. A count was made of the number of regular expressions matched, and those strings matching four or more different regexes were then printed.
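The matching step can be sketched like this; the word blocks, the sample strings, and the cutoff of 3 are all illustrative (the real test used many more blocks and a cutoff of 4):

```python
import re

# Illustrative blocks of patent-claim words, compiled as alternations.
# The real test used large blocks covering all selected claim words.
WORD_BLOCKS = [
    r"cache|cluster|token|thread",
    r"encrypt|vector|threshold|stack",
    r"security|domain|channel|browser",
    r"balanc|audit|bind|socket",
]
REGEXES = [re.compile(block, re.IGNORECASE) for block in WORD_BLOCKS]

def block_hits(line):
    """Count how many distinct word-blocks match anywhere in the line."""
    return sum(1 for rx in REGEXES if rx.search(line))

strings_from_dll = [
    "CreateAccountsFromFile",
    "TCP traffic is balanced across the cluster per-member threshold",
    "secure channel: encrypted token bound to browser thread stack",
]

# Keep lines that match several different blocks.
useful = [s for s in strings_from_dll if block_hits(s) >= 3]
print(useful)
```

Note that matching substrings against whole lines deliberately catches claim words embedded inside identifiers, at the cost of some false hits (e.g., “balanc” inside “unbalanced”).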

The results comprised 130,000 different lines of text, out of 1,529,718 unique strings extracted from the 9,900 sample DLLs. In other words, in this simple and simplistic initial test, about 10% of the extracted strings were potentially useful in a patent prior-art search. An improved test would likely both raise and lower this 10%. Raise, because additional useful text could be found by using other techniques more sophisticated than simply running the “strings” utility. Lower, because some of the found strings are junk.

My sample of 9,900 DLLs known to be part of NSRL only represents about 1% of the 811,000 unique DLLs in the NSRL. On the other hand, there is likely less duplication of contents within files in my sample, as it included only one copy of a given filename e.g. kernel32.dll.

The 130,000 strings included those shown earlier in this post; those were cherry-picked. Randomly selecting ten lines, we see:

  • An outgoing replication message resulted in a non-delivery report.%nType: %1%nMessage ID: %2%nNDR recipients: %3 of %4%n%5
  • The vertical position (pixels) of the frame window relative to the screen/container.WW=
  • socket %d notification dispatched
  • CVideoRenderTerminal::AddFiltersToGraph() – Can’t add filter. %08x
  • river util object failed to AddFilteredPrintersToUpdateDetectInfos with error %#lx
  • Software\Microsoft\Internet Explorer\Default Behaviors
  • PredefinedFormatsPopup
  • Indexing Service Catalog Administration ClassW1
  • ?pL_CaptureMenuUnderCursor@@3P6GHPAUstruct_LEAD_Bitmap@@PAUtagLEADCAPTUREINFO@@P6AH01PAX@Z2@ZA

The last two are C++ function signatures, the first of which can be automatically translated into:

  • int (__stdcall* pL_CaptureMenuUnderCursor)(struct struct_LEAD_Bitmap *,struct tagLEADCAPTUREINFO *,int (__cdecl*)(struct struct_LEAD_Bitmap *,struct tagLEADCAPTUREINFO *,void *),void *)

Since the extracted strings as a whole represented about 10% of the content of the underlying DLL files, and since it appears that about 10% of those extracted strings are potentially useful in a prior-art search, a very rough estimate is that 1% of the contents of code files (at least of this type, DLLs for Microsoft Windows) would be useful. As noted earlier, the tested DLLs represent about 1% of NSRL’s collection of DLL files (though likely with less duplicated contents of files, e.g., only one version of a file named kernel32.dll). Thus, the 130,000 potentially-useful strings in this test may represent about 10 million such strings among NSRL’s DLL files.

These DLL files comprise about 2.25% of the total 36 million different files in NSRL’s collection; however, as noted in the previous post, the bulk of the NSRL files are not code but media. Further, some code files may be less “chatty” than Windows DLLs.

On the other hand, the information extraction conducted in this test was bare-bones; there are many additional ways to extract information from binary/object code files. For example, such files often contain “magic” numbers which are readily turned into text indicative of a protocol or service employed by the code. An example was shown above of turning a “mangled” C++ function signature into a “demangled” version which looks like source code. Another example is associating GUIDs and UUIDs with names of services. Many of the DLL files in the test done here could easily have been supplemented with matching debug information (PDB files) from Microsoft’s Symbol Server (this will be shown in a future blog post).
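As a sketch of the GUID-to-name idea: known GUIDs appearing in extracted strings can be replaced with readable service or interface names. The lookup table below is illustrative; its one real entry is the standard COM IUnknown interface ID:

```python
import re

# GUID pattern as it typically appears in registry-style text.
GUID_RX = re.compile(r"\{[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}")

# Tiny illustrative lookup table; a real one would be built from
# platform SDK headers and registry data. The entry below is the
# well-known COM IUnknown interface ID.
GUID_NAMES = {
    "{00000000-0000-0000-C000-000000000046}": "IUnknown",
}

def name_guids(line):
    """Map any known GUIDs found in an extracted string to their names."""
    return [GUID_NAMES.get(m.group().upper(), m.group())
            for m in GUID_RX.finditer(line)]

line = "interface {00000000-0000-0000-c000-000000000046} registered"
print(name_guids(line))  # the GUID resolves to "IUnknown"
```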

Some conclusions from these very preliminary findings will be drawn in the next blog post.

Posted in Uncategorized | Comments closed

US National Software Reference Library (NSRL) and Prior Art, Part 1

I’ve been looking into the possible use of the US National Software Reference Library (NSRL), maintained by the National Institute of Standards and Technology (NIST), as a library of software prior art. Such a library would be useful both to the US Patent & Trademark Office (PTO) and to patent litigators.

The original purpose of the NSRL was largely to provide a set of hashes of known files, so that a criminal investigator examining a computer can know which files do NOT need to be examined. However, NSRL is moving beyond this to “digital curation,” for example of a Stanford University Library collection of 15,000 software products from the early days of microcomputing. In contrast to their current storage in boxes and indexing only by product name (which is consistent with most library software archives), NSRL is performing file-level cataloging of the collection.

The next step would be to index the contents of the files themselves. Software binary/object code files often contain useful strings of text, relevant for example to patent prior-art searching. Such “deep indexing” or data mining of code file contents is a goal of the “CodeClaim” project (to be described in a forthcoming blog post).

Such deep indexing of binary code files has been done in some limited areas, such as the superb PDP-10 software archive at in which files have been extracted from tape images, each file given its own web page, and contents of executable files included on the page, enabling a Google search for strings. See also sites such as which, for a variety of reasons, dump strings from Windows executable files (EXEs, DLLs, etc.) onto web pages, which are then indexed by Google (see e.g. Google search for “CEventManagerHelper::UnregisterSubscriber()  : m_piEventManager->UnregisterSubscriber()”).

The core NSRL product is a hashset of 36,108,465 file hashes, listing one example of every file in the NSRL. For example, ten copies of the exact same file contents will share a single MD5 hash, even if each of the files has a different filename or file date, or came from different sources. NSRL calls this the “minimal” hashset. It is a file named NSRLFile.txt, about 4 GB in size, contained in a 2.4 GB zip file available from the NSRL downloads page.

Entries in NSRLFile.txt look like this:

  • "SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
  • "00000DE72943102FBFF7BF1197A15BD0DD5910C5","AD6A8D47736CEE1C250DE420B26661B7","7854257F","PROGMAN.EXE",182032,10912,"358",""

Note that file dates are not included. Of course, the same exact file contents could be associated with different file dates, just as the same file contents can be associated with different file names. Dates of various types (OS file-system create and write dates, copyright-notice dates within files, linker dates within files) are of course crucial for a prior-art library. A method of associating dates with files will be noted later.

The collection contains media files (*.gif, *.wav, *.jpg). Crucial for a collection of prior art software, it also contains binary/object code files, for example:

  • "0000046FD530A338D03422C7D0D16A9EE087ECD9","680CA0BCE1FC7BC4136ADF4E210869C5","277D6BD5","TokenTypes.class",2075,20318,"358",""
  • "00000DE72943102FBFF7BF1197A15BD0DD5910C5","AD6A8D47736CEE1C250DE420B26661B7","7854257F","PROGMAN.EXE",182032,10912,"358",""
  • "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC7A46FB","2D729A1E","cmnres.pdb.dll",76800,10055,"358",""

A test of file extensions (not a guaranteed method to determine file type, but close enough for current purposes) in NSRLFile.txt provides a sense of what’s currently in the NSRL:

  • Many of the 36 million files are images (3.9 million GIF, 1.3 million JPG, 0.95 million PNG)
  • Files are predominantly from Microsoft Windows
  • A little over 1.2% are marked as “Linux”
  • There are files marked as “MacOSX”, “Mac OS 9+”, etc., but these do not appear to include binary code files (e.g., FaceTime)
  • There appear to be few mobile application files, e.g. *.ipa, *.apk
  • Many of the files are archive files, e.g. *.gz, *.zip, *.cab
  • Many of the files are compressed installers, e.g. *.msi, *.dmg; note that NSRL has researched “smart unpacking” of files
  • Many of the files are still compressed using Microsoft KWAJ, e.g. *.dl_, *.ex_
  • The most-frequently-occurring binary code file extension is *.class (Java), with 1.9 million different files
  • There are 811,468 different files with the extension .DLL (dynamic link library files for Windows)
  • There are 295,870 different files with the extension .EXE (Windows executables, possibly with some older DOS EXEs)
  • There are many different versions of code files with the same name, e.g. 835 different files (different MD5 hashes) of files with “kernel32.dll” in the name
  • There are many text files which contain (or potentially contain) source code, including 3.8 million HTML files, and about 1.7 million C/C++ files.
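An extension tally of the sort above can be sketched with Python’s csv module; the rows below are abbreviated versions of the NSRLFile.txt entries shown earlier (hashes truncated for readability):

```python
import csv
import io
from collections import Counter

# A few rows in NSRLFile.txt format: header plus abbreviated entries.
sample = io.StringIO(
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"0000046F...","680CA0BC...","277D6BD5","TokenTypes.class",2075,20318,"358",""\n'
    '"00000DE7...","AD6A8D47...","7854257F","PROGMAN.EXE",182032,10912,"358",""\n'
    '"00000FF9...","A5D49D6D...","2D729A1E","cmnres.pdb.dll",76800,10055,"358",""\n'
)

reader = csv.DictReader(sample)
ext_counts = Counter()
for row in reader:
    name = row["FileName"]
    # Extension = text after the last dot, lowercased; empty if no dot.
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    ext_counts[ext] += 1

print(ext_counts.most_common())
```

As noted above, extension is not a guaranteed indicator of file type, but it is close enough for a first census of the collection.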

The following describes tests performed with Windows dynamic link library (DLL) files.

Even without access to the underlying files at NSRL itself, the presence of MD5 hashes makes it possible for anyone with a sufficiently-extensive collection of files, and a utility such as md5sum, to do some testing of the files in the NSRL database.

For example, NSRL includes a file with the MD5 hash 2bcbe445d25271e95752e5fde8a69082, and its minimal set of hashes provides the filename “IMPTIFF.DLL”.

The CodeClaim collection of code files contains about 490,000 files which are also in NSRL. One of these 490,000 files has the MD5 hash 2bcbe445d25271e95752e5fde8a69082. In CodeClaim, this file is X:\CD0138\CORELWPA\PROGRAMS\IMPTIFF.DLL; the file-system date is March 23, 1995.
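The hash cross-check can be sketched as follows. The throwaway directory here is an illustrative stand-in for a local file collection such as CodeClaim, and the one-element hash set stands in for the NSRL MD5 list:

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    """Hex MD5 of a file's contents, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().upper()

def files_also_in_nsrl(root, nsrl_md5s):
    """Walk a local collection, keeping files whose MD5 appears in NSRL."""
    return [p for p in sorted(root.rglob("*"))
            if p.is_file() and md5_of(p) in nsrl_md5s]

# Demo with a throwaway directory standing in for a CD-ROM collection.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "IMPTIFF.DLL").write_bytes(b"hello")
    # MD5("hello") is the well-known 5D41402ABC4B2A76B9719D911017C592.
    hits = files_also_in_nsrl(root, {"5D41402ABC4B2A76B9719D911017C592"})
    print([p.name for p in hits])
```

Run against a real collection, `nsrl_md5s` would be a set loaded from the MD5 column of NSRLFile.txt, and the resulting intersection identifies local files whose contents are known to NSRL.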

Of the 811,000 files with the extension DLL in NSRL, CodeClaim currently has about 27,000. I have begun testing a subset of these: about 9,900 uniquely-named DLL files, with a total size of 2.28 GB. “Uniquely-named” means for example that one file with the name “kernel32.dll” was used out of the 90 different versions in CodeClaim; this file was selected at random, and is unlikely to be the newest or largest.

A “strings” utility was run on the 9,900 DLL files, resulting in about 278 MB of output, about 10% of the size of the underlying code files. This 10% is both an over-estimate and an under-estimate of the usable text to be found in code files, at least Windows-based ones. An over-estimate, because the output contains a large amount of junk which merely looked like readable text to the “strings” utility. An under-estimate, because “strings” is only one of at least a dozen methods of extracting useful text from binary code files. For example, given GUIDs or UUIDs in the file, these can often be turned into the corresponding textual name of a protocol or service; there are several other types of numeric-to-string lookup.
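A rough equivalent of the “strings” step can be sketched in a few lines of Python, assuming the usual default of a minimum run of 4 printable ASCII characters:

```python
import re

# Runs of 4 or more printable ASCII characters, the usual "strings"
# default; shorter runs are overwhelmingly coincidental byte patterns.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def extract_strings(data):
    """Rough equivalent of the Unix 'strings' utility for one buffer."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]

# An illustrative buffer: binary noise surrounding one real identifier.
blob = b"\x00\x01MZ\x90" + b"CreateAccountsFromFile\x00" + b"\xff\x10ok\x00"
print(extract_strings(blob))  # only the 22-character identifier survives
```

The “junk which merely looked like readable text” problem is visible even here: raising the minimum run length reduces junk but also discards short genuine strings.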

How useful would strings contained in binary code files be, for a library of software prior art? A search for “->” quickly turned up many source-code fragments which had made their way into the binary code files, presumably as “asserts” or logging statements. For example:

  • !FFlag(lppcminfo->dwPcm, PCM_RECTEXCLUDE) && FFlag(lppcminfo->dwPcm, PCM_RECTBOUND)
  • !(mod & 0x0004) || (!lpbxi->fDBCSPrio && *lpchIns == ((BYTE)’\x20′)) || (lpbxi->fDBCSPrio && *lpchIns == 0x81 && *(lpchIns + 1) == 0x40)
  • !_pmsParent->IsShadow() && ((char *)(“Dirtying page in shadow multistream.”) != 0)
  • %s — g_PluginModuleInstance->DeInitializeContext() failed.
  • %s:pChannel->RespondToFastConnect returned 0x%08lx
  • ( LSeekHf( qbthr->hf, ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ), 0 ))==( ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ) )
  • ((sidTree != sidParent) || (pdeChild->GetColor() == DE_BLACK)) && ((char *)(“Dir tree corrupt – root child not black!”) != 0)
  • (FreeBlock >= ChangeLogDesc->FirstBlock) && (FreeBlock->BlockSize <= ChangeLogDesc->BufferSize) && ( ((LPBYTE)FreeBlock + FreeBlock->BlockSize) <= ChangeLogDesc->BufferEnd)

To emphasize, we know that these snippets of code are present in the underlying NSRL collection, because the files examined in this quick test all had MD5 hashes found in NSRLFile.txt.

But so what? What difference does it make that some strings of text resembling source code are located in commercial products? How useful is this for constructing a searchable library of software prior art?

The next step is to see how the types of terminology found in code files are also used in the claims of software patents. This will be discussed in the next blog post.


Posted in Uncategorized | Comments closed

US government’s National Software Reference Library (NSRL): recent article

On reading the article, it may not seem to have anything to do with IP litigation, but this National Software Reference Library could be an important basis for a prior-art software library (that is, not a collection of publications about software, but of text extracted from the software itself, for use as prior art). Modern software generally contains a large amount of useful text. This text would need to be extracted from binary/object files, and then indexed.

The National Software Reference Library
by Barbara Guttman

LinkedIn IP Litigation discussion

The list of products in the collection is available at (3 MB text file). Of course, to be useful as searchable prior art, either to litigators or the PTO, more would be needed than this list of products or even the list of individual files comprising the products. I’m going to do some tests of text extraction against some of the files in their collection.

The fingerprints right now are file-level MD5, SHA1, etc. The original purpose, as I understand it, was so that criminal investigators would know what files they did NOT need to look at when examining a suspect’s computer. They do seem to be expanding the goals; for example, they’re now working with Stanford to incorporate a large collection of software from 1975-1995 as part of a “digital curation” effort.

Use of the collection as software prior art would require going down below their current file-level granularity, to do string extraction from binaries, extraction of class headers, etc.

I started a process like this, named CodeClaim, with Frank van Gilluwe and Clive Turvey. CodeClaim is a database of software prior art, generated from the software binary code itself, as opposed to documents about the software, and in contrast to existing databases of open source, such as Black Duck and Palamida. Clive Turvey and I wrote a lot of back-end code, which was used to process several hundred CDs and a few gigabytes of sample firmware code. The processing we did employed the first few of about 20 different information-extraction methods. Some proof-of-concept testing showed that strings of text in commercial software tend to contain information that would be responsive to queries based on the terminology appearing in patent claim limitations. I also did some preliminary work on weighting of terms (so that, for example, boilerplate startup or RTL code appearing in every executable would play a reduced role in responding to queries).

Technical and legal aspects of CodeClaim are discussed, though not by name, in:


Posted in Uncategorized | Comments closed

Good article on using the Wayback Machine (archive.org) in patent litigation

One important use is as a source of reliably-dated prior art. The authors discuss admissibility and authentication issues. Two additional points not made in the article:

  • Technical experts may reasonably rely on dated web pages from the Wayback Machine.
  • In addition to web pages, the Wayback Machine also contains a substantial amount of software with datestamps — a potentially good source of software prior art.

[Way]Back to the Future: Using the Wayback Machine in Patent Litigation
By James L. Quarles III, Richard A. Crudo


Posted in Uncategorized | Comments closed

Useful short two-part article on source code discovery

5 Avoidable Pitfalls in Source Code Discovery
by David A. Prange and Esam A. Sharafuddin

This is one of the few times I’ve seen the important point made: “Don’t assume that a feature in the source code is a feature in the [accused] product.”

Part 1 of the article makes another under-appreciated point: “Consider narrowing [source code] discovery requests to target specific functionalities.” Too often the default approach is “we need it all.” What one often wants is a narrow slice, but with all supporting lower-level library code as well.


Posted in Uncategorized | Comments closed

Test of Google Code Search

Posted in Uncategorized | Comments closed

California trade secrets update, March 2011

Notes for a discussion of California trade secrets: PDF file

Posted in Uncategorized | Comments closed