US National Software Reference Library (NSRL) and Prior Art, Part 1

I’ve been looking into the possible use of the US National Software Reference Library (NSRL), http://www.nsrl.nist.gov, maintained by the National Institute of Standards and Technology (NIST), as a library of software prior art. Such a library would be useful both to the US Patent & Trademark Office (PTO) and to patent litigators.

The NSRL was originally intended largely as a set of hashes of known files, so that a criminal investigator examining a computer can tell which files do NOT need to be examined. However, NSRL is moving beyond this to “digital curation,” for example, of a Stanford University Library collection of 15,000 software products from the early days of microcomputing. These products are currently stored in boxes and indexed only by product name (which is consistent with most library software archives); NSRL, in contrast, is performing file-level cataloging of the collection.

The next step would be to index the contents of the files themselves. Software binary/object code files often contain useful strings of text, relevant for example to patent prior-art searching. Such “deep indexing” or data mining of code file contents is a goal of the “CodeClaim” project (to be described in a forthcoming blog post).

Such deep indexing of binary code files has been done in some limited areas, such as the superb PDP-10 software archive at http://pdp-10.trailing-edge.com/ in which files have been extracted from tape images, each file given its own web page, and contents of executable files included on the page, enabling a Google search for strings. See also sites such as totalhash.com which, for a variety of reasons, dump strings from Windows executable files (EXEs, DLLs, etc.) onto web pages, which are then indexed by Google (see e.g. Google search for “CEventManagerHelper::UnregisterSubscriber() : m_piEventManager->UnregisterSubscriber()”).

The core NSRL product is a hashset of 36,108,465 file hashes, listing one example of every file in the NSRL. For example, ten copies of the exact same file contents will share a single MD5 hash, even if each of the files has a different filename or file date, or came from different sources. NSRL calls this the “minimal” hashset. It is a file named NSRLFile.txt, about 4 GB in size, contained in a 2.4 GB zip file (filename rds_243m.zip) from the NSRL downloads page.

Entries in NSRLFile.txt look like this:

  • "SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
  • "00000DE72943102FBFF7BF1197A15BD0DD5910C5","AD6A8D47736CEE1C250DE420B26661B7","7854257F","PROGMAN.EXE",182032,10912,"358",""
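Records in this layout can be read with an ordinary CSV parser. The following is a minimal sketch, assuming the file is plain comma-separated text with ASCII double quotes and a header row as shown; the function name `read_records` is made up for illustration, and the sample text simply repeats the two lines above.

```python
import csv
import io

# Header row plus one record, in the layout shown above.
SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"00000DE72943102FBFF7BF1197A15BD0DD5910C5","AD6A8D47736CEE1C250DE420B26661B7",'
    '"7854257F","PROGMAN.EXE",182032,10912,"358",""\n'
)

def read_records(text_stream):
    """Yield one dict per record, keyed by the column names in the header row."""
    for row in csv.DictReader(text_stream):
        row["FileSize"] = int(row["FileSize"])  # unquoted numeric column
        yield row

records = list(read_records(io.StringIO(SAMPLE)))
print(records[0]["FileName"], records[0]["MD5"], records[0]["FileSize"])
```

For the real 4 GB file, the same generator could be fed an open file handle instead of a `StringIO`, so that records are processed one at a time rather than loaded into memory.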

Note that file dates are not included. Of course, the same exact file contents could be associated with different file dates, just as the same file contents can be associated with different file names. Dates of various types (OS file system create and write dates, (c) notice dates within files, linker dates within files) are of course crucial for a prior-art library. A method of associating dates with files will be noted later.
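Of the date types just listed, the linker date is perhaps the easiest to illustrate, at least for 32-bit Windows files in PE format, whose COFF header carries a TimeDateStamp field. The following is only a sketch of that one idea, not the association method referred to above. The function name and the synthetic header are made up for illustration, and older DOS or 16-bit (NE) executables have no such field, so the function returns None for them.

```python
import struct
from datetime import datetime, timezone

def pe_link_timestamp(data):
    """Return the COFF TimeDateStamp as a UTC datetime, or None if data is not a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 12 or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    # After the 4-byte signature: Machine (2 bytes), NumberOfSections (2 bytes),
    # then TimeDateStamp (4 bytes, seconds since 1970-01-01 UTC).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)

# Synthetic header for demonstration only; the timestamp is an arbitrary value.
header = bytearray(0x80)
header[:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"
arbitrary = datetime(1996, 1, 1, tzinfo=timezone.utc)
struct.pack_into("<I", header, 0x48, int(arbitrary.timestamp()))

print(pe_link_timestamp(bytes(header)))
```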

The collection contains media files (*.gif, *.wav, *.jpg). Crucial for a collection of prior art software, it also contains binary/object code files, for example:

  • "0000046FD530A338D03422C7D0D16A9EE087ECD9","680CA0BCE1FC7BC4136ADF4E210869C5","277D6BD5","TokenTypes.class",2075,20318,"358",""
  • "00000DE72943102FBFF7BF1197A15BD0DD5910C5","AD6A8D47736CEE1C250DE420B26661B7","7854257F","PROGMAN.EXE",182032,10912,"358",""
  • "00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC7A46FB","2D729A1E","cmnres.pdb.dll",76800,10055,"358",""

A test of file extensions (not a guaranteed method to determine file type, but close enough for current purposes) in NSRLFile.txt provides a sense of what’s currently in the NSRL:

  • Many of the 36 million files are images (3.9 million GIF, 1.3 million JPG, 0.95 million PNG)
  • Files are predominantly from Microsoft Windows
  • A little over 1.2% are marked as “Linux”
  • There are files marked as “MacOSX”, “Mac OS 9+”, etc., but these do not appear to include binary code files (e.g., FaceTime)
  • There appear to be few mobile application files, e.g. *.ipa, *.apk
  • Many of the files are archive files, e.g. *.gz, *.zip, *.cab
  • Many of the files are compressed installers, e.g. *.msi, *.dmg; note that NSRL has researched “smart unpacking” of files
  • Many of the files are still compressed using Microsoft KWAJ, e.g. *.dl_, *.ex_
  • The most-frequently-occurring binary code file extension is *.class (Java), with 1.9 million different files
  • There are 811,468 different files with the extension .DLL (dynamic link library files for Windows)
  • There are 295,870 different files with the extension .EXE (Windows executables, possibly with some older DOS EXEs)
  • There are many different versions of code files with the same name, e.g. 835 different files (different MD5 hashes) with “kernel32.dll” in the name
  • There are many text files which contain (or potentially contain) source code, including 3.8 million HTML files, and about 1.7 million C/C++ files.
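The extension test behind the counts above can be sketched roughly as follows. This is an illustration under assumptions rather than the script actually used: the function name is hypothetical, the input is taken to be the FileName column of NSRLFile.txt, and the last filename in the sample list is invented to show a name with no extension.

```python
import os
from collections import Counter

def count_extensions(filenames):
    """Tally lowercased final extensions; names without one are counted as '(none)'."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        counts[ext or "(none)"] += 1
    return counts

# The first four names appear in this post; "README" is an invented example.
names = ["TokenTypes.class", "PROGMAN.EXE", "cmnres.pdb.dll", "kernel32.dll", "README"]
counts = count_extensions(names)
for ext, n in counts.most_common():
    print(ext, n)
```

Only the final extension is counted, so a name such as cmnres.pdb.dll is tallied under .dll. As noted above, an extension is only a rough guide to the actual file type.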

The following describes tests performed with Windows dynamic link library (DLL) files.

Even without access to the underlying files at NSRL itself, the presence of MD5 hashes makes it possible for anyone with a sufficiently-extensive collection of files, and a utility such as md5sum, to do some testing of the files in the NSRL database.

For example, NSRL includes a file with the MD5 hash 2bcbe445d25271e95752e5fde8a69082, and its minimal set of hashes provides the filename “IMPTIFF.DLL”.

The CodeClaim collection of code files contains about 490,000 files which are also in NSRL. One of these 490,000 files has the MD5 hash 2bcbe445d25271e95752e5fde8a69082. In CodeClaim, this file is X:\CD0138\CORELWPA\PROGRAMS\IMPTIFF.DLL; the file-system date is March 23, 1995.
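This kind of matching can be sketched in a few lines. The sketch below assumes a set of known MD5 values has already been loaded from the MD5 column of NSRLFile.txt; here a throwaway set and temporary files stand in for both NSRL and the local collection, and the function names are made up for illustration. Hashes are uppercased to match the form used in NSRLFile.txt.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

def find_known_files(root, known_md5s):
    """Map each matching MD5 to the names of the local files under root that have it."""
    matches = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            md5 = md5_of_file(path)
            if md5 in known_md5s:
                matches.setdefault(md5, []).append(path.name)
    return matches

# Demo: two copies of the same bytes under different names, plus one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.dll").write_bytes(b"same contents")
    Path(tmp, "renamed_copy.dll").write_bytes(b"same contents")
    Path(tmp, "other.dll").write_bytes(b"different contents")
    known = {hashlib.md5(b"same contents").hexdigest().upper()}
    matches = find_known_files(tmp, known)

print(matches)
```

The two identically filled files land under the same hash despite their different names, which is the property the minimal hashset relies on.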

Of the 811,000 files with the extension DLL in NSRL, CodeClaim currently has about 27,000. I have begun testing a subset of these: about 9,900 uniquely-named DLL files, with a total size of 2.28 GB. “Uniquely-named” means for example that one file with the name “kernel32.dll” was used out of the 90 different versions in CodeClaim; this file was selected at random, and is unlikely to be the newest or largest.
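Picking one file at random per name might look like the following sketch. It assumes names are compared case-insensitively, as on Windows; the function name is hypothetical, and the two kernel32.dll paths are invented for the example.

```python
import random
from collections import defaultdict
from pathlib import PureWindowsPath

def one_per_name(paths, rng):
    """Group paths by lowercased filename and pick one path at random from each group."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PureWindowsPath(path).name.lower()].append(path)
    return {name: rng.choice(group) for name, group in sorted(by_name.items())}

paths = [
    r"X:\CD0138\CORELWPA\PROGRAMS\IMPTIFF.DLL",       # path quoted in this post
    r"X:\HYPOTHETICAL1\WINDOWS\SYSTEM\KERNEL32.DLL",  # invented for illustration
    r"X:\HYPOTHETICAL2\WINNT\SYSTEM32\kernel32.dll",  # invented for illustration
]
selected = one_per_name(paths, random.Random(0))
for name, path in selected.items():
    print(name, "->", path)
```

Passing in a seeded `random.Random` keeps the selection repeatable, which helps if the same subset needs to be regenerated later.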

A “strings” utility was run on the 9,900 DLL files, resulting in about 278 MB of output, or about 10% of the size of the underlying code files. This 10% is both an over-estimate and an under-estimate of the usable text to be found in code files, at least Windows-based ones. It is an over-estimate because the output contains a large amount of junk which merely looked like readable text to the “strings” utility. It is an under-estimate because “strings” is only one of at least a dozen methods of extracting useful text from binary code files. For example, GUIDs or UUIDs in a file can often be turned into the corresponding textual name of a protocol or service, and there are several other types of numeric-to-string lookup.
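For readers unfamiliar with what a “strings” utility does, here is a rough approximation, not the utility used for the test above. It assumes a minimum run length of 4 printable characters and looks for both plain ASCII runs and UTF-16LE (“wide”) runs, the latter being common in Windows binaries. The function name and the sample bytes are made up for illustration.

```python
import re

def extract_strings(data, min_len=4):
    """Return printable ASCII runs, then UTF-16LE runs, of at least min_len characters."""
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    wide_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_run.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_run.finditer(data)]
    return found

# Synthetic bytes: binary noise around one ASCII string and one wide string.
sample = (
    b"\x00\x01MZ\x90\x00"
    + b"%s:pChannel->RespondToFastConnect returned 0x%08lx"
    + b"\x00\xff\x02"
    + "kernel32.dll".encode("utf-16-le")
    + b"\x00\x00\x03ab\x00"
)
for text in extract_strings(sample):
    print(text)
```

Short runs such as “MZ” and “ab” fall below the minimum length and are dropped. On real files the same threshold still lets through plenty of junk, which is the over-estimate described above.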

How useful would strings contained in binary code files be, for a library of software prior art? A search for “->” quickly turned up many source-code fragments which had made their way into the binary code files, presumably as “asserts” or logging statements. For example:

  • !FFlag(lppcminfo->dwPcm, PCM_RECTEXCLUDE) && FFlag(lppcminfo->dwPcm, PCM_RECTBOUND)
  • !(mod & 0x0004) || (!lpbxi->fDBCSPrio && *lpchIns == ((BYTE)'\x20')) || (lpbxi->fDBCSPrio && *lpchIns == 0x81 && *(lpchIns + 1) == 0x40)
  • !_pmsParent->IsShadow() && ((char *)("Dirtying page in shadow multistream.") != 0)
  • %s -- g_PluginModuleInstance->DeInitializeContext() failed.
  • %s:pChannel->RespondToFastConnect returned 0x%08lx
  • ( LSeekHf( qbthr->hf, ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ), 0 ))==( ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ) )
  • ((sidTree != sidParent) || (pdeChild->GetColor() == DE_BLACK)) && ((char *)("Dir tree corrupt - root child not black!") != 0)
  • (FreeBlock >= ChangeLogDesc->FirstBlock) && (FreeBlock->BlockSize <= ChangeLogDesc->BufferSize) && ( ((LPBYTE)FreeBlock + FreeBlock->BlockSize) <= ChangeLogDesc->BufferEnd)
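A search like this one can be narrowed a little so that it skips strings where “->” is not a member access. The following sketch assumes the extracted strings are already available as a list; the pattern, which requires an identifier on both sides of the arrow, and the function name are illustrative choices, and the last sample string is invented.

```python
import re

# An identifier, then "->", then another identifier, as in ptr->member.
MEMBER_ACCESS = re.compile(r"[A-Za-z_]\w*->[A-Za-z_]\w*")

def looks_like_source(text):
    """Return True if the string contains something shaped like a C/C++ member access."""
    return MEMBER_ACCESS.search(text) is not None

candidates = [
    "%s:pChannel->RespondToFastConnect returned 0x%08lx",
    '!_pmsParent->IsShadow() && ((char *)("Dirtying page in shadow multistream.") != 0)',
    "This program cannot be run in DOS mode.",
    "<!-- comment -->",  # invented example of a non-code string containing "->"
]
hits = [text for text in candidates if looks_like_source(text)]
for text in hits:
    print(text)
```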

To emphasize, we know that these snippets of code are present in the underlying NSRL collection, because the files examined in this quick test all had MD5 hashes found in NSRLFile.txt.

But so what? What difference does it make that some strings of text which resemble source code are located in commercial products? How useful is this for constructing a searchable library of software prior art?

The next step is to see how the types of terminology found in code files are also used in the claims of software patents. This will be discussed in the next blog post.

 
