US government’s National Software Reference Library (NSRL): recent article

At first reading, the article may not seem to have anything to do with IP litigation, but the National Software Reference Library could be an important basis for a prior-art software library (that is, a collection not of publications about software, but of text extracted from the software itself, for use as prior art). Modern software generally contains a large amount of useful text. This text would need to be extracted from binary/object files and then indexed.

The National Software Reference Library
by Barbara Guttman

LinkedIn IP Litigation discussion

The list of products in the collection is available at (3 MB text file). Of course, to be useful as searchable prior art, either to litigators or the PTO, more would be needed than this list of products or even the list of individual files comprising the products. I’m going to do some tests of text extraction against some of the files in their collection.

The fingerprints right now are file-level MD5, SHA1, etc. The original purpose, as I understand it, was so that criminal investigators would know what files they did NOT need to look at when examining a suspect’s computer. They do seem to be expanding the goals, so that now for example they’re working with Stanford to incorporate a large collection of software from 1975-1995 as part of a “digital curation” effort: .
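To make the file-level fingerprinting idea concrete, here is a minimal sketch of computing MD5 and SHA-1 digests for a file and checking the result against a set of known hashes. The function names (file_fingerprints, is_known_file) and the in-memory set of known SHA-1 values are assumptions for illustration; this is not the NSRL's own tooling or data format.

```python
import hashlib


def file_fingerprints(path, chunk_size=1 << 20):
    """Return hex MD5 and SHA-1 digests of the file at `path`.

    The file is read in chunks so large files need not fit in memory.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def is_known_file(path, known_sha1):
    """True if the file's SHA-1 appears in `known_sha1` (a hypothetical
    set of lowercase hex digests standing in for a reference list)."""
    return file_fingerprints(path)["sha1"] in known_sha1
```

An investigator's tool would skip any file for which is_known_file returns True. Note that a whole-file hash says nothing about what is inside the file, which is why it is not enough for prior-art searching.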

Use of the collection as software prior art would require going down below their current file-level granularity, to do string extraction from binaries, extraction of class headers, etc.
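As a rough illustration of string extraction from binaries, the sketch below pulls runs of printable characters out of raw bytes, in the spirit of the Unix strings utility. The function name extract_strings, the minimum length of 4, and the choice to look only for ASCII and UTF-16LE text are assumptions made for this example; real extraction would need to handle more encodings, resources, and file formats.

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return printable strings of at least `min_len` characters found in `data`.

    Two simple passes are made: one for runs of printable ASCII bytes,
    and one for runs of printable ASCII characters stored as UTF-16LE
    (each character followed by a zero byte), as is common in Windows
    executables.
    """
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return found
```

The extracted strings (error messages, menu text, function and class names) are what would then be indexed and searched.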

I started a project like this, named CodeClaim, with Frank van Gilluwe and Clive Turvey. CodeClaim is a database of software prior art, generated from the software binary code itself, as opposed to using documents about the software, and in contrast to databases that exist today of open source, such as Black Duck and Palamida. Clive Turvey and I wrote a lot of back-end code, and it was used to process several hundred CDs and a few gigabytes of sample firmware code. The processing we did employed the first few of about 20 different information-extraction methods. Some proof-of-concept testing showed that strings of text in commercial software tend to contain information that would be responsive to queries based on the terminology appearing in patent claim limitations. I also did some preliminary work on weighting of terms (so that, for example, boilerplate startup or RTL code appearing in every executable would play a reduced role in responding to queries).
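One standard way to down-weight boilerplate that appears in every executable is inverse document frequency (IDF). The sketch below is only an illustration of that general technique, not a description of the weighting CodeClaim actually used; the function names (idf_weights, score) and the sample file names and strings are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Compute log(N / df) for each term.

    `docs` maps a file name to the list of strings extracted from it.
    A term found in every file gets weight 0, so it contributes nothing.
    """
    n = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per file
    return {term: math.log(n / count) for term, count in df.items()}


def score(query_terms, doc_terms, weights):
    """Sum the weights of the query terms that occur in the file."""
    present = set(doc_terms)
    return sum(weights.get(t, 0.0) for t in set(query_terms) if t in present)


# Hypothetical sample data: one runtime-library string shared by all files.
docs = {
    "a.exe": ["runtime error", "spell checker"],
    "b.exe": ["runtime error", "modem dial"],
    "c.exe": ["runtime error"],
}
weights = idf_weights(docs)
```

With this sample data, "runtime error" occurs in all three files and so gets weight 0, while "spell checker" occurs in only one file and gets a positive weight, so a query mentioning both terms ranks a.exe on the distinctive term alone.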

Technical and legal aspects of CodeClaim are discussed, though not by name, in:

