The previous post discussed the US government’s National Software Reference Library (NSRL), a collection of 15,000 commercial products, currently indexed by the files (hash, filename.ext) that comprise each product.
The post asked whether the NSRL could be used as the basis for a library of software prior art, usable by examiners at the US Patent & Trademark Office (PTO) and by software-patent litigants.
Such a database is needed. A major complaint regarding software patents is that the currently-searched sources of prior art are inadequate: “prior art in this particular industry may simply be difficult or, in some cases, impossible to find because of the nature of the software business. Unlike inventions in more established engineering fields, most software inventions are not described in published journals. Software innovations exist in the source code of commercial products and services that are available to customers. This source code is hard to catalog or search for ideas.” (Julie E. Cohen and Mark A. Lemley, Patent Scope and Innovation in the Software Industry, 89 Cal. L. Rev. 1, 13 (2001)). Query: has the situation improved since 2001? Is a larger percentage of software innovation now published in journals and patents?
Open source is an obvious software prior-art resource, and good full-text indexes of open source already exist. On the other hand, proprietary source code may not even constitute prior art, which must by definition have been publicly accessible at the relevant time. Thus, to meet the need for a database of software prior art based on something other than descriptions in published journals or previously-issued patents, it is appropriate to consider a library based on whatever text can be gleaned from publicly-accessible commercial software.
To be useful as patent “prior art,” such a library would require indexing the contents of the files. The previous post indicated that commercial software products often contain useful-looking text. Such text includes source-code fragments from “assert” statements, debug statements left in the product, error messages, dynamic-linking information, C++ function signatures, and so on. Some further examples, beyond those shown in the previous post, include the following, found inside a small sample of Windows dynamic-link libraries (DLLs) known to be part of NSRL:
- “TCP and UDP traffic directed to any cluster IP address that arrives on ports %1!d! through %2!d! is balanced across multiple members of the cluster according to the load weight of each member.” [found in netcfgx.dll]
- “This DSA was unable to write the script required to register the service principal names for the following account which are needed for mutual authentication to succeed on inbound” [adammsg.dll]
- “Consider either replacing these auditing entries with fewer, more inclusive auditing entries, not applying the auditing entries to child objects, or not proceeding with this change.” [aclui.dll]
- “Transformed vertex declaration (containing usage+index D3DDECLUSAGE_POSITIONT,0) must only use stream 0 for all elements. i.e. When vertex processing is disabled, only stream 0 can be used.” [d3d9.dll]
- “On receiving BuildContext from Primary: WaitForSingleObject Timed out. This indicates the rpc binding is taking longer than the default value of 2 minutes.” [msdtcprx.dll]
These strings appear to contain patent-relevant information: terminology such as cluster, balanced, load weight, DSA (likely Directory System Agent), script, mutual authentication, inbound, auditing entries, child objects, transformed vertex declaration, RPC (remote procedure call) binding, and so on.
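Strings like these can be pulled out of a binary with nothing more than a scan for runs of printable characters, which is essentially what the Unix “strings” utility (used for the test later in this post) does. A minimal sketch in Python; the 6-character minimum length and the example path are arbitrary choices here, not the settings used for the NSRL sample:

```python
# A minimal sketch of printable-string extraction, roughly what the Unix
# "strings" utility does. The path and the 6-character minimum are examples,
# not the actual settings used for the NSRL sample discussed in this post.
import re
import sys

def extract_strings(path, min_len=6):
    """Yield runs of at least min_len printable ASCII characters from a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    # Printable ASCII bytes (space through tilde), min_len or more in a row
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")

if __name__ == "__main__":
    for s in extract_strings(sys.argv[1]):  # e.g. python extract_strings.py d3d9.dll
        print(s)
```

A fuller extraction would also look for UTF-16 text, since much of the human-readable message text in Windows binaries (resource strings, message tables) is stored as two bytes per character.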
But do the binary/object files comprising commercial products contain enough such material to be worth indexing? And is such material of the type that would be helpful to someone searching for patent prior art?
(Whether commercial software uniquely contains such information, i.e., whether this is a good source to supplement existing sources such as previously-issued patents and published patent applications, academic literature, and so on, or whether such publications already contain what we would find inside code files, is a different question which will be addressed in a later post. Another question to be addressed later is whether additional text, while not present verbatim in binary/object code files, can be safely imputed to such files, based for example on the presence of “magic” numbers, such as GUIDs or UUIDs, module ordinal numbers, and the like.)
To test whether such files contain readily-available text, of the type useful to examiners at the PTO or to patent litigants seeking out prior art, one needs to know what sorts of searches they would be doing. These searches are typically based on patent claim language, and contain the “limitations” (elements of a device or system, or steps of a method) of the claim, together with likely synonyms or examples of each limitation.
Taking as an example an unusually short software patent claim:
- “16. A method for processing metadata of a media signal comprising: embedding metadata steganographically in the media signal, wherein the metadata in the media signal includes a location stamp including: marking an event of processing the media signal with the location stamp.” [US 7,209,571 claim 16]
One looking for prior art to this claim would likely search for sources containing all of the following terms and/or synonyms or examples for each term (another complaint about software patents is that the industry lacks standardized terminology):
- metadata
- steganographic
- media signal
- location stamp
- processing
- marking
The search would likely give more weight to less-common terms (here, “steganography”). The search would be carried out across previously-issued patents and patent applications, and printed publications.
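Purely to illustrate the weighting idea (inverse document frequency is my own choice of weighting here, not necessarily what any examiner’s search tool actually uses): a term found in few documents of the searched corpus would contribute more to a document’s score than a ubiquitous term like “processing”.

```python
# Illustration only: weight claim terms by rarity across the searched corpus
# (inverse document frequency), so a hit on "steganograph" counts for more
# than a hit on "processing". The term list and corpus are placeholders.
import math

claim_terms = ["metadata", "steganograph", "media signal",
               "location stamp", "processing", "marking"]

def idf_weights(terms, corpus):
    """Higher weight for terms that appear in fewer documents of the corpus."""
    n = len(corpus)
    return {t: math.log((n + 1) / (1 + sum(t in doc.lower() for doc in corpus))) + 1
            for t in terms}

def score(document, weights):
    """Sum the weights of the claim terms the document contains."""
    return sum(w for t, w in weights.items() if t in document.lower())
```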
(As a quick test, one might try a Google search for “metadata steganography media signal location stamp process marking”. Google however does not currently index code files found on the web — not even readily-readable ones, such as *.js files, much less binary files — though there are web sites which do extract strings from some binary code files, and these sites are, in turn, indexed by Google.)
So, how to systematically test whether commercial code files (Windows exe/dlls, Unix .so files, iPhone .ipas, etc.) contain this sort of information, matching the sorts of terminology found in software patent claims?
As a preliminary test, one can take a large number of software patents, extract the claims, find out what words appear in these claims, and then see if these words also appear in commercial code files.
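In outline, and only as a sketch of the idea (not the actual scripts behind the numbers below), the counting step looks something like this, with the two claims shown standing in for the extracted claim text:

```python
# Sketch: count how often each word appears across a set of patent claims.
# The two claims below are placeholders for the independent claims extracted
# from the sampled patents.
import re
from collections import Counter

claims = [
    "A method for processing metadata of a media signal comprising: "
    "embedding metadata steganographically in the media signal ...",
    "A system comprising a processor and a memory storing instructions ...",
]

counts = Counter(
    word
    for claim in claims
    for word in re.findall(r"[a-z]+", claim.lower())
)

for word, n in counts.most_common(50):
    print(f"{n:6d}  {word}")
```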
While there is no universally-agreed-upon standard definition of what constitutes a software patent, I randomly selected 2,000 patents from US patent classes 703, 705, 707, 709, 712, 717, 719, and 726. (I will later do a similar test using published patent applications, rather than granted patents, as a fairer test of what examiners at the US PTO would be working with.) The median patent number was 7,058,725, from 2006. I extracted independent claims from these 2,000 patents (using software which will be described in a later blog post). Individual words were then counted; a better test would break the claims into multi-word terms, by parsing along punctuation and along the most-common short words. A better test would also count the number of patents in which a word appears, rather than simply counting word occurrences. Results of the quick test done here:
- Of course, the most-frequent words include those common to any English text: the, of, a, to, and, in, for, etc.
- Next are words common to any patent claim: said, wherein, claim, method, comprising, plurality, device, etc.
- Next are generic words common to any software patent: system, information, computer, network, program, memory, storage, server, processing, request, object, database, message, application, address, instruction, etc.
- After quite a few of these generic terms, we finally get to more specific terms: cache, command, bus, receive, agent, security, link, vector, threshold, encrypted, tree, domain, channel, thread, token, browser, stack, etc.
It is the words in the 4th group which can now be sought out in strings of text extracted from binary/object code files. This test will not include the synonyms or examples for which a search would likely also be performed, nor will it consider translation between software patent terms (e.g. “socket address”) and programming terms (e.g. “sockaddr”).
Because individual patent-claim words would often appear within a single code word (e.g., “accounts” and “file” appear within the single word “CreateAccountsFromFile”), each patent-claim word was matched against the entire line of code text. Large regular expressions were created for blocks of the words appearing in the selected patent claims (e.g., “voicemail|voicexml|volatile|…”). Each of these regular expressions was then run against each unique line of text extracted from the sample of 9,900 Windows DLL files. A count was made of the number of regular expressions matched, and those strings matching four or more different regexes were then printed.
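A minimal sketch of that matching step (the file names and block size are placeholders; the four-block threshold is the one described above):

```python
# Sketch of the matching step: build alternation regexes from blocks of
# patent-claim words, run every regex against every extracted string, and
# keep strings matching at least four different blocks. File names and
# block size are placeholders.
import re

BLOCK_SIZE = 200   # words per combined regex; an arbitrary choice here
MIN_BLOCKS = 4     # the "four or more different regexes" threshold used above

claim_words = sorted(set(open("claim_words.txt").read().split()))
blocks = [claim_words[i:i + BLOCK_SIZE]
          for i in range(0, len(claim_words), BLOCK_SIZE)]
regexes = [re.compile("|".join(map(re.escape, block)), re.IGNORECASE)
           for block in blocks]

with open("dll_strings.txt", errors="replace") as strings_file:
    for line in strings_file:
        hits = sum(1 for rx in regexes if rx.search(line))
        if hits >= MIN_BLOCKS:
            print(hits, line.rstrip())
```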
The results comprised 130,000 different lines of text, out of 1,529,718 unique strings extracted from the 9,900 sample DLLs. In other words, in this simple and simplistic initial test, about 10% of the extracted strings were potentially useful in a patent prior-art search. An improved test would likely both raise and lower this 10%. Raise, because additional useful text could be found by using other techniques more sophisticated than simply running the “strings” utility. Lower, because some of the found strings are junk.
My sample of 9,900 DLLs known to be part of NSRL represents only about 1% of the 811,000 unique DLLs in the NSRL. On the other hand, there is likely less duplication of contents among the files in my sample, as it included only one copy of a given filename, e.g., kernel32.dll.
The 130,000 strings included those shown earlier in this post; those were cherry-picked. Randomly selecting ten lines instead, we see:
- An outgoing replication message resulted in a non-delivery report.%nType: %1%nMessage ID: %2%nNDR recipients: %3 of %4%n%5
- The vertical position (pixels) of the frame window relative to the screen/container.WW=
- socket %d notification dispatched
- CVideoRenderTerminal::AddFiltersToGraph() – Can’t add filter. %08x
- river util object failed to AddFilteredPrintersToUpdateDetectInfos with error %#lx
- Software\Microsoft\Internet Explorer\Default Behaviors
- PredefinedFormatsPopup
- Indexing Service Catalog Administration ClassW1
- ?pL_CaptureMenuUnderCursor@@3P6GHPAUstruct_LEAD_Bitmap@@PAUtagLEADCAPTUREINFO@@P6AH01PAX@Z2@ZA
- ??0LNParagraphStyle@@QAE@ABUCDPABDEFINITION@@W4LNUNITS@@@Z
The last two are C++ function signatures, the first of which can be automatically translated into:
- int (__stdcall* pL_CaptureMenuUnderCursor)(struct struct_LEAD_Bitmap *,struct tagLEADCAPTUREINFO *,int (__cdecl*)(struct struct_LEAD_Bitmap *,struct tagLEADCAPTUREINFO *,void *),void *)
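The translation can be done by any demangler that understands Microsoft’s C++ name-mangling scheme; for instance, Visual Studio’s undname.exe or LLVM’s llvm-undname will accept a mangled name on the command line. A small wrapper, assuming one of those tools is on the PATH:

```python
# Sketch: shell out to a demangler that understands Microsoft C++ name
# mangling (llvm-undname here; Visual Studio's undname.exe behaves similarly)
# and return whatever it prints for the mangled name.
import subprocess

def demangle(mangled, tool="llvm-undname"):
    result = subprocess.run([tool, mangled], capture_output=True, text=True)
    return result.stdout.strip()

print(demangle("??0LNParagraphStyle@@QAE@ABUCDPABDEFINITION@@W4LNUNITS@@@Z"))
```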
Since the extracted strings as a whole represented about 10% of the content of the underlying DLL files, and since it appears that about 10% of those extracted strings are potentially useful in a prior-art search, a very rough estimate is that 1% of the contents of code files (at least of this type, DLLs for Microsoft Windows) would be useful. As noted earlier, the tested DLLs represent about 1% of NSRL’s collection of DLL files (though likely with less duplicated contents of files, e.g., only one version of a file named kernel32.dll). Thus, the 130,000 potentially-useful strings in this test may represent about 10 million such strings among NSRL’s DLL files.
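Spelled out, the back-of-envelope arithmetic (all of the inputs are the rough figures from this post, so the outputs are rough too; the raw ~13 million figure is rounded down to roughly 10 million given the duplication caveat):

```python
# Back-of-the-envelope version of the estimate above, using the rough
# figures from this post.
useful_strings_in_sample = 130_000   # strings matching four or more regex blocks
sample_share_of_nsrl_dlls = 0.01     # ~9,900 of ~811,000 unique DLLs
strings_share_of_dll_content = 0.10  # extracted strings ~10% of DLL contents
useful_share_of_strings = 0.10       # ~130,000 of ~1.53 million strings

print(strings_share_of_dll_content * useful_share_of_strings)  # ~1% of DLL contents useful
print(useful_strings_in_sample / sample_share_of_nsrl_dlls)    # ~13 million across NSRL's DLLs
```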
These DLL files comprise about 2.25% of the total 36 million different files in NSRL’s collection; however, as noted in the previous post, the bulk of the NSRL files are not code but media. Further, some code files may be less “chatty” than Windows DLLs.
On the other hand, the information extraction conducted in this test was bare-bones; there are many additional ways to extract information from binary/object code files. For example, such files often contain “magic” numbers which are readily turned into text indicative of a protocol or service employed by the code. An example was shown above of turning a “mangled” C++ function signature into a “demangled” version which looks like source code. Another example is associating GUIDs and UUIDs with names of services. Many of the DLL files in the test done here could easily have been supplemented with matching debug information (PDB files) from Microsoft’s Symbol Server (this will be shown in a future blog post).
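To make the GUID/UUID idea concrete: COM interface and class IDs sit in binaries as 16-byte GUID structures, so a file can be scanned for the byte patterns of known GUIDs and tagged with the corresponding interface or service names. A sketch, with the well-known IID of IUnknown standing in for what would really be a table of many thousands of known GUIDs:

```python
# Sketch: scan a binary for the on-disk byte patterns of known GUIDs and
# report which ones are present. Only one well-known GUID (IUnknown's IID)
# is listed here; a real lookup table would be far larger.
import sys
import uuid

KNOWN_GUIDS = {
    "IID_IUnknown": uuid.UUID("00000000-0000-0000-C000-000000000046"),
}

def find_guids(path):
    with open(path, "rb") as f:
        data = f.read()
    # .bytes_le is the little-endian layout a Windows GUID struct has in memory/on disk
    return [name for name, guid in KNOWN_GUIDS.items() if guid.bytes_le in data]

if __name__ == "__main__":
    print(find_guids(sys.argv[1]))
```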
Some conclusions from these very preliminary findings will be drawn in the next blog post.