Chapter 17: Indexing source code
17.1 Why source code generally should be indexed before searching
- Huge volume of source code typically produced in litigation; searching is too slow without prior indexing
- Indexing can create new information in its own right, such as counts and frequencies of occurrence
- “Searching by counting”
- “Searching by sorting”
- Tokenization
- Normalization (e.g. replacing all names with XXX, for code comparisons)
- Identifying key unique terms and phrases in code (similar to Amazon's “statistically improbable phrases,” or SIPs)
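
The tokenization, normalization, and counting steps listed above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the token pattern is deliberately crude, the KEYWORDS set is a tiny hypothetical list rather than any real language's keyword table, and the function names are invented for this example. It is a rough outline of the idea, not a production tokenizer.

```python
import re
from collections import Counter

# Assumed, simplified token pattern: an identifier, a run of digits,
# or a single punctuation character. Real languages need more care
# (string literals, comments, multi-character operators).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")

# Hypothetical, deliberately tiny keyword list for illustration only.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def tokenize(text):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def normalize(tokens):
    """Replace every non-keyword identifier with XXX, so that two
    snippets differing only in names produce the same token list."""
    return [
        "XXX" if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS else tok
        for tok in tokens
    ]


def rarest_tokens(tokens, n=5):
    """'Searching by counting' and 'by sorting': count each token,
    then sort ascending by count (ties broken alphabetically) so the
    least frequent, most distinctive tokens come first."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]
```

For example, `normalize(tokenize("int total = count + 1;"))` and `normalize(tokenize("int sum = n + 1;"))` yield the same list, which is the property that makes normalized code comparable.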
17.2 Indexing with standard tools
- SciTools Understand, including Lucene-based indices
- dtSearch
- Windows Search with indexing enabled, for searching the contents of files across directories
- Triage: selecting which portions of source code to index
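
One way to picture the triage step is as a filter applied before any indexing begins. The following Python sketch is an assumption-laden illustration: the extension list, the size cap, and the function names are all hypothetical choices, and real triage in a given case would depend on what was produced and on what the examiner is looking for.

```python
import os

# Assumed defaults for illustration; a real exam would tailor both.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py"}
MAX_BYTES = 2_000_000  # skip very large files, which are often generated or binary


def should_index(path, size, extensions=SOURCE_EXTENSIONS, max_bytes=MAX_BYTES):
    """Decide from the file name and size alone whether to index a file."""
    ext = os.path.splitext(path)[1].lower()
    return ext in extensions and 0 < size <= max_bytes


def select_files(root, extensions=SOURCE_EXTENSIONS, max_bytes=MAX_BYTES):
    """Walk a directory tree and yield the paths that pass triage."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable entry: skip it rather than abort the walk
            if should_index(path, size, extensions, max_bytes):
                yield path
```

Keeping the decision in `should_index`, separate from the directory walk, makes the triage rule easy to review and adjust without touching the traversal code.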
17.3 Creating an index using scripting languages commonly available on locked-down source code machines
- The locked-down, non-networked computer problem; see ch. 15 on the source-code exam environment
- Indexing/search program written in awk, for use on Linux and Mac OS X computers
- Indexing/search program written in Visual Basic, for use on Windows computers
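
The general shape of such an indexing/search program can be sketched as an inverted index: a table mapping each term to the places where it occurs. The sketch below is in Python purely for illustration. It is not the awk or Visual Basic program referred to above, Python may well be unavailable on a locked-down machine, and the function names and the in-memory `files` dictionary are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed definition of a "term": an identifier-like word.
WORD_RE = re.compile(r"[A-Za-z_]\w*")


def build_index(files):
    """Build an inverted index.

    files: dict mapping a file name to that file's full text.
    Returns a dict mapping each lowercased term to a sorted list of
    (file name, line number) pairs where the term occurs.
    """
    index = defaultdict(set)
    for name, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for word in WORD_RE.findall(line):
                index[word.lower()].add((name, line_no))
    return {word: sorted(hits) for word, hits in index.items()}


def search(index, *terms):
    """Return the (file name, line number) pairs on which ALL of the
    given terms occur, using only index lookups (no rescan of files)."""
    hit_sets = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*hit_sets)) if hit_sets else []
```

Once the index is built, each search is a dictionary lookup plus a set intersection, which is why indexing first pays off when the same body of code is searched many times.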