March 18, 2005
Identifying Custom Code - Binary
Last week I presented a scenario in which the correct version of custom code for a 21 CFR Part 11 LIMS had been lost, which gave us the opportunity to review several methods of re-identifying custom code that was textual (meaning human readable). Let's now look for a way to reliably identify binary code in that same scenario.
Binary files are not meant for the eyes of mere mortals. This is advantageous from a system perspective because computer programs can use characters and symbols to represent data in a much more compact format, and without the ambiguity inherent in the English language or the strictures of a programming language. The resulting files are easier for the computer system to read, which translates into improved system performance. A good example is a C++ program that might take 1000 bytes to express in the C++ programming language but only 27 bytes once it has been compiled into machine language, that is, binary.
If you're using a Windows operating system, try using Notepad to open c:\Windows\hh.exe. If you look through all the strange characters you will find snippets of English phrases, "This program cannot be run in DOS mode", for example. Nevertheless, the file as a whole has no meaning to a human. From the perspective of trying to re-identify custom code, you quickly realize that binary files cannot be examined like text files. Of all the manual methods of file identification we discussed with regard to textual files, visual identification is the only one that is now completely impossible.
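The Notepad experiment can be approximated in a few lines of Python. This is only a sketch: the function name printable_strings, the four-character minimum, and the sample bytes are all made up for illustration and are not taken from hh.exe itself.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return every run of printable ASCII that is at least min_len long."""
    # \x20-\x7e covers the space character through the tilde.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


# Hypothetical bytes shaped loosely like the start of a Windows executable.
sample = (
    b"MZ\x90\x00\x03\x00"
    b"This program cannot be run in DOS mode.\r\n$"
    b"\x00\x00PE\x00\x00"
)

found = printable_strings(sample)
print(found)
```

The readable snippet is recovered, but everything around it stays opaque, which is why a visual comparison of two binaries tells you very little.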
In addition, as we discussed with textual files, file size and timestamp are still unreliable and should be considered only as a last resort.
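To see why size and timestamp prove so little, here is a small sketch in Python. The file names, the byte values, and the helper same_size_and_mtime are hypothetical; the point is only that two files with different contents can share both pieces of metadata.

```python
import os
import tempfile


def same_size_and_mtime(path_a: str, path_b: str) -> bool:
    """Compare two files using only their size and modification time."""
    a, b = os.stat(path_a), os.stat(path_b)
    return a.st_size == b.st_size and a.st_mtime == b.st_mtime


with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.bin")
    bad = os.path.join(folder, "bad.bin")

    # Same length, but the third byte differs.
    with open(good, "wb") as f:
        f.write(b"\x00\x01\x02\x03")
    with open(bad, "wb") as f:
        f.write(b"\x00\x01\xff\x03")

    # Timestamps are trivially settable, so force both to the same value.
    stamp = 1111111111  # arbitrary example time, in seconds since the epoch
    os.utime(good, (stamp, stamp))
    os.utime(bad, (stamp, stamp))

    metadata_match = same_size_and_mtime(good, bad)
    with open(good, "rb") as f1, open(bad, "rb") as f2:
        contents_match = f1.read() == f2.read()

print("metadata match:", metadata_match)
print("contents match:", contents_match)
```

The metadata agrees while the contents do not, so a match on size and timestamp is not evidence that you have found the right file.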
From a technical perspective, using a digital signature is an excellent choice, an even better choice than when comparing textual files, since binary files do not have the end-of-line dilemma. The reason for this is simple: ending a line and starting a new one is required only by humans; computers don't typically organize data that way. The process is the same: identify a control file and compare its digital signature with that of the file in question. If the binary file you're trying to identify only exists in a binary format (a Word document, for example), then you're finished: you've successfully re-identified your custom code.
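The comparison against a control file might be sketched as follows. I'm assuming here that "digital signature" means a cryptographic hash of the file's bytes, and I've picked SHA-256 from Python's hashlib for the example; the function names and the sample files are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in binary mode."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:  # "rb": no end-of-line translation occurs
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_control(control_path: str, candidate_path: str) -> bool:
    """True when the candidate's digest equals the control file's digest."""
    return file_digest(control_path) == file_digest(candidate_path)


with tempfile.TemporaryDirectory() as folder:
    control = os.path.join(folder, "control.bin")
    copy = os.path.join(folder, "copy.bin")
    altered = os.path.join(folder, "altered.bin")

    payload = bytes(range(256)) * 4
    for path, data in (
        (control, payload),
        (copy, payload),
        (altered, payload[:-1] + b"\x00"),  # only the last byte differs
    ):
        with open(path, "wb") as f:
            f.write(data)

    same_result = matches_control(control, copy)
    changed_result = matches_control(control, altered)

print("exact copy matches:", same_result)
print("altered file matches:", changed_result)
```

Reading in chunks keeps memory use flat for large binaries, and a single changed byte is enough to make the digests disagree.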
However, if the binary file you're trying to identify is a compiled version of a source file, then you've still only identified something that is unreadable by humans, and does not contain enough data from which to create, or extrapolate, the original source file. Remember the C++ example we discussed earlier? In that example, a large source file was used to create a small binary, meaning that a large amount of human readable data was lost in the process. While the compilation process is advantageous for creating fast code, there does not exist a de-compilation process that is reliable and consistent. In fact, without getting into a much larger discussion, let's just say that the de-compilation of binary files should not be considered.
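The loss of human-readable data can be shown on a small scale. The sketch below swaps in Python's built-in compile() and its bytecode as a stand-in for a C++ compiler and machine code, so treat it as an analogy only; the two source strings are invented for the example.

```python
# Two source texts that differ only in spacing and a comment.
documented = "answer = 6 * 7  # the meaning of life, per the requirements doc"
terse = "answer=6*7"

code_a = compile(documented, "<documented>", "exec")
code_b = compile(terse, "<terse>", "exec")

# The compiled instructions carry no trace of the comment or the spacing.
same_instructions = code_a.co_code == code_b.co_code

print("source lengths:", len(documented), "vs", len(terse))
print("compiled instructions identical:", same_instructions)
```

Because both sources produce the same instructions, nothing in the compiled form can tell you which one was the original, and the comment cannot be recovered at all.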
If you were expecting some foolproof way to identify misplaced binary files, then I'm afraid you've expected far too much. My response is this: don't lose source code. You'll be much happier.
Posted by Jeff Vannest at March 18, 2005 09:05 PM