Jeff Vannest's Weblog

March 15, 2005

Identifying Custom Code - Textual

Eventually, in the lifecycle of every project, someone will look at a piece of custom software and think, "Is this the right version, or did I put the new version on my other computer?" That seems silly for a company that has likely spent thousands of dollars analyzing, developing, and validating that piece of software, yet it happens at even the best companies. Let's look at some ways a company can re-identify a piece of custom code that is textual, meaning that the code is stored in a human-entered, human-readable format.

The simplest way to compare two text files is to inspect the contents. If one file contains the words "Return 1" and the other contains "Return 2", then the files are different. As the files become larger, inspecting the contents visually becomes infeasible. Some text editors (TextPad, for example) allow the user to load two text files and compare them against one another. If the files are identical, the program reports "The files are identical". If they are not, the program lists every line in the first file that differs from the second. Many source code control programs (CVS, for example) can also detect changes in text files. Linux and other UNIX-like operating systems even include a command called "diff" that compares two files right at the command line.
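As a rough illustration of what such a comparison tool does, here is a minimal sketch using Python's standard difflib module. The function name diff_lines and the labels "first" and "second" are my own choices for the example, not part of any tool mentioned above.

```python
import difflib


def diff_lines(old_text, new_text):
    """Return a unified diff of two texts as a list of lines (empty if identical).

    splitlines() drops the line terminators, so this compares the visible
    content of each line only.
    """
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="first",
        tofile="second",
        lineterm="",
    ))


if __name__ == "__main__":
    for line in diff_lines("BEGIN\nReturn 1\nEND", "BEGIN\nReturn 2\nEND"):
        print(line)
```

An empty list means the two texts have the same visible lines; otherwise the lines marked with "-" and "+" show what was removed and added.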

Another method is to compare the files' timestamps and sizes. Comparing by timestamp is one of the least acceptable methods of comparing file contents, although it is the method favored by the non-technical. First, timestamps can be intentionally altered. This is unacceptable since 21 CFR Part 11 is clear about the establishment of policies that "deter record...falsification". While a piece of custom code may not primarily constitute a "record", it is easily argued that it may be used to create and maintain records in the system, and therefore must be treated with the same care as a system record. Second, timestamps can be unintentionally altered. Various methods of transporting a file can alter its timestamp, including FTP transfer, email attachment through corporate or internet mail servers, transfer over various forms of instant messaging, and check-in or check-out with certain source code control programs.
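To show how little a timestamp proves, the following sketch backdates a temporary file by one day without touching a single byte of its contents. The helper names backdate and demo and the file name are hypothetical; os.utime and os.stat are standard Python calls.

```python
import os
import tempfile


def backdate(path, seconds):
    """Move a file's access and modification times back by `seconds`."""
    info = os.stat(path)
    os.utime(path, (info.st_atime - seconds, info.st_mtime - seconds))


def demo():
    """Return (timestamp_changed, content_changed) after backdating a temp file."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "server_datetime.sql")  # hypothetical name
        with open(path, "wb") as handle:
            handle.write(b"Return 1")
        time_before = os.stat(path).st_mtime
        backdate(path, 86400)  # pretend the file was saved a day earlier
        time_after = os.stat(path).st_mtime
        with open(path, "rb") as handle:
            content = handle.read()
        return time_after != time_before, content != b"Return 1"


if __name__ == "__main__":
    print(demo())
```

The timestamp moves while the contents stay exactly as written, which is why a timestamp alone says nothing reliable about what is inside the file.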

Comparing files by size is not a recommended method either. Two files with different contents can have exactly the same size, because file size reports the number of bytes the file contains, not the contents of those bytes. Going back to our "Return 1" and "Return 2" examples, both files are 8 bytes long when saved on a Windows operating system with no line ending after the text. Since the characters "1" and "2" each occupy one byte, the two files would be considered identical if judged by file size alone.
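The arithmetic is easy to check. This small sketch treats the two example files as byte strings, assuming no line ending after the text; the variable names are mine.

```python
first = b"Return 1"
second = b"Return 2"

same_size = len(first) == len(second)   # both are 8 bytes long
same_content = first == second          # but the last byte differs

print(len(first), len(second), same_size, same_content)
# prints: 8 8 True False
```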

The technical method of comparing files is to use a program to calculate a digital signature (more precisely, a hash or message digest) based on a known and reliable algorithm. MD5 is probably the most famous algorithm in use today. However, based on recently discovered flaws, it is likely that MD5 will be replaced by the SHA-1 algorithm, which produces a longer digest (160 bits against MD5's 128) and for which no collision has been publicly demonstrated as of this writing. The algorithm takes every byte of the file into account (you can think of a byte as a single character for now) and calculates a string of characters that is, for practical purposes, unique to that file's contents. Changing a single character in the file therefore changes its digital signature. While this seems like the perfect solution, the digital signature of a text file can change without any change to the visible contents of the file. Although this seems like a showstopper, the possible reasons for a signature change are known and repeatable, meaning that a knowledgeable person can get matching signatures every time from identical files if the environment of the file is maintained.
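Here is a minimal sketch of calculating such a signature with Python's standard hashlib module. The function name digest is my own. SHA-1 is the default only because it is the algorithm named above; the algorithm is a parameter, so any other name hashlib supports (sha256, for example) can be substituted.

```python
import hashlib


def digest(data, algorithm="sha1"):
    """Return the hexadecimal digest of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


if __name__ == "__main__":
    # One changed character produces a completely different signature.
    print(digest(b"Return 1"))
    print(digest(b"Return 2"))
```

A SHA-1 digest is 160 bits long, which prints as 40 hexadecimal characters.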

Text files are composed of characters you can see and characters you can't. The characters you can see typically represent words or symbols. For example, let's assume a LIMS is using the following Oracle PL/SQL function:

FUNCTION server_datetime RETURN VARCHAR2 IS
BEGIN
  RETURN TO_CHAR(SYSDATE, 'DD-MON-YYYY HH24:MI:SS');
END;

This function also contains certain characters you can't see, which in programming circles are called "whitespace". The most famous in the whitespace entourage is the space character, followed by his lackeys the tab and new line characters. Spaces are simple: you hit the spacebar on your keyboard and the computer enters ASCII character number 32. Likewise, tabs are simple: you hit the tab key and the computer enters ASCII character number 9. End-of-line characters are more complex, and consist of one or two characters depending on the computer operating system: the carriage return, which is ASCII character 13, and the line feed (sometimes called new line), which is ASCII character 10. When a text file is written and saved on a Windows computer, every line ends with a carriage return followed by a line feed. A classic Macintosh (before Mac OS X) uses only a carriage return, and a UNIX computer uses only a line feed.
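To see which invisible characters a file actually contains, you can count them in the raw bytes. This is a small sketch; the names WHITESPACE_CODES and line_ending_counts are my own.

```python
# ASCII codes of the whitespace characters discussed above.
WHITESPACE_CODES = {
    "space": ord(" "),
    "tab": ord("\t"),
    "carriage return": ord("\r"),
    "line feed": ord("\n"),
}


def line_ending_counts(data):
    """Count Windows (CRLF), lone carriage return, and lone line feed endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # carriage returns not followed by LF
        "LF": data.count(b"\n") - crlf,  # line feeds not preceded by CR
    }


if __name__ == "__main__":
    print(WHITESPACE_CODES)
    print(line_ending_counts(b"BEGIN\r\nEND;\r\n"))
```

A file saved on Windows should report only CRLF endings; a mix of styles usually means the file has passed through more than one system or editor.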

Given this knowledge, we know that a text file saved on a Windows operating system will contain exactly the same bytes, and therefore have the same digital signature, every time. If, however, the file is transferred to a UNIX system, opened, re-saved, and transferred back to Windows, its digital signature can change even if no content was changed on the other system. This happens when the UNIX editor or the transfer itself (an FTP transfer in ASCII mode, for example) strips the carriage returns from the file (remember, UNIX computers don't use carriage returns to end lines of text). The function shown above would then be missing 4 bytes of data, one byte for each of its four lines. As you can probably guess, I still recommend using the digital signature method to identify the contents of text files, not because it is perfect and foolproof, but because it is technically reliable.
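The round trip can be simulated in a few lines. This sketch builds the function with Windows line endings, strips the carriage returns as a UNIX system might, and compares SHA-1 signatures. It assumes every one of the four lines ends with a line terminator. The normalized helper is my own illustration of one way to rule out line endings as the cause of a mismatch, not a method prescribed here.

```python
import hashlib

LINES = [
    "FUNCTION server_datetime RETURN VARCHAR2 IS",
    "BEGIN",
    "  RETURN TO_CHAR(SYSDATE, 'DD-MON-YYYY HH24:MI:SS');",
    "END;",
]

# The file as saved on Windows: every line ends with CR + LF.
windows_copy = "".join(line + "\r\n" for line in LINES).encode("ascii")
# The same file after the carriage returns have been stripped.
unix_copy = windows_copy.replace(b"\r\n", b"\n")

bytes_lost = len(windows_copy) - len(unix_copy)  # one byte per line


def sha1_hex(data):
    """Return the SHA-1 signature of a byte string as hexadecimal text."""
    return hashlib.sha1(data).hexdigest()


def normalized(data):
    """Convert CRLF and lone CR line endings to LF before hashing."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    print(bytes_lost)
    print(sha1_hex(windows_copy) == sha1_hex(unix_copy))
    print(sha1_hex(normalized(windows_copy)) == sha1_hex(normalized(unix_copy)))
```

If two signatures differ on the raw bytes but match after normalizing, the only difference between the files is the line-ending style.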

Back to our original problem: what if you're not sure whether a piece of custom code is the correct version? Simple: choose a control and compare the file in question against it. If the source code is already stored in a source code control tool, use that copy as your control. If the file is not available in the control tool, pull a copy of the file from your production server and use that as your control. Run a digital signature on both pieces of code; if the signatures match, the files are identical. If the signatures do not match, the file in question is either an earlier or a later iteration of the source (or differs only in invisible characters such as line endings), and you'll need to visually inspect the file. Certainly, a visual inspection is not ideal, but the correct version of the source code should not have been lost in the first place, so let's not beat that dead horse.
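The control comparison described above might be sketched like this. The function names file_digest and matches_control and the paths in the usage lines are hypothetical. Files are read in binary mode and in chunks, so every byte counts and large files do not have to fit in memory.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Return the hexadecimal digest of a file, read in binary chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_control(candidate_path, control_path, algorithm="sha1"):
    """Return True when the candidate file has the same digest as the control."""
    return (file_digest(candidate_path, algorithm)
            == file_digest(control_path, algorithm))


if __name__ == "__main__":
    # Hypothetical paths: a questionable copy and the copy pulled from production.
    if matches_control("my_copy/server_datetime.sql", "control/server_datetime.sql"):
        print("Signatures match: the files are identical.")
    else:
        print("Signatures differ: inspect the file.")
```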

Next week we'll look at how to identify custom code that is stored in a binary format.

Posted by Jeff Vannest at March 15, 2005 06:30 PM
