For the past 12 years or so, I've been employed primarily as a document processing specialist, mostly working as a temp for various global law firms' NYC offices (we're talking mainly about law firms with anywhere from 400 to 4,000 attorneys--yes, you read right, four thousand attorneys). That title is basically just a fancy term for a word processor, except that we go beyond word processing.
Beyond editing and formatting complex legal and non-legal documents, we're expected to troubleshoot documents that have gone "bad" and rescue them. Worst of all, most of our work is done in Microsoft Office, which not only is bloatware itself but turns all of its documents into bloatware, too.
So nine years ago (back in 2000) when we (and I mean that in the most general sense -- the "we" being those in the IT industry working in legal environments and those of us in document production centers for these huge global law firms) noticed this tendency of documents to become bloated--e.g., document size went up even if text was being deleted and nothing was being added--we decided to have a look-see into the files.
At the time, half of the documents were still being produced in WordPerfect, which wasn't showing any signs of bloated documents, so we knew it was Microsoft doing something to the documents. The law firms had their IT departments investigate the documents, digging into Microsoft's file format (with Microsoft's help, of course). Microsoft was just getting into the business of selling its products to these large law firms, and the only reason that the law firms were switching was that their clients were demanding it.
So P.S., the IT department figured out that MS Office documents kept getting larger and larger in size because the file format was based on meta information. That is, Office kept track of everything that you did in a document and stored it as a command. So if you deleted something, the deleted text was still there -- Office just didn't show it because it was marked as deleted.
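To make that idea concrete, here's a toy model in Python of a format that stores commands rather than final text. It's only a sketch of the behavior described above, not Microsoft's actual file layout, and every name in it (`Edit`, `ToyDocument`, `visible_text`, `stored_size`) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    author: str  # who made the change
    action: str  # "insert" or "delete"
    text: str    # the text that was inserted or deleted


class ToyDocument:
    """Hypothetical append-only document: edits are logged, never erased."""

    def __init__(self):
        self.history = []

    def insert(self, author, text):
        self.history.append(Edit(author, "insert", text))

    def delete(self, author, text):
        # Deleting adds a record; the original text stays in the history.
        self.history.append(Edit(author, "delete", text))

    def visible_text(self):
        # Replay the log to get what the user sees on screen.
        text = ""
        for edit in self.history:
            if edit.action == "insert":
                text += edit.text
            else:
                text = text.replace(edit.text, "", 1)
        return text

    def stored_size(self):
        # Everything ever typed or deleted still takes up space.
        return sum(len(edit.text) for edit in self.history)


if __name__ == "__main__":
    doc = ToyDocument()
    doc.insert("associate", "The buyer shall pay $5 million.")
    doc.delete("partner", "$5 million")
    print(doc.visible_text())
    print(len(doc.visible_text()), "visible vs", doc.stored_size(), "stored")
    for edit in doc.history:
        print(edit.author, edit.action, repr(edit.text))
```

In this toy, deleting text makes the stored document bigger, and the log still says who deleted what, which is the combination that got the litigators so excited.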
Well, over the course of the next few months, there was a flurry of buzz and activity, especially in litigation departments. The war machines of litigation went into an all-out frenzy, because they were able to discover exactly *what* had been done to a document by an opposing party to the lawsuit and even what opposing counsel had done to a document. And not only did it show what was done to a document throughout the document's entire history, it showed who did it!
Obviously, this caused great problems for clients and attorneys alike who were embroiled in the discovery process during litigation, and it also affected opposing parties during all sorts of different legal scenarios, such as contract negotiations or mergers and acquisitions. Opposing counsel, through their IT departments, were able to view everything that was done to a document. They could see that certain words were replaced in favour of other words, and they were also able to see who made the change. Was it a junior associate with little experience, or the senior partner on the legal team? Or was it the document processing department making edits submitted by an attorney? If so, they could then subpoena those records.
You can see what a real nightmare this turned out to be. So law firms quickly developed policies regarding how to deal with all of this "meta" information. If the document was staying internal to the firm, nothing was done to it. However, if the document was being sent outside of the firm -- even to its own clients -- that meta information couldn't be included in the document, no IFs, ANDs, or BUTs about it. The meta information had to be stripped, and stripped quickly.
At first, the easiest way to do this was to just convert the document to a PDF. At the time, Adobe's Portable Document Format didn't include any of this meta-information; the text simply was converted to their format, and any extraneous information was left behind, leaving a forensically-clean document. Oh, speaking of forensics, let's not forget to mention that lawyers weren't the only ones to wield this meta-information against opponents; the government also was able to utilize this information in the prosecution of criminals.
The problem with the PDFs is that they were, for the most part, uneditable, and sooner (rather than later) the law firms' clients began to grumble when they started receiving these PDF files whenever they asked for a copy of a document to be emailed to them. A better solution was needed.
Along came the meta-stripper utility. I don't know who got to market first, and it doesn't really matter. The fact was that it became available, and it was usually integrated into the firms' document management systems. (In large organizations where, quite literally, millions of documents are stored on a network, a document management system, or DMS, is employed. Usually, documents are saved on the network storage system under a number, and the DMS is basically a database that contains information about the file, matching descriptions of the document with the document number.)
The meta-strippers do basically that: they strip out all of the meta-information from a document, leaving a PDF-clean version of the document, the major difference being that the document was still editable. So the lawyers solved their problems and the clients were happy.
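For a rough idea of what a stripper does, here's a minimal Python sketch. Big assumption: it targets the newer zip-based .docx format rather than the binary .doc files of the period described above, and it only blanks two fields in the package's `docProps/core.xml` part (the author and "last modified by" names). A real scrubbing utility handles far more, such as tracked changes and comments. The function names and `make_sample_package` are hypothetical helpers for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"

# Keep the conventional prefixes when the XML is written back out.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def _local(tag):
    """Drop the '{namespace}' part of an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def read_core_properties(package: bytes) -> dict:
    """Return {field name: text} from the core-properties part, if any."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        if CORE_PART not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read(CORE_PART))
    return {_local(child.tag): child.text or "" for child in root}


def strip_core_properties(package: bytes,
                          fields=("creator", "lastModifiedBy")) -> bytes:
    """Copy the package, blanking the named core-property fields."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for child in root:
                    if _local(child.tag) in fields:
                        child.text = ""
                data = ET.tostring(root, encoding="utf-8",
                                   xml_declaration=True)
            dst.writestr(item, data)  # every other part is copied unchanged
    return out.getvalue()


def make_sample_package() -> bytes:
    """Build a tiny stand-in package for the demo (not a full Word file)."""
    core = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:title>Merger Agreement</dc:title>"
        "<dc:creator>Jane Associate</dc:creator>"
        "<cp:lastModifiedBy>Senior Partner</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    ) % (NAMESPACES["cp"], NAMESPACES["dc"])
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(CORE_PART, core)
        zf.writestr("word/document.xml", "<doc/>")
    return buf.getvalue()


if __name__ == "__main__":
    original = make_sample_package()
    print(read_core_properties(original))
    print(read_core_properties(strip_core_properties(original)))
```

The point of copying every other part untouched is the same one the law firms cared about: the document stays editable, and only the identifying fields go away.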
Fast forward to 2008. A news story just broke that Google, Inc. was caught with their pants down, because they submitted an anonymous document that criticized eBay's decision to allow only PayPal for payment of auctions on eBay, effectively keeping Google's own Google Checkout competition out of the running and out of the loop. How were they caught, you might be asking?
Apparently, Google submitted a PDF, anonymously. But the PDF contained meta-data (apparently, Adobe changed the file format of the PDF to allow for meta-data) and--yup, you guessed it--Google's fingerprints were all over the metadata.
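If you're curious what that kind of fingerprint looks like, here's a deliberately naive Python sketch that scans a PDF's raw bytes for a few common document-information keys (`/Author`, `/Creator`, `/Producer`, `/Title`). It assumes the values are stored as plain, uncompressed literal strings without nested parentheses, which is not true of every PDF; files that compress or encrypt these entries, or that keep their metadata in XMP, won't be picked up. `scan_pdf_info` is a made-up name, and this has nothing to do with how the Google document was actually examined.

```python
import re

INFO_KEYS = (b"Author", b"Creator", b"Producer", b"Title")


def scan_pdf_info(pdf_bytes: bytes) -> dict:
    """Return any plainly stored info-dictionary strings found in the bytes."""
    found = {}
    for key in INFO_KEYS:
        # Match "/Key (value)", allowing backslash escapes inside the value.
        pattern = rb"/" + key + rb"\s*\(((?:\\.|[^\\()])*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            # Escape sequences are left as-is; latin-1 never fails to decode.
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found


if __name__ == "__main__":
    # A hand-written fragment standing in for a real file.
    sample = (
        b"%PDF-1.4\n1 0 obj\n"
        b"<< /Author (Jane Doe) /Producer (Acme PDF Writer) "
        b"/CreationDate (D:20080101) >>\nendobj\n"
    )
    print(scan_pdf_info(sample))
```

To try it on a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in.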
So you see, your documents might be telling a lot more about you than you realize.
Most digital cameras nowadays save metadata to the picture--this is sometimes referred to as EXIF information (don't ask me why). It can include GPS location coordinates, the time and date that the photo was taken, the device and serial # of the device used to take the photograph, as well as a few other tidbits of information that could be used to track you down.
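Here's a small Python sketch of how that information sits inside a JPEG and how it can be dropped. It relies on the standard JPEG layout: the file starts with the marker `FF D8`, followed by segments that each carry a marker and a two-byte length, and EXIF data lives in an APP1 segment (`FF E1`) whose payload begins with `Exif` and two zero bytes. The sketch assumes a well-formed file with no padding bytes between segments, and the function names `has_exif` and `strip_exif` are my own.

```python
import struct

SOI = b"\xff\xd8"              # start-of-image marker
EXIF_HEADER = b"Exif\x00\x00"  # first bytes of an EXIF APP1 payload


def _segments(jpeg_bytes: bytes):
    """Yield (marker, start, end) for each header segment before image data."""
    if jpeg_bytes[:2] != SOI:
        raise ValueError("not a JPEG: missing start-of-image marker")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        marker = jpeg_bytes[pos + 1]
        # Stop on lost sync, end of image (D9), or start of scan data (DA).
        if jpeg_bytes[pos] != 0xFF or marker in (0xD9, 0xDA):
            break
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        end = pos + 2 + length  # the length field counts its own two bytes
        yield marker, pos, end
        pos = end


def _is_exif(jpeg_bytes: bytes, marker: int, start: int) -> bool:
    payload_start = start + 4
    return (marker == 0xE1 and
            jpeg_bytes[payload_start:payload_start + 6] == EXIF_HEADER)


def has_exif(jpeg_bytes: bytes) -> bool:
    """True if any APP1 segment carries an EXIF payload."""
    return any(_is_exif(jpeg_bytes, marker, start)
               for marker, start, _ in _segments(jpeg_bytes))


def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Return a copy of the file with EXIF APP1 segments left out."""
    out = bytearray(SOI)
    pos = 2
    for marker, start, end in _segments(jpeg_bytes):
        if not _is_exif(jpeg_bytes, marker, start):
            out += jpeg_bytes[start:end]
        pos = end
    out += jpeg_bytes[pos:]  # image data and anything after it, untouched
    return bytes(out)


if __name__ == "__main__":
    # A hand-built stand-in for a photo: SOI, one EXIF segment, fake scan data.
    fake = (SOI + b"\xff\xe1\x00\x0c" + EXIF_HEADER + b"fake"
            + b"\xff\xda\x00\x02\x12\x34\xff\xd9")
    print(has_exif(fake), has_exif(strip_exif(fake)))
```

Dropping the whole segment removes the GPS coordinates, timestamps, and serial number in one go without touching the picture itself.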
The information age is great, because of all of the information available to us, and at the same time is dangerous for the very same reasons that it is great.
For more information about metadata and metadata scrubbing utilities, check out this web site: http://www.addbalance.com/usersguide/metadata.htm


Comments: 25
(I hate PDF files myself)
For years I have taken a Word document, ported it to plain text (Notepad), then put it back in a new Word document. Nothing comes across because it is a virgin document.
You all might wish to check out this web site, for more information, and some reviews of metadata-scrubbing utilities. I'm going to edit the article to include this, as well:
http://www.addbalance.com/usersguide/metadata.htm
Thank you for enlightening me - although I was aware of this to a vague degree, this has crystallized the process for me.