
The problem with creation date metadata in PDF documents

Last night Rachel Maddow talked about an apparently fake NSA document "leaked" to her organization.  There's a lot of info there, so I suggest you listen to the whole thing:

http://www.msnbc.com/rachel-maddow/watch/maddow-to-news-orgs-heads-up-for-hoaxes-985491523709

There's a lot to unpack, but it looks like somebody tried to fool MSNBC into running a fake accusation backed by faked NSA documents, apparently made by cloning the document the Intercept published back on 6/5/2017, which to all appearances was itself a real NSA document in PDF form.

I think the main thrust of this story is chilling and really important to get straight -- some person or persons unknown are sending forged PDFs to one or more news organizations, apparently trying to get them to run stories based on forged documents.  And I completely agree with Maddow that she was right to send up a "signal flare" to all the news organizations to look out for forgeries.  Really, really, really important stuff.

This post, though, is going to talk about a detail that Maddow may have gotten wrong, why it may be wrong, and how this bears on the possibility that the Intercept was somehow involved vs. any of the millions of people who downloaded the Intercept's published PDF file.

First, let's start with the assumption that the PDF Maddow has is a cloned-and-modified copy of https://assets.documentcloud.org/documents/3766950/NSA-Report-on-Russia-Spearphishing.pdf which is what the Intercept published.

Maddow looked at a bunch of things including the data and metadata of the document.  One of the key pieces of metadata was the "creation timestamp" of the PDF file.  To be clear, this is just a sequence of bytes in a file and could easily be faked if anybody cared to fake it, something that Maddow made clear too.  But if you assume that (A) the document is a clone-with-modifications of the Intercept's PDF and (B) the "creation timestamp" embedded in the PDF wasn't faked, there appeared to be an interesting factoid:  The "creation timestamp" Maddow reported for her PDF is 3 hours before the Intercept actually published its PDF.  Of course, the Intercept's own PDF would necessarily have been created before it was put up on the web server, and a 3 hour gap doesn't seem unusual.

But the Intercept took umbrage at the suggestion that this was suspicious, saying:
If you look at the time stamp on the metadata on the document that The Intercept published, it reads “June 5, 12:17:15 p.m.” — exactly the same time and date, to the second, as the one on the document received by Maddow:
And they include a screenshot of the output of "exiftool" which indeed reports (in human readable form) a "Create Date" of "2017:06:05 12:17:15" (with no timezone).

The Intercept then goes on to add:
It’s also possible that simple time zones explain the discrepancy: that whoever forged the document was in a time zone several hours behind East Coast time, and June 5, 12:17 p.m., in that time zone is after The Intercept’s publication, not before.
(The time zone theory doesn't make a lot of sense, because it implies that somebody created a totally new PDF document in some other time zone after publication, but just happened to make the minutes and seconds match exactly the ones in the original creation timestamp.  But at least this is something that's actually testable on a technical level.)

And with this statement, I jump into the fray.  I'm a software engineer and have had to deal with this kind of technical ambiguity in timestamps way too many times, and there might in fact be a way to answer at least this one small question with no ambiguity at all.

It is possible for a PDF file to contain timezone information (see  http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf, page 160).  But sometimes software is stupid, and doesn't record what timezone it's talking about, which is horrible because it leads to confusion.  Did the Intercept's file contain a timezone?  

So I looked at the actual bytes of the Intercept's PDF.  No, it doesn't include timezone info.

My local tool claims the timestamp is in PDT anyway; this is a lie.


Most likely, somebody downloaded the PDF file from the Intercept after publication and modified it, leaving the original creation timestamp alone.  The way to tell whether the two PDFs have the same or a different timestamp is to open up each one in a binary editor, search for "/CreationDate", and compare the strings byte for byte, because timestamp formats are horrible and you can't trust the tools to get it 100% right.

So here's what you do, Rachel Maddow:  Open both PDFs in a binary editor ("vi" works on a Mac).  Search for the string "CreationDate".  See if the "D:##########" string matches in each of them.  If it matches, the files have the same creation timestamp, for whatever that is worth.
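If you'd rather not eyeball it in an editor, the same byte-for-byte check can be sketched in a few lines of Python.  This is only a rough sketch under some assumptions: the function names are made up for illustration, and it only finds dates stored as plain "(D:...)" literal strings in uncompressed parts of the file.  A date tucked into a compressed object stream, written as a hex string, or present only in XMP metadata would be missed, so an empty result means "not found this way", not "not there".

```python
import re

# Matches "/CreationDate (D:...)" where the value is a plain literal string.
# Assumption: the date is stored uncompressed and not as a hex string.
_CREATION_DATE = re.compile(rb"/CreationDate\s*\(([^)]*)\)")


def creation_dates(pdf_bytes):
    """Return every raw /CreationDate value found, as undecoded bytes."""
    return _CREATION_DATE.findall(pdf_bytes)


def same_creation_date(pdf_a, pdf_b):
    """True only if both inputs contain dates and the raw bytes match exactly."""
    dates_a = creation_dates(pdf_a)
    dates_b = creation_dates(pdf_b)
    # No parsing, no timezone guessing: just compare the bytes as written.
    return bool(dates_a) and dates_a == dates_b
```

To use it, read each file in binary mode (open(path, "rb").read()) and pass the two byte strings to same_creation_date.  Printing the output of creation_dates for each file shows you exactly what was written, with nothing added by a viewer.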

More broadly, everybody writing software: Just Say No to writing ambiguous timestamps!  And if you read one, DO NOT just slap the local timezone on the end like my local properties viewer does.  And if it's really, really important, check the bytes by hand.
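For the "if you read one" half of that advice, here's a minimal sketch of a reader that refuses to guess.  It follows the date string layout from the PDF reference linked above (D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset, with everything after the year optional).  The function name is hypothetical, and the sample strings in the usage note are illustrative inputs, not bytes copied from either document.  The point is the last branch: when the file doesn't state a timezone, the result stays timezone-less instead of quietly becoming local time.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] then an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)(?:00'?(?:00'?)?)?|([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(text):
    """Parse a PDF date string; tzinfo is None when the file gives no zone."""
    m = _PDF_DATE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    # Missing fields fall back to the defaults the PDF reference specifies.
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(v) if v is not None else d for v, d in zip(m.groups()[:6], defaults)
    )
    zulu, sign, tz_hours, tz_minutes = m.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = None  # Ambiguous on purpose: do NOT substitute the local zone.
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)
```

With this, parse_pdf_date("D:20170605121715") gives a datetime whose tzinfo is None, while parse_pdf_date("D:20170605121715-04'00'") carries an explicit offset.  Code that displays the first kind should say "timezone unknown" rather than inventing one.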

