Discovery Metadata is Just Data

Another comment on Eran's discovery mechanisms list (which is itself great). I'm gradually reaching the conclusion that he's right that content negotiation isn't the best idea, but for the wrong reasons. The question is, are the alternatives any better?

Quoth Eran:

HTTP Content Negotiation - using the 'Accept' request header, the consumer informs the server it is interested in the metadata and not the resource itself, to which the server responds with the metadata document or its location. In XRDS, the consumer sends an HTTP GET (or HEAD) request to the resource URL with an 'Accept' header and content-type 'application/xrds+xml'. This informs the server of the consumer's discovery interest, which in turn may reply with the discovery document itself, redirect to it, or return its location via the 'X-XRDS-Location' (or 'Link') response header.

[-] Resource Declaration - does not address as it focuses on the consumer declaring its intentions.
[+] Direct Metadata Access - provides a simple method for directly requesting the metadata document.
[-] Web Compliant - while some argue that the metadata can be considered another representation of the resource, it is very much external to it. Using the 'Accept' header to request a separate resource (as opposed to a different representation of the same resource) violates the HTTP protocol. It also prevents using the discovery content-type as a valid (self-standing) web resource having its own metadata.
[-] Scale Agnostic - requires access to HTTP request and response headers, as well as the registration of multiple handlers for the same resource URL based on the 'Accept' header. In addition, improper use or implementation of the 'Vary' header in conjunction with the 'Accept' header will cause proxies to serve the metadata document instead of the resource itself - a great concern to large providers with frequently visited front-pages.
[-] Extendable - limited to a single content-type for metadata, and does not allow any existing schemas (with well known content-type).

Minimum roundtrips to retrieve metadata: 1
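To make the quoted flow concrete, here's a minimal consumer-side sketch in Python (standard library only; the header names come straight from Eran's description, and error handling is omitted):

import urllib.request

XRDS_TYPE = "application/xrds+xml"

def discover(resource_url):
    # Ask for the metadata content type rather than the resource itself.
    req = urllib.request.Request(resource_url, headers={"Accept": XRDS_TYPE})
    with urllib.request.urlopen(req) as resp:  # urlopen follows redirects
        if resp.headers.get("Content-Type", "").startswith(XRDS_TYPE):
            return resp.read()  # server served the metadata directly
        location = resp.headers.get("X-XRDS-Location")
    if location:
        # Server pointed at the metadata's location; costs a second round trip.
        with urllib.request.urlopen(location) as follow:
            return follow.read()
    return None  # no discovery information offered

The happy path really is a single round trip; only the X-XRDS-Location fallback adds a second.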

All of the points above are addressable with minor tweaks, turning this into a usable BigCo-scale solution. Specifically, I'd argue that it's perfectly web compliant to regard a resource's 'metadata' as a variant representation of the resource itself. As an example, consider an image resource that can be requested in several variants: image/gif, image/jpeg, or application/image-meta+xml. The last format gives you the EXIF metadata about the image, but in a more convenient XML form. Format A gives you image bits; format B gives you image bits plus embedded metadata; format C gives you just the metadata. What counts as data and what counts as metadata just depends on your point of view.
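Server-side, the same idea is a few lines of content negotiation. Here's a rough WSGI sketch (the loader functions are hypothetical stand-ins for whatever storage layer you have, and the Accept check is deliberately naive):

META_TYPE = "application/image-meta+xml"

def image_app(environ, start_response):
    # Serve the metadata as just another negotiated representation.
    accept = environ.get("HTTP_ACCEPT", "")
    if META_TYPE in accept:  # naive check; real code would parse q-values
        body, ctype = load_image_meta_xml(), META_TYPE  # hypothetical, returns bytes
    else:
        body, ctype = load_image_bytes(), "image/jpeg"  # hypothetical, returns bytes
    start_response("200 OK", [
        ("Content-Type", ctype),
        ("Vary", "Accept"),  # tell caches the body depends on Accept
    ])
    return [body]

That Vary: Accept header is exactly the piece that, implemented badly, produces the proxy problem below.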

The other argument against this is that buggy proxy caches may cache the wrong representation of, say, http://yahoo.com. This is something of a strawman, in that the offending cache would have to sit between an RP and an OP. In any case, returning an uncacheable redirect (303?) to a metadata resource would avoid problems in practice.
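A sketch of that redirect variant, under the same WSGI assumptions as above (the metadata path and cache directive are illustrative):

def redirect_to_metadata(environ, start_response):
    # Never serve metadata bytes from the resource's own URL; send the
    # consumer elsewhere with a response no cache is allowed to keep.
    start_response("303 See Other", [
        ("Location", "/image-metadata"),  # hypothetical metadata URL
        ("Cache-Control", "no-store"),    # keep proxies out of the loop
    ])
    return [b""]

This trades the single round trip for safety: even a confused cache can never end up holding metadata XML where the front page's bits should be.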

All of this said, configuring something like http://yahoo.com/ (or, ahem, http://google.com/) to do content negotiation to enable discovery is a tough sell. Whatever technology is used to serve (and cache...) that page needs to be reconfigured to do the Right Thing with regard to content negotiation, with a big downside if something goes wrong and, so far, only a small upside if things go well. Not a great sales pitch.

So I think Eran is right that this isn't a great solution; not because of web design purity, but because of practical deployment issues. If there's a good alternative we should look at it and weigh the pros and cons, which is what I plan to do in the next post.
