Archive for February, 2007

PBCore for publishing, sharing, and preservation

Wednesday, February 28th, 2007

(I’m moving this from the comment section of John Proffitt’s post “RSS a good start, but a federated PBCore-based metadata archive would be better” at his suggestion. Comments are perhaps getting buried, but please do see that thread for more context and great points by all participants. Of course I edited this since I can’t leave anything alone…)

John’s and Dale’s ideas here about using PBCore are excellent, and this is a great place to discuss shaping new practices with media and metadata. I do think XML is the key to unlock access to content, and to expose metadata in whatever flavor and variety is wanted for particular purposes. So an RSS 2.0 feed is good for one purpose, and a PBCore XML record works for something a little more full-blown. In the latter case, we could use PBCore records to connect the dots in a federated collection of public media content at a highly granular level, by developing applications to parse, sift, search, and serve the data. It could look like one collection, but the content could be anywhere. This model is becoming more common in the library world where an XML protocol like OAI-PMH is used. (See http://www.openarchives.org/ )
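To make the "one collection, content anywhere" idea a little more concrete, here is a minimal Python sketch that parses PBCore-style XML from two stations and searches the result as if it were a single collection. The element names and namespace URI follow later published versions of the PBCore schema as I understand them, and the station names and records are invented, so treat all of it as an assumption rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

# Assumed PBCore namespace URI; verify against the schema version you use.
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
NS = {"pb": PBCORE_NS}

# Hypothetical records, standing in for XML fetched from two station URLs.
STATION_XML = {
    "station-a": """
<pbcoreCollection xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
  <pbcoreDescriptionDocument>
    <pbcoreIdentifier source="station-a">ep-101</pbcoreIdentifier>
    <pbcoreTitle>Prairie restoration in central Illinois</pbcoreTitle>
    <pbcoreDescription>A segment on restoring tallgrass prairie.</pbcoreDescription>
  </pbcoreDescriptionDocument>
  <pbcoreDescriptionDocument>
    <pbcoreIdentifier source="station-a">ep-102</pbcoreIdentifier>
    <pbcoreTitle>River towns</pbcoreTitle>
    <pbcoreDescription>Life along the Illinois River.</pbcoreDescription>
  </pbcoreDescriptionDocument>
</pbcoreCollection>""",
    "station-b": """
<pbcoreDescriptionDocument xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
  <pbcoreIdentifier source="station-b">seg-7</pbcoreIdentifier>
  <pbcoreTitle>Flooding on the river</pbcoreTitle>
  <pbcoreDescription>Spring flood coverage.</pbcoreDescription>
</pbcoreDescriptionDocument>""",
}


def parse_records(station, xml_text):
    """Turn one station's PBCore XML into a list of plain dictionaries."""
    root = ET.fromstring(xml_text.strip())
    records = []
    # iter() also yields the root itself, so a single document works too.
    for doc in root.iter("{%s}pbcoreDescriptionDocument" % PBCORE_NS):
        records.append({
            "station": station,
            "id": doc.findtext("pb:pbcoreIdentifier", default="", namespaces=NS),
            "title": doc.findtext("pb:pbcoreTitle", default="", namespaces=NS),
            "description": doc.findtext("pb:pbcoreDescription", default="", namespaces=NS),
        })
    return records


def build_index(station_xml):
    """Merge every station's records into one searchable list."""
    index = []
    for station, xml_text in station_xml.items():
        index.extend(parse_records(station, xml_text))
    return index


def search(index, term):
    """Case-insensitive substring match over title and description."""
    term = term.lower()
    return [r for r in index
            if term in (r["title"] + " " + r["description"]).lower()]


index = build_index(STATION_XML)
hits = search(index, "river")
for hit in hits:
    print(hit["station"], hit["id"], hit["title"])
```

The media itself never moves in this sketch. Only the metadata is gathered, which is the point of the federated model.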

With this in mind I recently developed templates in my content management system to output various XML formats, including RSS, Atom, PBCore, and Dublin Core. You might have seen me fumble thru a demo of this at the IMA Tech Session Show and Tell. You can see the beta version of this at http://will.atlas.uiuc.edu/index.php/prairiefire/ . Scroll down to find the Syndication menu in the left nav. The PBCore link will generate PBCore XML for the latest 10 episodes of this show we produce called Prairie Fire. When you are on one of the Episode content pages, the PBCore URL reflects just that episode. Same for Segment PBCore URLs. The URL calls the template to display the specific record or set of records, so it becomes the key to everything.
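The idea that the URL selects which records the template renders can be sketched in a few lines of Python. This is not the actual CMS code: the URL scheme (`/pbcore` and `/pbcore/<id>`), the function names, the episode data, and the PBCore element names are all assumptions made for illustration, and the output is not validated against the PBCore schema.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"  # assumed URI
ET.register_namespace("", PBCORE_NS)  # serialize PBCore as the default namespace

# Hypothetical CMS content: twelve episodes, one per day.
EPISODES = [
    {"id": "ep-%d" % n, "title": "Episode %d" % n, "date": "2007-01-%02d" % n}
    for n in range(1, 13)
]


def records_for_path(path):
    """Map a URL path to the records the template should render."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or parts[0] != "pbcore":
        raise ValueError("not a PBCore URL: %r" % path)
    if len(parts) == 1:
        # Show-level URL: the ten most recent episodes.
        newest_first = sorted(EPISODES, key=lambda r: r["date"], reverse=True)
        return newest_first[:10]
    # Episode-level URL: just the record whose id matches.
    return [r for r in EPISODES if r["id"] == parts[1]]


def render_pbcore(records):
    """Render records as a PBCore-style collection (not schema-validated)."""
    root = ET.Element("{%s}pbcoreCollection" % PBCORE_NS)
    for rec in records:
        doc = ET.SubElement(root, "{%s}pbcoreDescriptionDocument" % PBCORE_NS)
        ident = ET.SubElement(doc, "{%s}pbcoreIdentifier" % PBCORE_NS, source="cms")
        ident.text = rec["id"]
        ET.SubElement(doc, "{%s}pbcoreTitle" % PBCORE_NS).text = rec["title"]
    return ET.tostring(root, encoding="unicode")


print(render_pbcore(records_for_path("/pbcore/ep-3")))
```

Because the same renderer serves both URLs, adding another output format such as Dublin Core would only need another render function, not another way of selecting records.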

To what end? Right now, it’s just a demonstration or proof of concept. Eventually this could be used by Content Depot or NGIS to suck in metadata and media objects for system-wide syndication. (You know, as in Syndication.) In this case, the primary media item would be a broadcast-quality file, not a streaming archive. Then you’d also have a reference to the streaming archive as part of the PBCore record, along with other versions and related assets like a thumbnail image, etc. But I’m not sure PBCore is the right format to wrap up related media assets, so we could use standards like MODS or METS which can include PBCore records as nested elements. In fact, when people begin using our media we’ll want to harvest tags and trackbacks, which add valuable metadata to the existing record. So we’ll want a way to encode this metadata and allow the total package to evolve. PBCore can be the item-level metadata format, but all related items might best be encoded in something else. Then everything can live and breathe as an item, a collection of related items, and a collection of collections. (Am I getting too meta here?) I’m suggesting that this method leads to media objects that harness collective intelligence, with metadata records that evolve with use. Our technical systems should allow for preservation of this metadata along with the media object at its core.
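As a rough illustration of nesting a PBCore record inside a wrapper format, the Python sketch below builds a small METS-style package: the PBCore record sits in a descriptive metadata section, and the broadcast master, streaming version, and thumbnail are listed as files. The file URLs, identifiers, and PBCore element names are hypothetical, and the sketch leaves out parts a valid METS document needs (notably the structural map), so it shows the shape of the idea only.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"  # assumed URI

ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("pbcore", PBCORE_NS)


def q(ns, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (ns, tag)


def wrap_in_mets(pbcore_doc, assets):
    """Nest one PBCore record and its related media files in a METS element."""
    mets = ET.Element(q(METS_NS, "mets"))
    # Descriptive metadata section holding the item-level PBCore record.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), ID="dmd-pbcore")
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"),
                         MDTYPE="OTHER", OTHERMDTYPE="PBCORE")
    ET.SubElement(wrap, q(METS_NS, "xmlData")).append(pbcore_doc)
    # File section: one group per use (master, streaming, thumbnail).
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    for n, asset in enumerate(assets, start=1):
        group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), USE=asset["use"])
        entry = ET.SubElement(group, q(METS_NS, "file"),
                              ID="file-%d" % n, MIMETYPE=asset["mime"])
        ET.SubElement(entry, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): asset["url"]})
    return mets


# Hypothetical item-level record and related assets.
pbcore_doc = ET.Element(q(PBCORE_NS, "pbcoreDescriptionDocument"))
ET.SubElement(pbcore_doc, q(PBCORE_NS, "pbcoreIdentifier"), source="cms").text = "ep-101"
ET.SubElement(pbcore_doc, q(PBCORE_NS, "pbcoreTitle")).text = "Example episode"

ASSETS = [
    {"use": "master", "mime": "audio/x-wav",
     "url": "https://example.org/media/ep-101-master.wav"},
    {"use": "streaming", "mime": "audio/mpeg",
     "url": "https://example.org/media/ep-101-stream.mp3"},
    {"use": "thumbnail", "mime": "image/jpeg",
     "url": "https://example.org/media/ep-101-thumb.jpg"},
]

package = wrap_in_mets(pbcore_doc, ASSETS)
print(ET.tostring(package, encoding="unicode"))
```

Harvested tags and trackbacks could be added later as further metadata sections in the same package, which is one way the record could evolve without touching the PBCore item itself.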

So what to do next? I’m going to finish building out my little CMS implementation and see where it leads. There are zero actual PBCore applications that can use this stuff, far as I know. But this is really easy to do, and it might lead to some other easy ideas…which I think are often the best kind!

RSS a good start, but a federated PBCore-based metadata archive would be better

Tuesday, February 27th, 2007

I’d like to echo Dale’s posting, and expand upon it just a bit more.

First off, I agree that the political hurdles to implementing a standardized and centralized media back-end for the public media world are daunting. Further, I think what we see as “public media” is going to shift around rapidly in the next couple of years, so determining who is “allowed” into the fold will become increasingly difficult (e.g. can a library join, or do you have to be a broadcaster with an active high-power FM or TV license?). There are other challenges as well, but let’s leave that issue alone for the moment. Back to the tech…

I think a centralized storage system is probably a bad idea, or at least one that would be difficult to achieve for all kinds of reasons. It’s also unnecessary. Why does everything have to be stored together, under one roof? The storage can be anywhere. It’s the live, searchable content index that would be most useful to the public, to other stations, to search engines and more. Let’s just remember that storage and indexing do not have to occur at the same place.

Now, about RSS. I think RSS is a great syndication system for short-form and linked media for recently published items. But RSS strikes me as insufficient as a deep-catalog syndication system. For example, how would I syndicate, using RSS, a catalog of 50,000 items or 100,000 items, in which the items are drawn from a variety of subjects and media formats and sources, each with various rights and authors associated with them? Theoretically, RSS could do this, as it’s just a string of XML. However, RSS 2.0 in its baseline configuration doesn’t carry all the data a centralized search system would need. Sure you can extend RSS with your own additional XML tags (just look at iTunes), but it still sounds a little silly to me to do it that way.
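For readers who have not seen what "extending RSS with additional XML tags" looks like, here is a small Python sketch that adds Dublin Core `creator` and `rights` elements to an RSS 2.0 item through a namespace. The feed content and URLs are made up. Note what it illustrates: the extra fields only help consumers that know to look for that namespace, which is part of why this approach feels awkward for a deep catalog.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace, used here as the RSS extension.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_item(title, link, creator, rights):
    """Build an RSS <item> with two extra, namespaced Dublin Core fields."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "{%s}creator" % DC_NS).text = creator
    ET.SubElement(item, "{%s}rights" % DC_NS).text = rights
    return item


# Hypothetical channel with a single catalog item.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example station catalog"
ET.SubElement(channel, "link").text = "https://example.org/"
ET.SubElement(channel, "description").text = "Catalog items with rights metadata"
channel.append(build_item(
    "Example documentary",
    "https://example.org/items/42",
    "Example Producer",
    "Broadcast and streaming, US only",
))

feed_xml = ET.tostring(rss, encoding="unicode")
print(feed_xml)

# A namespace-aware consumer reads the extension; a baseline one sees only title/link.
first_item = ET.fromstring(feed_xml).find("channel/item")
item_title = first_item.findtext("title")
item_rights = first_item.findtext("{%s}rights" % DC_NS)
```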

What I would propose is the establishment of a standard metadata description and storage pointer language, based on the PBCore schema (which is pretty complete already). Each public media entity would then expose its metadata index and its digital media archive to the public, to other stations, and to a centralized repository that would periodically accept updates from the edge storage and indexing systems. Access to the data could be tiered as desired, exposing only those items you wish to expose to various users or partners.
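A minimal Python sketch of that arrangement might look like the following: each edge repository keeps its own records (including a pointer to where the media lives), and a central index periodically pulls only what has changed since its last visit, then filters search results by access tier. The class names, tier names, record fields, and URLs are all hypothetical, and a real system would harvest over the network rather than in memory.

```python
from datetime import datetime

# Access tiers from least to most privileged; the names are hypothetical.
TIERS = {"public": 0, "partner": 1, "internal": 2}


class EdgeRepository:
    """A station's own metadata store; the media stays at the station."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def changed_since(self, when):
        """Return records modified after `when` (all records if None)."""
        return [r for r in self.records if when is None or r["modified"] > when]


class CentralIndex:
    """Holds metadata and storage pointers only, never the media files."""

    def __init__(self):
        self.entries = {}       # (repository name, record id) -> record
        self.last_harvest = {}  # repository name -> newest timestamp seen

    def harvest(self, repo):
        """Pull changed records from one repository; return how many arrived."""
        changed = repo.changed_since(self.last_harvest.get(repo.name))
        for rec in changed:
            self.entries[(repo.name, rec["id"])] = rec
        if changed:
            self.last_harvest[repo.name] = max(r["modified"] for r in changed)
        return len(changed)

    def search(self, term, tier="public"):
        """Return keys of matching records visible at the given tier."""
        level = TIERS[tier]
        return sorted(
            key for key, rec in self.entries.items()
            if term.lower() in rec["title"].lower()
            and TIERS[rec["access"]] <= level
        )


station = EdgeRepository("station-a", [
    {"id": "a1", "title": "Prairie Fire episode 12", "access": "public",
     "modified": datetime(2007, 2, 1), "url": "https://example.org/a1.mp3"},
    {"id": "a2", "title": "Prairie Fire raw interview", "access": "partner",
     "modified": datetime(2007, 2, 10), "url": "https://example.org/a2.wav"},
])

central = CentralIndex()
first_pass = central.harvest(station)   # everything is new
second_pass = central.harvest(station)  # nothing has changed

station.records.append(
    {"id": "a3", "title": "Prairie Fire episode 13", "access": "public",
     "modified": datetime(2007, 2, 20), "url": "https://example.org/a3.mp3"})
third_pass = central.harvest(station)   # only the new record arrives

print(central.search("prairie"))
print(central.search("prairie", tier="partner"))
```

The "changed since" request is the same basic idea that harvesting protocols in the library world use, so a production version would more likely adopt one of those than invent its own.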

Using this metadata standard would allow the proposed central index to gather information from repositories both inside and outside the public media world.

In this way, we have the local control required (for whatever reasons) over media assets, yet the central searchability of our content is not impaired. Local entities would be required to meet certain metadata standards (and tests) before being accepted into the central indexing system. And getting into the system would be a high priority for any media companies wanting to be “found” online, especially in areas beyond the reach of any legacy transmitters.
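The "standards and tests" step could start out very simply. The Python sketch below checks each submitted record against a baseline before admitting it to the index. The required fields and the URL rule are invented for illustration; the real baseline would be whatever the participants agree on.

```python
# Hypothetical baseline: fields every record must carry to be indexed.
REQUIRED_FIELDS = ("id", "title", "rights", "media_url")


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append("missing %s" % field)
    url = record.get("media_url") or ""
    if url and not url.startswith(("http://", "https://")):
        problems.append("media_url is not an http(s) URL")
    return problems


def admit(records):
    """Split submitted records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record.get("id"), problems))
        else:
            accepted.append(record)
    return accepted, rejected


submitted = [
    {"id": "a1", "title": "Example episode", "rights": "All rights reserved",
     "media_url": "https://example.org/a1.mp3"},
    {"id": "a2", "title": "", "rights": "All rights reserved",
     "media_url": "ftp://example.org/a2.mp3"},
]

accepted, rejected = admit(submitted)
print(len(accepted), "accepted")
for record_id, problems in rejected:
    print(record_id, "rejected:", "; ".join(problems))
```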

The big plus is that while there would have to be an entity building and maintaining the indexing service, the various players would only have to meet a baseline standard protocol, mostly eliminating the politics. Yes, fights break out at the IEEE from time to time, but in the end, they do reach broadly interoperable standards.

Or… and here’s a subversive bit… do we just implement the metadata standard and then call up Google and tell them how and where to index all our content?

Thoughts from back home: centralize output, not input

Monday, February 26th, 2007

Great conference as usual, and more to digest than can be done at one sitting. I’d like to chime in on the metaconversation that has run through the last three conferences on the possibility and desirability of a unified platform for public broadcasting. I am one of those who feel that such a move is both impossible and undesirable.

Impossible for a number of reasons; here are a few:

  1. Too broad a variety of players with too diverse a set of needs and too great a disparity of resources and capabilities.
  2. Too decentralized an organizational structure for there to be a central authority to mandate and enforce adoption of a non-binding resolution, let alone a content management system and accompanying business model.
  3. If one could reach agreement on a unified approach, that process, plus the infrastructure development and deployment process would likely take so long that the final result would be already obsolete on rollout.

And it would be undesirable for a number of reasons; here are a few:

  1. It would build inflexibility into the system: the platform would have to be renegotiated and reinvented every time a technological surprise comes along.
  2. It would enforce a least common denominator set of features.
  3. It would discourage the lively development of new media literacy and expertise at the station and producer levels.
  4. Where unified platforms exist, as at cbc.ca, the region level (or station-level in the US model) almost disappears. I recall from 2 years ago that CBC presented numbers showing that 95% of traffic went to the national level, while the 16 region sites split the remaining 5%. While this may be less important in a centrally funded service, it would be a killer in a system like ours where most revenue in the system is derived from the station.org level.

The question then becomes, how do we gain for the system the benefits that could come from a centralized platform, without actually having to build one? I propose that we take the focus off how stations and other entities get content into their websites, and put the focus on how to get content out of their websites, i.e. syndication.

I think this is a more fruitful approach because it bypasses the steep hurdles presented by organizational politics, and also because it would be a necessary process even if we were to create a common platform. The obvious candidate is RSS syndication, since it is already widely understood and adopted. Most of the platforms used for citizen journalism and UGC already have some RSS capability and provide features that can put organizations of very limited technical capability into the game. A wide variety of the content management systems developed or adopted by stations for site management can already use RSS, or could be tweaked to produce RSS for small investments.

Once at the point where all (or at least most) stations can export their content in a common form, exploiting, aggregating and monetizing the result becomes a task divorced from how the content was created. It seems that this is a place where a more centralized approach becomes both practical and desirable.
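As a sketch of how little the aggregating side needs to know about each station's platform, the Python below merges items from two RSS 2.0 feeds into one list sorted newest first. The feeds are inline strings with invented station names and URLs; a real aggregator would fetch them over HTTP and would need to cope with missing or malformed dates.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feeds, standing in for XML fetched from each station's site.
FEEDS = {
    "station-a": """<rss version="2.0"><channel><title>Station A</title>
<item><title>Morning news</title><link>https://a.example.org/1</link>
<pubDate>Mon, 26 Feb 2007 08:00:00 GMT</pubDate></item>
<item><title>Evening call-in</title><link>https://a.example.org/2</link>
<pubDate>Tue, 27 Feb 2007 18:00:00 GMT</pubDate></item>
</channel></rss>""",
    "station-b": """<rss version="2.0"><channel><title>Station B</title>
<item><title>Prairie Fire segment</title><link>https://b.example.org/9</link>
<pubDate>Tue, 27 Feb 2007 09:30:00 -0600</pubDate></item>
</channel></rss>""",
}


def read_items(station, xml_text):
    """Extract title, link, and publication time from one RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "station": station,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 format, which email.utils parses.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def aggregate(feeds):
    """Merge all stations' items into one list, newest first."""
    merged = []
    for station, xml_text in feeds.items():
        merged.extend(read_items(station, xml_text))
    return sorted(merged, key=lambda i: i["published"], reverse=True)


merged = aggregate(FEEDS)
for entry in merged:
    print(entry["published"].isoformat(), entry["station"], entry["title"])
```

Nothing in the aggregator depends on which content management system produced each feed, so branding, underwriting, and reporting could be layered on at this level.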

As a model, I point to the NPR podcast project, which brings in content from a truly wide variety of sources with varying capabilities and bundles it with national branding, national underwriting, traffic reporting, and revenue sharing. It can still accommodate a regional underwriter segment, and it doesn’t prevent producers from also distributing the same core content via their own station.org or program.org addresses. With a limited investment in export standards, a similar portal-style approach could be applied to the whole of the system’s output without having to pry stations’ longstanding approaches to the web from their cold, dead hands.