Adrian (one of the Django dudes) has an article on XML.com entitled ‘Dynamic News Stories’. It’s a neat idea: slapping a little additional lightweight XML into a stream of data to hopefully mark up freetext with something a little more contextual. Actually – right in line with the original concept of hypertext.
And I would have posted this there – if their system didn’t send me into an infinite redirect loop when I clicked on “Comment on this Article” on page 2.
The problem isn’t with the scheme to store metadata – or even with making systems that read that data. There are lots of those, and while it’s not “trivial” – I wouldn’t call it hard. What is hard, and will remain hard, is getting us monkeys to add the metadata. When we get value out of it, we’ll do it. If we don’t get some value immediately… forget it. After working with several companies that live by metadata, I’m convinced – it’s just human nature. And I find myself arguing this point with lots of people who say things like “If they’d just enter the metadata…”
Forget it – they won’t. That’s why we need digital cameras to embed the long/lat data INTO the JPEG, to make location information a useful part of that whole “where picture” data concept. Cell phones, ironically, have about the best chance of making this happen quickly – with their shitty little pictures. But a shitty little picture with location data – that’s worth something a bit more.
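For what it’s worth, the GPS info a camera writes lands in the JPEG’s EXIF block as degree/minute/second rationals plus a hemisphere reference; turning that into a usable decimal coordinate is a few lines. A minimal sketch – the function name and the sample coordinate are mine, not from any particular camera:

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds plus a
    hemisphere ref ('N'/'S'/'E'/'W') to a signed decimal degree."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West hemispheres are negative by convention
    return -value if ref in ("S", "W") else value

# e.g. a GPS tag of 38° 54' 17.0" N comes out to roughly 38.9047
lat = dms_to_decimal(38, 54, 17.0, "N")
```

That’s the whole trick – once the camera has stuffed those rationals into the file, any downstream system can recover a real coordinate without a human typing anything.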
So why would someone, unless they were paid or forced at gunpoint, put profanity XML markup around that previous sentence? I didn’t – even though it would have been pretty easy. And that’s why probabilistic systems that derive metadata automagically from free-text are so very, very powerful. They make up for our laziness. I bet there’s a good Bayesian filter somewhere, all trained up, that could recognize profanity and mark it up with all the neat levels that Adrian would like to see. And I think that’s about the only way we’re going to see it.
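A toy version of that kind of filter is easy to sketch – a naive Bayes classifier over word counts, add-one smoothing and all. Everything below (the training sentences, the “profane”/“clean” labels) is made up purely for illustration; a real filter would need a real corpus:

```python
import math
from collections import Counter

class TinyBayes:
    """Naive Bayes text classifier with add-one smoothing."""
    def __init__(self):
        self.words = {}        # label -> Counter of word frequencies
        self.docs = Counter()  # label -> number of training docs

    def train(self, text, label):
        self.docs[label] += 1
        self.words.setdefault(label, Counter()).update(text.lower().split())

    def classify(self, text):
        tokens = text.lower().split()
        vocab = len({w for c in self.words.values() for w in c})
        total_docs = sum(self.docs.values())
        def score(label):
            # log prior + smoothed log likelihood of each token
            n = sum(self.words[label].values())
            s = math.log(self.docs[label] / total_docs)
            for w in tokens:
                s += math.log((self.words[label][w] + 1) / (n + vocab))
            return s
        return max(self.words, key=score)

# hypothetical training data, just enough to make the point
nb = TinyBayes()
nb.train("that shitty little picture", "profane")
nb.train("damn phones are shitty", "profane")
nb.train("a lovely picture of the city", "clean")
nb.train("the phone takes nice pictures", "clean")

print(nb.classify("what a shitty phone"))  # → profane (on this toy corpus)
```

Four training sentences is obviously a joke of a corpus, but it’s the same machinery the spam filters use – and nobody had to hand-tag a single word of the text being classified.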
And by the way – I don’t think Adrian is stupid or anything… in fact, I think he rocks. At least due to Django and because he’s from my Alma Mater. Gus even spoke highly of him – but Django does all the talkin’ he needs to do.
Yeah, I agree that metadata is expensive, and I’ve encountered a fair amount of resistance to it by human data-entry folks at the various places where I’ve worked. But, considering newspapers employ full-time staff members (copy editors) whose job it is to edit stories and fix grammar/style/factual errors, I don’t think it’s unreasonable for the extra metadata insertion to happen at that point.
Of course, it’d be ideal to have a tool that identified all possible profanities, dates, cities, etc., to which copy editors would give final approval. I’m definitely with you on the importance of automation. My point with the article was to introduce the ideas, not worry about the implementation yet.
Thanks for reading and commenting!