Joe McKendrick points to a survey in which 53 percent of respondents had 11 terabytes or more of unstructured data. I noticed this Taneja Group survey a couple of weeks ago when Paul Weinberg discussed its findings, but didn't have time to read the details.
[Nothing like being sick in a hotel room to give you more time than you ever wanted to surf around the web. Of course, it takes me five times as long to comprehend what I'm reading...imagine the mess if I tried to write code.]
Given the attention Oracle's recent virtualization announcements drew at OpenWorld, it's interesting that the survey specifically points to a lack of virtualization-like offerings in the unstructured data space. The analyst (Steve Norall) is quoted as believing physical unstructured data is unlikely to migrate to central storage, considering it the equivalent of a hardware infrastructure upgrade.
Factoid from the survey: key drivers cited for unstructured data's growth were Microsoft Office, email, and backup/archival requirements.
Joe points to a growing interest in capturing and using knowledge contained in these unstructured stores. This meshes with some of what I heard at the IOD conference -- where at different times it was suggested that
- the next phase of BI would be to enhance business processes in real-time, and that
- BI would incorporate more knowledge gleaned from unstructured data
Peter Scott pondered this question in his post Data quality thoughts a few weeks ago. He rightly points out that contextual indexing is a reasonable starting point, though challenging across larger bodies of text. (btw, I owe Peter Scott an apology -- he responded to my unstructured data question, but I have since changed my comment management service and lost the comment in the process. Sorry!)
In a response to Peter, Ralf Scharnetzki points to Amazon's challenges with the combination of structured and semi-structured data, such as product descriptions from 3rd-party sellers or customer reviews. He gives a specific example of the challenges this can pose for a customer. (Product descriptions are notoriously hard to standardize, as discussed in a prior post.)
So to recap:
- We have terabytes of unstructured data.
- We might choose to incorporate it into our business (or consumer) decisions and processes.
- We're not sure of its quality.
- Other than applying some good indexing, we're not sure how to assess it.
I believe the first step is to identify a list of "quality dimensions" relevant to assess in unstructured data. The standard 'dimensions' of data quality applied to structured data would need to be revisited before considering applying them to the unstructured world, or even the semi-structured. Time to start scouring the academic journals....

6 comments:
Maybe you will enjoy watching this video http://youtube.com/watch?v=dtFroEJN1nI with Luis von Ahn, assistant professor in the Computer Science Department at Carnegie Mellon University, explaining how gaming and collaboration can be used to get people to put "tags" on unstructured data (photos) for free. This is for sure one of the most creative solutions I have seen so far to address the challenges of structured/unstructured data.
Thanks for the link Ralf. It looks like an interesting way to gather readers' personal assessments of what they're viewing. It's probably reasonable to consider human assessments as one dimension of quality, though I believe there are many other dimensions to consider as well.
>>Time to start scouring the academic journals....
Just a thought: ethnographers have been analyzing unstructured data (i.e., fieldnotes) for some time now, so that literature might be a place to start.
Time to display my ignorance: I'd never heard of ethnography before today (frightening when you consider my dad has an anthropology degree).
After a quick trip to Wikipedia to find out what it was, I have to say this sounded pretty interesting.
Thanks for the tip!
I think one of the problems isn't unstructured data, but non-standard data that can be structured. For instance, an address has rules, but different rules for different countries. Google Maps has solved this quite well; you type in unstructured data and it usually is spot on.
Something like a product detail or a consumer review is difficult, particularly because there are different standards for different products. Cost and value are pretty 'standard' for product details and reviews (respectively), but specs for a camera and a book are completely different (a 'spec' for a book would include genre, format, etc).
It would be interesting to try to work with rules, but like other things in AI such as facial recognition, I don't think it's going to work well. I have no idea how to have a machine make qualitative deductions. :-\
Thanks for the input Sheeri. I agree that it's a challenging topic. Free-form text (be it in product descriptions, customer reviews, or full-on Word documents) has an unending array of variations and varieties. It boggles the mind.
I probably shouldn't have used a product description example -- it's closer to the structured data side of the world than the memos and emails in the unstructured universe. I actually do believe product descriptions can be intelligently standardized if we apply the existing research to this domain. Name and address matching are accomplished via "bucketing" of the component parts, reassembling, and matching (with either probabilistic or deterministic rule sets). And, the rule/pattern systems doing this are nearly always tuned by humans on a per-installation basis. We can do the same thing with products, once we take the time to build out a robust rule set.
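To make the "bucketing" idea a bit more concrete, here's a rough sketch of the deterministic flavor for street addresses. Everything in it is an assumption for illustration: the tiny lookup tables, the bucket names, and the function names are all made up, and a real installation would have a far larger, human-tuned rule set (and often probabilistic scoring instead of an exact comparison).

```python
import re

# Hypothetical, hand-tuned rule tables: abbreviation -> standard form.
STREET_TYPES = {"st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard"}
DIRECTIONS = {"n": "north", "s": "south", "e": "east", "w": "west"}


def bucket(address):
    """Sort the tokens of a free-form address line into component buckets."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    buckets = {"number": "", "direction": "", "name": [], "type": ""}
    for tok in tokens:
        if tok.isdigit() and not buckets["number"]:
            buckets["number"] = tok
        elif tok in DIRECTIONS or tok in DIRECTIONS.values():
            buckets["direction"] = DIRECTIONS.get(tok, tok)
        elif tok in STREET_TYPES or tok in STREET_TYPES.values():
            buckets["type"] = STREET_TYPES.get(tok, tok)
        else:
            # Anything the rules don't recognize is treated as the street name.
            buckets["name"].append(tok)
    return buckets


def reassemble(buckets):
    """Rebuild a standardized string from the buckets in a fixed order."""
    parts = [buckets["number"], buckets["direction"],
             " ".join(buckets["name"]), buckets["type"]]
    return " ".join(p for p in parts if p)


def is_match(a, b):
    """Deterministic rule: two addresses match if their standardized forms are equal."""
    return reassemble(bucket(a)) == reassemble(bucket(b))


print(reassemble(bucket("123 N. Main St.")))
print(is_match("123 N. Main St.", "123 North Main Street"))
```

The same shape could, in principle, carry over to product descriptions: swap the lookup tables for brand, model, and attribute rules, and expect the tables to need tuning for each product category.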
I don't mind quality not being fully assessable via machine: even structured data requires human intervention to determine quality. Once we move beyond single-system referential integrity we get into a surprisingly 'fuzzy' world in which data may or may not be of good quality, depending on the organization's business rules (and the rules of other systems with which the data must interact).
Ok, I'm starting to ramble... Thanks.