[Side note: The first time I tried to write this post my mind kept jumping to variations in standardization of phone numbers. (If you're looking for the logic of that mental transition, there isn't any. Welcome to the weird and wonderful world of my brain.) I finally gave up and wrote this post on adjusting data quality initiatives to organizations' needs -- using phone numbers as an example.]
I was thinking earlier about Pete's comment that he's had more challenges with data quality in the Product dimension than with customers. I'm inclined to agree -- not because customer/vendor information isn't challenging -- it is. Rather, I think we've put more collective effort into developing quality rules/tools for human attributes (name, address, and phone) than we have for product attributes.
Ah, the vagaries of Product. Here's just a tip-of-the-iceberg list:
- Some materials are purchased, while some are created locally. Some similar materials are purchased from different vendors under different names and packaging, for interchangeable purposes.
- Product's further complicated by the need to track varying levels of "completeness." ERP systems track raw materials, semi-finished goods, finished goods, and packaged goods, all under the guise of the domain "Product."
- Finished goods may be named (and packaged) differently for promotions, for different geographic regions, for different languages within the same region, and when combined with other finished goods.
- Goods' packaging rolls up to alternate package options. Think of units grouped into cases, grouped into pallets -- except when sold to a particular high-value customer in which case they might be specially grouped into a different quantity.
When working with differently entered data, the first step is to "standardize" the records. This means recognizing the distinct components within free-form text, breaking them into their distinct bits, reassembling them in a standardized structure, and THEN running them through a match process.
An example from name and address standardization (U.S. data format):
John A. Smith, Sr; 123 South Main Street; New York, New York; 10001-1235
J. A. Smith; S. 123 Main St.; New York, NY; 10001
John Smith; 123 S. Main; NY, NY; 10001-1235
What are the parts?
- First Name
- Middle Name
- Last Name
- Suffix
- Street Number
- Street Direction
- Street Name
- Street Type
- City
- State
- Zip5
- Zip4
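To make the "break into parts, then reassemble" step concrete, here is a minimal Python sketch that pulls the street and zip components out of the three sample records. The lookup tables and the function names (`standardize_street`, `split_zip`) are made up for illustration; real address standardization tools use far larger dictionaries and handle cases this toy version gets wrong (a street actually named "South Street," for instance).

```python
import re

# Tiny illustrative lookup tables (assumptions, not a real postal dictionary).
DIRECTIONS = {"n": "N", "north": "N", "s": "S", "south": "S",
              "e": "E", "east": "E", "w": "W", "west": "W"}
STREET_TYPES = {"st": "ST", "street": "ST", "ave": "AVE", "avenue": "AVE"}


def standardize_street(line):
    """Split a free-form street line into number, direction, name, and type."""
    tokens = [t.strip(".,").lower() for t in line.split()]
    parts = {"number": None, "direction": None, "name": None, "type": None}
    name_tokens = []
    for tok in tokens:
        if not tok:
            continue
        if tok.isdigit() and parts["number"] is None:
            parts["number"] = tok
        elif tok in DIRECTIONS and parts["direction"] is None:
            parts["direction"] = DIRECTIONS[tok]
        elif tok in STREET_TYPES and parts["type"] is None:
            parts["type"] = STREET_TYPES[tok]
        else:
            # Anything unrecognized is treated as part of the street name.
            name_tokens.append(tok.upper())
    parts["name"] = " ".join(name_tokens) or None
    return parts


def split_zip(text):
    """Return (zip5, zip4); zip4 is None when only five digits are given."""
    match = re.fullmatch(r"(\d{5})(?:-(\d{4}))?", text.strip())
    if match is None:
        return (None, None)
    return (match.group(1), match.group(2))


if __name__ == "__main__":
    for street in ("123 South Main Street", "S. 123 Main St.", "123 S. Main"):
        print(standardize_street(street))
    print(split_zip("10001-1235"), split_zip("10001"))
```

The first two street lines come out identical once standardized, while the third is missing a street type. A match process then has to decide whether a missing component counts as a mismatch or just a gap.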
Now let's look at a material description, in which the same item is entered differently in different systems.
1" grommet, oblong, steel, #5, case,
#5 1" plain obl stl grommet, case
1" number 5 grommet, stainless steel, oblong, 500
#5 1" stainless grommet, plain obl, case
There's been significantly less work done in this area than in the prior name/address example. In my simplistic, made-up example, we have name, size, type, shape, underlying material, quantity, and package unit. The pieces are pretty easy for a human to detect; the challenge is developing programmatic rules.
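As a rough illustration of what such programmatic rules might look like, here is a minimal Python sketch that tags the tokens in the four grommet descriptions above. Everything in it is an assumption made for the example: the vocabularies, the `parse_material` name, and the choice to keep "steel" and "stainless steel" as separate values (whether those are really the same material is a business rule, not something a parser can decide).

```python
import re

# Hypothetical vocabularies covering only the sample descriptions.
NAMES = {"grommet": "grommet"}
SHAPES = {"oblong": "oblong", "obl": "oblong"}
MATERIALS = {"steel": "steel", "stl": "steel", "stainless": "stainless steel"}
PACKAGE_UNITS = {"case": "case"}


def parse_material(description):
    """Tag each token of a free-form material description with an attribute."""
    text = description.lower()
    # Normalize multi-word phrases before tokenizing.
    text = re.sub(r"\bnumber\s+(\d+)", r"#\1", text)      # "number 5" -> "#5"
    text = re.sub(r"\bstainless steel\b", "stainless", text)
    tokens = text.replace(",", " ").split()

    parsed = {"name": None, "size": None, "type": None, "shape": None,
              "material": None, "quantity": None, "package_unit": None,
              "other": []}
    for tok in tokens:
        if re.fullmatch(r'\d+(?:\.\d+)?"', tok):
            parsed["size"] = tok
        elif re.fullmatch(r"#\d+", tok):
            parsed["type"] = tok
        elif tok in NAMES:
            parsed["name"] = NAMES[tok]
        elif tok in SHAPES:
            parsed["shape"] = SHAPES[tok]
        elif tok in MATERIALS:
            parsed["material"] = MATERIALS[tok]
        elif tok in PACKAGE_UNITS:
            parsed["package_unit"] = PACKAGE_UNITS[tok]
        elif tok.isdigit():
            parsed["quantity"] = tok
        else:
            # Unrecognized tokens (e.g. "plain") are kept, not discarded.
            parsed["other"].append(tok)
    return parsed


if __name__ == "__main__":
    samples = [
        '1" grommet, oblong, steel, #5, case,',
        '#5 1" plain obl stl grommet, case',
        '1" number 5 grommet, stainless steel, oblong, 500',
        '#5 1" stainless grommet, plain obl, case',
    ]
    for sample in samples:
        print(parse_material(sample))
```

Even this toy version surfaces the hard questions: is "500" the same thing as "case"? Does a missing "plain" matter? Those are exactly the rules that would need an agreed-upon vocabulary.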
It'd be interesting to see an open source ontology and rule set developed for Product attributes. There's been some work at the data interchange level; i.e., the information necessary to transfer product information between customer and vendor. If deeper work is publicly available, I'm not familiar with it. I'd love to see something robust built out in this space... (I'm working out a set of rules for my company's data quality product, but that's intellectual property I can't share.) Anyone interested in taking up the flag for the betterment of productkind?
