Data Integration Blog
In this blog you can find posts with useful links to, news on and analysis of things like data integration, mashups, data quality, data warehousing, application integration (EAI), data management…the list goes on.
Monday 6 October 2008, 12:57 PM
Data Quality - Upstream or Downstream?
I keep wondering how come data quality checks still exist as a procedure performed once in a while, rather than as a part of the front-end process. How come most companies start worrying about the quality of their data only when it's already dirty and in use? How come it doesn't occur to them that the quality of data needs to be thought through before it's actually captured? Even at the early stages of data capture, data quality already plays an important role in the future of the company. It is the early stages that determine how your data turns out and whether it will pay off later on.
A recent Forrester paper titled "It's Time To Invest In Upstream Data Quality" suggests that when companies see an immediate ROI from short-term data cleanup, it's hard to justify front-end investments that may take years to pay off.
At the same time, Forrester says, IT budget planning committees tend to avoid the existing data quality (DQ) products that allow integrating downstream data hygiene rules into front-end processes, citing the solutions' cost and complexity.
The result? I&KM pros quickly reach diminishing returns on data quality investments and need even more investment later on to catch up on missed opportunities like verifying customer contact information, standardizing product data, and eliminating duplicate records.
The paper explores how to break this cycle by identifying the optimal DQ solution downstream and auditing the source systems that cause the most significant data issues upstream.
Comments on this post
We have been evangelizing the benefits of using upstream data quality tools for many years and have clients who have benefited from the process.
In fact, we would assert that anyone doing electronic marketing via a process that stores prospect or customer data directly from the web into a database needs to incorporate upstream data quality tools into their business process.
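As a minimal sketch of what an upstream check at the point of web capture might look like, the Python below validates and normalizes a submitted record before it would be stored. The field names (`name`, `email`), the deliberately simple email pattern, and the `validate_signup` helper are all illustrative assumptions, not taken from any particular DQ product.

```python
import re

# Deliberately simple pattern (an assumption for illustration); real-world
# email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(form):
    """Return (clean_record, errors) for a submitted web form.

    `form` is a dict of raw strings; the field names are hypothetical.
    """
    errors = []
    name = " ".join(form.get("name", "").split())  # collapse stray whitespace
    email = form.get("email", "").strip().lower()  # normalize before checking

    if not name:
        errors.append("name is required")
    if not EMAIL_RE.match(email):
        errors.append("email is not well formed")

    clean = {"name": name, "email": email} if not errors else None
    return clean, errors


# Only a record that passes the checks would be written to the database.
record, problems = validate_signup({"name": "  Ada   Lovelace ", "email": "Ada@Example.com "})
rejected, reasons = validate_signup({"name": "", "email": "not-an-email"})
```

The point of the design is that the reasons for rejection go back to the person entering the data, while the details are still fresh, instead of surfacing months later in a cleanup job.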
Great Post Alena, hadn't realised Forrester had released this, thanks for drawing my attention to it.
Resolution/prevention at source has to be one of the founding principles of good data quality management and will reap massive benefits if done correctly.
It doesn't necessarily need tools either. Sure, they help, but what is far more important is understanding the data quality rules that will prevent dirty data flowing downstream, and having an effective and controlled process to enact those rules and continuously improve, like you say.
To give you an example of how effective this can be: in 1992, when I started out in data quality, the company I worked for were doing data the wrong way, cleansing it downstream and paying lip service to DQ on incoming data from 3rd parties and various processing stages.
We simply defined the rules that were required at the source of each "data river" and enacted them using a mixed bag of shell scripts, SQL, adequate reporting etc. The result was astonishing: our delivery lead times dropped from 4-6 months to 3-4 weeks.
Admittedly this took some time and a cultural change to implement, but the further up the river you clean, the longer and deeper the benefits travel.
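To make the idea of rules enacted at the source a little more concrete, here is a rough sketch in Python with SQLite. It is not the setup described above; the table names (`staging`, `quarantine`), the columns, and the two rules are hypothetical and chosen only for illustration. Each rule is a SQL predicate that is true for a row that breaks it, and offending rows are moved aside before the feed travels any further.

```python
import sqlite3

# Hypothetical rules for one incoming feed: rule name -> SQL predicate that
# is true for a row that VIOLATES the rule.
RULES = {
    "missing_postcode": "postcode IS NULL OR TRIM(postcode) = ''",
    "negative_amount": "amount < 0",
}


def gate_feed(conn):
    """Move rule-breaking rows from `staging` to `quarantine`.

    Returns the number of rows caught by each rule. A row that breaks
    several rules is quarantined under the first one that matches.
    """
    counts = {}
    for rule, predicate in RULES.items():
        counts[rule] = conn.execute(
            f"SELECT COUNT(*) FROM staging WHERE {predicate}"
        ).fetchone()[0]
        conn.execute(
            f"INSERT INTO quarantine SELECT id, postcode, amount, ? "
            f"FROM staging WHERE {predicate}",
            (rule,),
        )
        conn.execute(f"DELETE FROM staging WHERE {predicate}")
    return counts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, postcode TEXT, amount REAL)")
conn.execute("CREATE TABLE quarantine (id INTEGER, postcode TEXT, amount REAL, rule TEXT)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [(1, "AB1 2CD", 10.0), (2, "", 5.0), (3, "EF3 4GH", -1.0)],
)
report = gate_feed(conn)  # per-rule counts double as a simple report
remaining = [row[0] for row in conn.execute("SELECT id FROM staging ORDER BY id")]
```

Keeping the rules in one table-like structure means the reporting comes almost for free, and adding a rule for a new "data river" is a one-line change.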
If anyone wishes to carry out some upstream defect prevention or find out more about trapping data quality rules you may also find these links quite useful:
http://tinyurl.com/3uslat
http://tinyurl.com/4lcreb
http://tinyurl.com/3w45fa
I think one of the big problems is that a lot of applications have really basic data entry functionality built in so DQ tends to be an afterthought.
However, you can access most back-end data stores now, so I would always recommend starting with a health-check of DQ on your core business objects in the data lake to find the hotspots that are costing you most. Get some quick wins by identifying the low-hanging fruit first, and then go straight to the source of the data rivers that are causing the issue in order to implement defect prevention rather than cleansing.
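A health-check of this kind can start very small. The sketch below, in Python, ranks fields by the share of values that are blank or duplicated, so the worst hotspots come first. The `health_check` helper, the sample customer records, and the choice of "blank or duplicate" as the only problem types are assumptions made for illustration; a real check would use the rules that matter for your own business objects.

```python
from collections import Counter


def health_check(records, fields):
    """Return (field, problem_share) pairs, worst field first.

    A value counts as a problem if it is blank, or if it repeats an earlier
    value in the same field (compared case-insensitively).
    """
    total = len(records)
    scores = {}
    for field in fields:
        values = [str(r.get(field) or "").strip().lower() for r in records]
        blanks = sum(1 for v in values if not v)
        counts = Counter(v for v in values if v)
        # Every occurrence beyond the first of a value counts as a duplicate.
        duplicates = sum(n - 1 for n in counts.values())
        scores[field] = (blanks + duplicates) / total if total else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample of a core business object.
customers = [
    {"email": "a@example.com", "phone": "555-0100"},
    {"email": "A@example.com", "phone": ""},
    {"email": "b@example.com", "phone": None},
    {"email": "c@example.com", "phone": "555-0101"},
]
hotspots = health_check(customers, ["email", "phone"])
```

Here `phone` comes out as the worst field, which would be the cue to go and look at whichever entry screen or feed is supplying it.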
Great post Alena, and Mark is absolutely on the money there: just because it's now easy to source lots of attractive new data "rivers" doesn't mean we can just fix it in some behemoth data lake!
- Dylan Jones
