I haven't written about data quality for ages! But the subject is as present as ever, and there is still such a long way to go! If Analytics has a big, bright pink elephant in the room, data quality is it!
A couple of months ago, Molly Vorwerck posted a link in the #measure Slack to an article about The Right Way to Measure ROI on Data Quality, and if it says "data quality" in the title, chances are I'll read it.
What I took away from this specific article are the two metrics "time to detection" (TTD) and "time to resolution" (TTR). They make a lot of sense, and if you can put an amount on them, you'll have much more leverage, as the article explains.
For me, reading about TTD was a bit of a light bulb moment!
This is precisely what I always had in mind when I thought/spoke/wrote about why and where a testing framework would make the most sense.
Your TTD, if you do not have any testing tool or framework, can easily be weeks, and sometimes, with Analytics, errors will not be detected at all. Frankly, that is unacceptable!
We are in this because data is what we use to make informed choices! We want to use data! We must be able to trust our data!
So, what can we do?
"To infinity and beyond!"
Well, we ideally want TTD to be as small as possible.
Your TTD, if you are using an end-to-end tool (ObservePoint would be a well-known example), can be closer to days, or even minutes. That is a huge step forward!
Whatever you determine the cost of bad data to be, you have just cut it down to maybe a tenth, probably even less than that.
A lot of people spend money on ObservePoint and similar tools because, overall, their data will be more valuable. They pay so that they can reasonably say they trust their data.
Can we go further? Yes, we can!
Your TTD, if you are using a tool for testing in UAT or during integration testing (DataTrue would be an example of such a tool), or if you have a regression test that runs before go-live, is close to 0.
That is where I would want to be. Make sure that if it is wrong, it cannot go live undetected.
When I built the "Site Infrastructure Tests" framework and then added a Puppeteer version, back in the day, and when I was banging on about applying those in your CI/CD pipeline, if you have one, I was thinking about a TTD of 0.
I was thinking about a setup in which the Analytics team can be 100% sure that no one else breaks their stuff undetected.
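To make that less abstract, here is a minimal sketch of what such a check could look like in a CI/CD job, using Puppeteer. This is not the actual "Site Infrastructure Tests" framework; the staging URL, the digitalData data layer structure, and the Adobe Analytics beacon path are assumptions for illustration.

```javascript
// Minimal sketch of a CI/CD data layer check with Puppeteer.
// Not the actual "Site Infrastructure Tests" framework; the URL, the
// window.digitalData structure, and the "/b/ss/" beacon path are
// assumptions for illustration.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Collect outgoing Analytics requests so we can assert one fired.
  const analyticsHits = [];
  page.on('request', (req) => {
    if (req.url().includes('/b/ss/')) {
      analyticsHits.push(req.url());
    }
  });

  await page.goto('https://staging.example.com/some-product', {
    waitUntil: 'networkidle0',
  });

  // Read the data layer from the page context.
  const dataLayer = await page.evaluate(() => window.digitalData || null);

  await browser.close();

  // Fail the build if either check does not hold.
  if (!dataLayer || typeof dataLayer.product?.price !== 'number') {
    throw new Error('Data layer missing or price is not a number');
  }
  if (analyticsHits.length === 0) {
    throw new Error('No Analytics beacon was sent on page load');
  }
  console.log('Data layer and tracking look good');
})().catch((err) => {
  console.error(err.message);
  process.exitCode = 1; // make the CI job fail
});
```

Run that on every build, and a broken data layer or a missing beacon never makes it past the pipeline.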
Did I ever mention my favourite example? A retailer, where we tracked prices as net prices. They had a data layer, of sorts, and the net price was part of that DL. We were happy and it worked. Then, one day, revenue went up 19%. For a moment, some people were happy, but then someone noted that 19% is the same number as VAT in Germany, ha ha, what a coincidence!
Wait a minute...
As it turned out, a developer had decided that the price in the DL must have been wrong, and had replaced it with the gross price. They didn't tell anyone, so the change went live, and boom.
We fixed it in JavaScript and told the developer to please never do anything like that again.
Two weeks later, with the next release, you guessed it, they had "fixed" it back, obviously without telling us, again. So for a couple of hours, we tracked net revenue minus a percentage, then we fixed it, again.
Any test run during integration, or in UAT, would have caught that, easily.
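For that specific case, the assertion is almost trivial: check that the price in the data layer is the net price, i.e. the gross price shown on the page divided by 1.19. Here is a sketch in the same Puppeteer style; the .product-price selector, the digitalData structure, and the German number format handling are all assumptions for illustration.

```javascript
// Sketch of an assertion that would have caught the net/gross mix-up.
// Assumes the on-page price is gross (incl. 19% German VAT) and the
// data layer price is supposed to be net; the ".product-price" selector
// and window.digitalData structure are hypothetical.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://staging.example.com/some-product');

  const { displayedGross, dataLayerPrice } = await page.evaluate(() => {
    const text = document.querySelector('.product-price').textContent;
    return {
      // "1.234,56 €" -> 1234.56 (German number format)
      displayedGross: parseFloat(
        text.replace(/[^\d,.]/g, '').replace(/\./g, '').replace(',', '.')
      ),
      dataLayerPrice: window.digitalData.product.price,
    };
  });

  await browser.close();

  const VAT = 1.19;
  const expectedNet = displayedGross / VAT;

  // Allow a cent of rounding tolerance.
  if (Math.abs(dataLayerPrice - expectedNet) > 0.01) {
    throw new Error(
      `Data layer price ${dataLayerPrice} is not the net price ` +
      `(expected ~${expectedNet.toFixed(2)} for gross ${displayedGross})`
    );
  }
  console.log('Data layer price is net, as expected');
})().catch((err) => {
  console.error(err.message);
  process.exitCode = 1;
});
```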
"Yes, we could!"
I'm not precious about my framework, others can do it better. But it is sad, and a little surprising, that testing hasn't become an integral part of our culture by now.
Other people have made great progress, standardising implementations and making maintenance easier. Apollo, by the crazy guys at Search Discovery, is one such thing.
On top of that, we could, relatively easily, all have setups with 0 TTD! There is no technical reason why we couldn't.
Data is so important, and especially data that can be trusted!
If your TTD is weeks, years, or potentially infinity, how can you be sure that your analysis is correct? How much time and effort do you put into second guessing, double-checking, cross-checking, filtering, and massaging your data?
How difficult is it for you to get people to use your data? To convince them it is right?
Or, if I want to really call it out: what is your work worth if you cannot trust your data?
Ah, well, you know what I mean.
If there is no technical reason why we couldn't do this, and if a TTD of 0 is such a valuable thing to aim for, what are we waiting for?
I would love to hear your reasons, and I have some ideas about what they might be.
1 - Resources
You are part of a team that is already stretched. You do not have enough people, or enough means to really make a difference.
2 - Development don't want to, or they have no resources
The second most common reason, I think. You may have spoken with dev, and they agree, but they simply have too much to do to accommodate you.
3 - Don't care
Maybe you don't care. Maybe you think that what you have is good enough.
Future
I am guessing that resources are the top problem, both in the Analytics team itself, and on the dev side.
The article I linked above has ideas about how to put a number against better data quality. Maybe it can help you make the case.
I fear that reason number 3 is more common than most of us want to admit. A lot of us probably "make it work", somehow.
And sorry if this sounds brutal, but I believe it is the truth: your work as an analyst is worth only as much as the data that you work with.
If you have rubbish data, you will never be able to deliver gold.
So, read that article, estimate your TTD and what it costs you, then change it!
Opinions? Other reasons?