Standing on a proverbial soapbox and saying how things ‘ought to be’ often becomes tiresome. In the data world, however, we commonly see big problems caused by small decisions, all of which could easily have been avoided by making the right decision earlier. The majority of these problems stem from data quality. That’s why I’m up on the box, just this once.
When data is held in small quantities, it can be easy to make up for anything that’s lacking in quality. You know your business and the volume of data is manageable. When sales pick up, however, and data volumes increase, it can become a race against time to solve the issues before they get harder to fix.
This post is not designed to shame those not following its contents, or to claim that this is the only way to approach data quality. Instead, it is a set of observations I have gathered over time, detailing the headaches others have had when falling foul of them.
1. Free Text at Your Own Peril
When it comes to data entry, system designers can be tempted to think the more options the better. While it’s true that a free text input lets the user include the most detail, it can quickly become a nightmare for anyone working downstream of that data.
Let’s say, for example, your business sells custom pieces of jewellery. Each of your products can be different, so when it comes to recording orders, you use a free text field to identify what was made for the customer. In the early days, when orders arrive at a trickle, business reporting can be done in a small spreadsheet: the values can be summed and pie charts made.
Say you want to report on how much money was made from necklaces compared to bracelets. It may be that you have religiously added those words in the exact same way every time you processed an order. With a bit of Excel magic or human input, you can quickly categorise each piece and set up a report.
But what happens when you start to grow and more people are entering the orders? Spelling mistakes are made, and inconsistencies start to appear. The free text field quickly becomes unusable for reporting and a new method must be put in place. Consequently, someone is left with the task of backfilling the information for all previous orders.
So, what can be done to stop this from happening? Structured text is a real must when it comes to producing meaningful data that can be used for reporting. Allowing the user to choose from a fixed list of options for each record makes anyone using the data afterwards very happy. Not enough options in a list? Not a problem: additional ‘sub-types’ can be created to capture that extra information.
This rule isn’t about completely abolishing the use of free text in record keeping, more to highlight where it is not useful. An item description field is a good example of where a free text field would be suitable, as there are other fields that do the job of maintaining consistency for categorisation.
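As a rough illustration of the idea, here is a minimal Python sketch of the jewellery example. The Category list, the Order fields and the sample orders are all made up for this post. The category is chosen from a fixed list, while the description stays as free text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Hypothetical fixed list of product types (an assumption for this sketch)
    NECKLACE = "necklace"
    BRACELET = "bracelet"
    RING = "ring"


@dataclass
class Order:
    category: Category  # structured: must be one of the options above
    description: str    # free text: detail for humans, not for reporting
    price: float


def revenue_by_category(orders):
    """Sum order prices per category."""
    totals = defaultdict(float)
    for order in orders:
        totals[order.category] += order.price
    return dict(totals)


orders = [
    Order(Category.NECKLACE, "Silver chain with engraved pendant", 120.0),
    Order(Category.BRACELET, "braclet, gold, custom clasp", 80.0),  # typo is harmless here
    Order(Category.NECKLACE, "Gold necklace, 45cm", 200.0),
]

totals = revenue_by_category(orders)

# A misspelt category is rejected at entry time instead of polluting reports
try:
    Category("neclace")
    typo_accepted = True
except ValueError:
    typo_accepted = False
```

The typo in the second description does no damage, because the report only ever looks at the structured category field.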
2. The Power of Linking by Numerical ID
Businesses are complex and can have many systems for different functions. It’s very common to have separate systems for commerce and customer relationship management (CRM). Even within a single system, the number of tables with important information can often be greater than ten.
Whether you need to build a report or investigate an issue, joining data from multiple tables is commonplace to get to the right answer. Most people who have used data at a business have had the frustrating experience of a VLOOKUP in Excel not working due to spelling or formatting issues. Now imagine this at enterprise scale. It’s easy to see how quickly a task becomes impossible for this reason.
As with most issues, prevention is the key to solving future headaches. Where possible, always make sure that important lists like departments or items have a numerical ID. Numbers are your biggest ally in big data when it comes to analysing or linking. They’re far less likely to cause issues like inconsistent capital letters, spaces and typos.
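To make the difference concrete, here is a small Python sketch that links the same records twice, once on a typed department name and once on a numerical ID. The department list and staff records are invented for illustration.

```python
# Hypothetical department list, keyed by numerical ID
departments = {1: "Sales", 2: "Customer Support"}

# Hypothetical staff records carrying both the typed name and the ID
staff = [
    {"name": "Asha", "department_name": "Sales", "department_id": 1},
    {"name": "Ben", "department_name": "sales ", "department_id": 1},
    {"name": "Cara", "department_name": "Customer Suport", "department_id": 2},
]

known_names = set(departments.values())

# Linking on the typed name: only exact matches survive
matched_by_name = [s["name"] for s in staff if s["department_name"] in known_names]

# Linking on the ID: every record finds its department
matched_by_id = [s["name"] for s in staff if s["department_id"] in departments]
```

Here only one of the three records links on the name, because of a stray space, a lowercase letter and a typo, while all three link on the ID.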
While it’s not the most glamorous topic in the world of data, getting this point alone right could save countless hours of late-night data wrangling, when proper prevention would have had you logged off and enjoying your evening.
3. What Goes In... Comes Out
The age-old adage of not being able to polish certain things is definitely true a lot of the time, but especially true when it comes to data.
We have reached a very exciting time in the world of data where complex multi-faceted models can be made very quickly to predict trends, understand problems, optimise strategy and make informed recommendations. The computational power available in modern laptops, combined with freely available data science packages means that any person or business can incorporate these with the right know-how.
There is often little mention, though, of the quality of data these models require. Businesses are often enticed by services promising to leverage the “latest AI technologies” that will let you become fully “data-led” and a company of the future, but those services fail to mention that if poor-quality data goes in, the results and insights that come out will be equally poor.
The point of this topic isn’t to say that these services are completely out of reach for those with incorrect or incomplete data. There are many amazing companies out there that can work hand in hand with your business to get your data to the quality required. Better yet, you can instil the practices in this article to get yourself to that place too. It is important, however, to be wary of companies trying to sell you oil that has a vague hissing sound.
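One simple way to get a feel for whether data is ready for modelling is to measure how complete each field is before anything is fed in. The sketch below is only one possible check, and the field names and records are hypothetical.

```python
def completeness(records, fields):
    """Return the share of records with a usable value for each field."""
    total = len(records)
    report = {}
    for field in fields:
        filled = sum(
            1
            for record in records
            if record.get(field) is not None and str(record.get(field)).strip() != ""
        )
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical customer orders with some gaps
records = [
    {"customer_id": 1, "region": "North", "order_value": 120.0},
    {"customer_id": 2, "region": "", "order_value": 80.0},
    {"customer_id": 3, "region": None, "order_value": None},
    {"customer_id": 4, "region": "South", "order_value": 200.0},
]

report = completeness(records, ["customer_id", "region", "order_value"])
```

In this made-up sample, only half of the records have a region, so any model that relies on region is working from half the picture.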
4. Start As You Mean to Go On
At the risk of starting another point with a popular saying, the phrase “A stitch in time saves nine” fits perfectly. Small changes to your data far along in your business journey can have big implications.
Say you suddenly decide that it’s now important to track more information about each customer or order. What happens to the old data that you held before that change? You have three options at this point.
- The first is the hardest: you can manually backfill the older data with the required new information (if you have it). This will mean days to weeks of a person’s time going through and updating each record in turn. It’s not fun for them, and it’s not fun for your business to miss out on what they would normally be doing in that time.
- The second is the easiest but comes with a prerequisite: enriching the data with data from an external source. Many companies offer this type of service, provided the data is available from said source, though it’s rarely the case that there is a magical dataset that suits your needs. Usually, the only place that data can come from is your own business.
- The third choice is the worst of the options. Ignoring the records that were created before the change was made can have wide-reaching consequences for reporting, modelling and business management. Simply cutting chunks out of your historical data will not suddenly solve the problem.
While those are the options that you face in that situation, prevention is the path of least resistance. Each time a new part of your data model is created, it pays dividends to predict what information you need to capture, not just now, but for the future also.
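The cost of the third option can be shown with a small Python sketch. It assumes a hypothetical "channel" field that was only added part-way through the order history, so older records have no value for it.

```python
from collections import defaultdict

# Hypothetical orders: "channel" was only captured from order 3 onwards
orders = [
    {"order_id": 1, "value": 100.0, "channel": None},
    {"order_id": 2, "value": 150.0, "channel": None},
    {"order_id": 3, "value": 200.0, "channel": "online"},
    {"order_id": 4, "value": 50.0, "channel": "in_store"},
]


def revenue_by_channel(orders):
    """Total revenue per channel, keeping older records in an 'unknown' bucket."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["channel"] or "unknown"] += order["value"]
    return dict(totals)


totals = revenue_by_channel(orders)

# Option three: ignore every record created before the change
dropped_total = sum(o["value"] for o in orders if o["channel"] is not None)
full_total = sum(totals.values())
```

In this invented sample, ignoring the older records halves the reported revenue. Keeping them under an explicit "unknown" label at least preserves the totals while the gap is visible.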
5. If it’s Important, Capture It!
With the modernisation of the standard business tech stack, fewer decisions are made on gut feel and more are based directly on observations in the data. It’s much easier to argue your business should focus on, or invest money in, a certain area when you have the data to back up your point. Whether it’s growth in a new niche sector or a trend in a certain customer demographic, having the data at the right time is invaluable.
For a business to be truly data-led, it needs the data. As shown in the points above, it needs to be captured in the correct way. It’s just as crucial, however, that if the data is important, it is captured in the first place. It’s a precious resource that can mean the difference between a positive and a negative quarter for your business. It may seem annoying to have to capture data that you don’t currently use, but rest assured that, further down the line, that information can be included in models, tracked by performance indicators and used to influence key business decisions.
Set yourself up for success
It’s very easy for armchair experts to sit back and tell you how things should be done in a perfect world. In reality, we know that decisions often need to be made quickly, without all the knowledge and facts to hand. The rules in this post are all set in an ideal world, but if you can get close to following a few of them when the decisions come up, you will be thanking yourself further down the line.
If you are a business that has one or many of the issues mentioned above, why not reach out to The Data Refinery team? We’d love to have a chat with you and see whether we can help you maximise the full potential of your data.