How to Choose a Big Data Quality Model

Big data is the collection of massive amounts of data — structured or unstructured; digital or manual — encompassing all the types of data that an organization generates that can ultimately be used to provide business intelligence. Of course, the most accurate insights hinge on assurances of big data quality — which can be a challenge in light of the growing volumes of big data being generated each day.

In this big data era, more and more people demand immediate access to the right data from every source. Your organization likely faces the same pressure: you do not have enough resources to bring all of this data in accurately, nor can you keep up with business users' growing demands for quality big data delivered quickly.

Data quality management scenarios

Not long ago, the primary method of data delivery was batch processing: data arrived in large chunks at set intervals, leaving time to process it before sharing it with a small number of data consumers. The data was housed in data warehouses and reviewed by a team of experienced data professionals armed with well-defined methodologies and well-known best practices. Finally, using a business intelligence tool, they would define a semantic layer, such as a data catalog, and predefined reports. Data quality was easier to control, and after the review, the data could be consumed for analytics.

We can compare this model to the encyclopedia model. Before we entered the 21st century, we had encyclopedias such as the Encyclopedia Britannica. Only a handful of experts could author the encyclopedia.

The Encyclopedia Britannica was written by about 100 full-time editors together with around 4,000 highly skilled contributors, including Nobel Prize winners and former US presidents. That level of expertise made the data quality of the encyclopedia exceptionally high. But the encyclopedia model fails to scale to today's big data quality demands, where you expect accurate articles on every single topic, in your native language, for example.

Managing big data sprawl

Data lakes then came to the rescue as an agile approach to provisioning data. You generally start with a data lab approach targeting a few data-savvy data scientists. Using cloud infrastructure and big data technologies, you can drastically accelerate the ingestion of raw data. With schema-on-read, data scientists can autonomously turn that raw data into smart data, as the sketch below illustrates.
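As a minimal sketch of what schema-on-read can look like in practice, the following example uses PySpark (an assumed stack; the path and field names are illustrative, not from this article). Raw JSON files land in the lake without any upfront modeling, and a schema is applied only when the data is read:

```python
# Minimal schema-on-read sketch with PySpark.
# The bucket path and field names are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The raw zone holds data exactly as it arrived; no schema was enforced at write time.
raw_path = "s3://data-lake/raw/orders/"

# The schema is applied only at read time, so ingestion stays fast and each
# consumer can project just the structure they need.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

orders = spark.read.schema(order_schema).json(raw_path)
orders.createOrReplaceTempView("orders")  # now queryable by data scientists
```

The point is the ordering of work: the lake accepts data as-is, and structure (and therefore quality expectations) is imposed downstream by whoever consumes it.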

The next step is to share this data with a wider audience, so you create a new data layer for analytics, targeting the business analyst community. Because you are now targeting a wider audience with different roles, you realize that you need to establish stronger big data quality and governance. The third step is to deliver information and insights to the whole organization; again, the prerequisite is to establish another layer of data governance.

This more agile model has multiple advantages over the previous one. It scales across data sources, use cases, and audiences. Raw data can be ingested as it comes with minimal upfront implementation costs, while changes are straightforward to implement.

But this approach created a big challenge, because data governance was not considered from the start. It is similar to what Facebook is experiencing with its data practices. Facebook built a platform with no limits on the content it can ingest or the number of users and communities it can serve. But because anyone can enter any data without control, data quality and trust are almost impossible to establish. Facebook has said it will hire 10,000 employees for its trust and safety unit and plans to double that headcount in the future.

Collaborative and governed big data

To ensure big data quality, it is important to work with data as a team from the start. Otherwise, you may become overwhelmed by the amount of work needed to validate data before it can be trusted. By introducing a Wikipedia-like approach where anyone can potentially collaborate in data curation, there is an opportunity to engage the business in contributing to the process of turning raw data into something that is trusted, documented, and ready to be shared.

By leveraging smart and workflow-driven self-service tools with embedded data quality controls, you can implement a system of trust that scales in this big data era.
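To make "embedded data quality controls" concrete, here is a minimal sketch of a rule set in plain Python. The rules and field names are hypothetical and not tied to any specific self-service tool; the idea is simply that checks run automatically as data flows through a workflow, and the results are visible to curators:

```python
# Minimal sketch of embedded data quality rules (hypothetical rule set).
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when a record passes

RULES = [
    Rule("order_id is present", lambda r: bool(r.get("order_id"))),
    Rule("customer_id is present", lambda r: bool(r.get("customer_id"))),
    Rule("amount is non-negative",
         lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
]

def validate(records: Iterable[dict]) -> dict:
    """Return pass/fail counts per rule so curators can see where trust breaks down."""
    results = {rule.name: {"passed": 0, "failed": 0} for rule in RULES}
    for record in records:
        for rule in RULES:
            key = "passed" if rule.check(record) else "failed"
            results[rule.name][key] += 1
    return results

# Example: two records, one of which violates the amount rule.
print(validate([
    {"order_id": "A1", "customer_id": "C9", "amount": 42.0},
    {"order_id": "A2", "customer_id": "C7", "amount": -5.0},
]))
```

In a governed, collaborative setup, business users would add or adjust rules like these through a self-service interface rather than code, while IT keeps the execution and reporting consistent.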

IT and other supporting organizations, such as the office of the CDO, need to establish the rules and provide an authoritative approach to governance where it is required (for example, for compliance or data privacy).

You need to establish a more collaborative approach to big data quality in parallel, so that the most knowledgeable among your business users can become content providers and curators.

Factor in the right people

Do not underestimate the human factor once you are ready to deliver your big data quality management project. Remember: a successful data quality project combines the right people, equipped with the right tools, following the right approach. What is specific about data quality is that your team must come from different departments. You will not succeed if only IT is in charge. The same goes for the business, which will need IT's knowledge, skills, and governance to avoid shadow IT and poorly governed projects.

Trust is at the heart of any big data quality project. And this also applies to every team member. Trust is not a given but a perception that must be continuously earned; team members must feel confident that they can delegate a task to anyone else on the team and rely on them to deliver. Trusted data always relies on trustworthy people. This is what we mean when we say data is a team sport; trust is the essence of a successful data management project. To create that trust among team members, we suggest applying the RUMBA methodology. Taught in Six Sigma training, RUMBA is a powerful acronym that stands for Reasonable, Understandable, Measurable, Believable, and Achievable.

By applying this principle within your team from the start, you will secure commitment not only from the top but also from your team, because you won't discourage them by setting unreachable targets. Create lofty big data quality goals, but make sure your team has the tools to get there.

Consider the future

Keep fostering data responsibility. The technologies, products, sensors, and connected objects created in the coming years will make big data even more pervasive, and big data quality even more critical. The data they use needs to be trustworthy, so that each of your company's data professionals can feel committed to and responsible for delivering good-quality data that will help the business. This is a shared responsibility. Help data teams understand that quality data is an asset, and make sure they follow compliance training that makes them aware of your data's value as well as the consequences of data misuse.

In this big data era, standalone data quality tools won't cut it. You need solutions that work in real time across all lines of business and don't require data-engineer-level knowledge to use. Talend Data Fabric is a suite of apps covering data integration, preparation, and stewardship that enables business and IT to work together to create a single source of trusted data in the cloud, on premises, or in hybrid environments.

Ready to get started with Talend?