How to Achieve Data Quality using Talend
Data Quality (DQ) is an art form. If I were to define two simple rules for a project involving some element of DQ, they would be:
- Don’t underestimate the time and effort required to get it right
- Talk to the people who own and use the data and get continuous feedback!
In many cases the DQ role on a project is a full-time role. It is a continuous process of rule refinement, results analysis and discussion with the users and owners of the data. Typically the DI (Data Integration) developers will build a framework into which DQ rules can be inserted, adjusted and tested constantly without the need to rewrite the rest of the DI processes each time.
This blog focuses on the process of matching data, but many of the principles can also be used in other DQ tasks.
First of all, let's understand what we are trying to achieve – why might we want to match?
– To find the same ‘entity’ (be it product, customer, etc.) in different source systems
– To de-duplicate data within a single system – or at least identify duplicates within a system for a Data Steward to be able to take some sort of action
– As part of a process of building a single version of the truth, possibly as part of an MDM (Master Data Management) initiative
– As part of a data-entry process to avoid the creation of duplicate data
To help us with these tasks I propose a simple DQ methodology:
As you can see, this is an iterative process and, as I said earlier, you are unlikely to find that just one ‘round’ of matching achieves the desired results.
Let’s look at each step on the diagram:
Before we can attempt to match, we must first understand our data. We employ two different strategies to achieve this:
- Business Analysis
– Reading relevant documentation
– Talking to stakeholders, end users, system administrators, data stewards, etc.
– Source system demonstrations
– Discussing the change in data over time
- Technical Profiling
– Using Talend tools to test assumptions and explore the data
– Analysing the actual change in data over time
Both strategies must be employed to ensure success. One thing I have found is that end users of systems are constantly finding ways to ‘bend’ systems to do things that the business needs, but that the system perhaps wasn’t designed to do. A simple example would be a system that doesn’t include a tick box letting call-centre operators know that a customer does not want to be contacted by phone. You may find that another field which allows free text has been co-opted for this purpose, e.g.
Adam Pemble ****DO NOT PHONE****
This is why we cannot rely on the system documentation alone. A combination of consulting the users and technical profiling would help us identify this unexpected use of a field.
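A cheap way to surface this kind of co-opted field during technical profiling is to scan free-text values for flag-like patterns. A minimal Python sketch (in practice this would be a Talend profiling analysis or DI job; the sample values and the regex are illustrative assumptions):

```python
import re

# Hypothetical sample of a free-text "name" field, as might be pulled
# from a source system during profiling (values are illustrative).
names = [
    "Adam Pemble ****DO NOT PHONE****",
    "Jane Smith",
    "Bob Jones **NO MAIL**",
]

# Flag values containing runs of asterisks or shouted keywords --
# a simple heuristic for spotting fields co-opted to carry flags.
flag_pattern = re.compile(r"\*{2,}|\b(DO NOT|NO)\s+(PHONE|MAIL|CONTACT)\b")

suspicious = [n for n in names if flag_pattern.search(n)]
print(suspicious)
```

Any hits would then go to the business users for confirmation of what the convention actually means.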
Typically for this step I would start by listing every assumption and query about the data – you should have at least one item for every field of every table and file, plus row-level and table-level rules. Next, design Talend Profiler analyses and/or Data Integration Jobs to test those assumptions. The results are then discussed with the business users and owners of the data. The reports produced by the DQ profiler can be a great way to share information with these business users, especially if they are not very technical. DI can also produce results in formats familiar to business users, e.g. spreadsheets.
Specific to the task of matching, some examples of assumptions we may wish to test:
- “In source system A, every Customer has at least one address and every address has at least one associated customer”
- “Every product should have a colour, perhaps we can use that in our matching rules? The colour field is free text in the source system.”
- “Source systems A and B both have an email address field – can we match on that?”
- “Source system X contains a lot of duplicates in the Customer table”
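The first assumption above, for instance, is a referential-completeness check. A rough Python sketch of the logic (a real test would be a Profiler analysis or a DI job reading source system A; the ids below are illustrative):

```python
# Illustrative stand-ins for profiled source tables.
customers = {1, 2, 3, 4}                 # customer ids
addresses = {(10, 1), (11, 2), (12, 2)}  # (address_id, customer_id)

customers_with_address = {cust for (_, cust) in addresses}
orphan_customers = customers - customers_with_address        # no address
orphan_addresses = {a for (a, c) in addresses if c not in customers}

print(sorted(orphan_customers))  # customers violating the assumption
```

A non-empty result disproves the assumption and becomes a talking point with the data owners.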
It is also important to analyse the lineage of each piece of data. For example, say we had an email address field. We may profile it and discover that it contains 100% correctly formatted email addresses. Is this because the front-end system is enforcing this, or is it just by chance? If it is the latter, our DI jobs may need to be written to cope with the possibility of an incorrect or missing email, even though none currently exist.
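A format-validity profile for that email field might look like the following sketch (the regex is a deliberately simplistic well-formedness check, not a full RFC 5322 validator, and the sample values are invented):

```python
import re

# Simplistic well-formedness check -- not a full RFC 5322 validator.
email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

emails = ["a.pemble@example.com", "no-at-sign.example.com", "jo@site.org"]

valid = sum(1 for e in emails if email_re.match(e))
pct_valid = 100 * valid / len(emails)
print(f"{pct_valid:.0f}% well-formed")
```

Even a 100% result still leaves the lineage question open: is the format enforced upstream, or merely true today?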
Note: I may write a future blog going into more detail about the importance of analysis before beginning to write any process logic.
Whilst performing the Analysis stage, it is likely that we will notice things about our data that will have an impact on our ability to match records. For example, we might profile a ‘colour’ column for a product and find results similar to those shown below:
What do we notice?
– Blk is an abbreviation of Black
– Chr is an abbreviation of Chrome
– Aqua MAY be a synonym of Blue
– Blu is a typo of Blue
If we were to do an exact match based on colour, some matches could be missed. Fuzzy matching could introduce false positives (more on this later).
To improve our matching accuracy, we need to standardise our data BEFORE attempting to match. Talend allows you to apply a number of different standardisation techniques including:
– Synonym indexes
– Reference data lookups
– Phone Number Standardisation
– Address Validation tools
– Other external validation / enrichment tools
– Grammar-based parsers
Let’s look at each of these in turn:
Our scenario with colours would be a classic use case for a synonym index. Simply put, a synonym index is a list of ‘words’ (i.e. our master values) and synonyms (related terms that we would like to standardise, or convert, to our master value). For example:
The advantages of using Lucene are that it is fast, it is an open standard, and we can leverage Lucene’s fuzzy search capabilities, so we can in some cases cater for synonyms that we can’t predict at design time (e.g. typos).
These jobs are in the Talend demo DQ project if you want to play with them. The indexes can also be utilised in the tStandardizeRow component, which we will discuss shortly.
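Conceptually, a synonym index is just a dictionary of master values plus a fuzzy lookup. A rough Python sketch of the idea, with `difflib` standing in for Lucene’s fuzzy search and the colour values taken from the profiling example above (treating Aqua as a Blue synonym is a business decision, not a given):

```python
import difflib

# A tiny synonym index: master value -> synonyms. In Talend this would
# be a Lucene index; here a dict plus difflib stands in for the fuzzy part.
index = {
    "Black": ["Blk"],
    "Chrome": ["Chr"],
    "Blue": ["Blu", "Aqua"],
}

# Invert to synonym -> master for direct lookup.
lookup = {s.lower(): master for master, syns in index.items() for s in syns}
lookup.update({m.lower(): m for m in index})

def standardise(value: str) -> str:
    v = value.lower()
    if v in lookup:                          # exact synonym hit
        return lookup[v]
    close = difflib.get_close_matches(v, lookup.keys(), n=1, cutoff=0.8)
    return lookup[close[0]] if close else value   # fuzzy hit, or pass through

print([standardise(c) for c in ["Blk", "blue", "Bleu", "Mauve"]])
```

Note how the fuzzy fallback catches the unseen typo “Bleu”, while “Mauve” (no plausible master) passes through untouched for a data steward to review.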
Reference data lookups
A classic DI lookup to a database table or other source of reference data e.g. an MDM system. Typically used as a join in a tMap.
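In plain code terms, a tMap-style reference lookup is a keyed join of the main flow against the reference source. A small sketch (the reference values and row fields are illustrative; a real job would read both from the database or MDM system):

```python
# Reference data as it might come from an MDM system or lookup table.
country_ref = {"UK": "United Kingdom", "FR": "France", "DE": "Germany"}

rows = [{"name": "Acme", "country": "UK"},
        {"name": "Bravo", "country": "XX"}]   # XX has no reference entry

# Left-join each main-flow row against the reference, flagging misses
# so unmatched rows can be rejected or routed to a steward.
enriched = [
    {**row, "country_name": country_ref.get(row["country"], "UNKNOWN")}
    for row in rows
]
print(enriched)
```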
Phone Number Standardisation
There is a handy component in the Talend DQ palette that you should know about: tStandardizePhoneNumber.
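To see why this matters for matching, consider how many ways the same number can be written. A deliberately naive Python sketch of the idea, assuming UK numbers only (tStandardizePhoneNumber does far more: region rules, validity checks, multiple output formats):

```python
import re

def standardise_uk_phone(raw: str) -> str:
    """Naive sketch: strip punctuation and convert a UK number to E.164.
    This is an illustration, not a substitute for the real component."""
    digits = re.sub(r"[^\d+]", "", raw)   # keep only digits and a leading +
    if digits.startswith("+44"):
        return digits
    if digits.startswith("0"):
        return "+44" + digits[1:]         # replace trunk prefix with country code
    return digits                         # pass through anything we can't interpret

print(standardise_uk_phone("(020) 7946 0018"))
```

Once “(020) 7946 0018” and “+44 20 7946 0018” both standardise to the same value, an exact match on phone number becomes viable.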
Address Validation tools
Talend provides components with our DQ offerings that allow you to interface with an external address validation tool such as Experian QAS, Loqate or MelissaData. Why would you want to do this? Well, if you are pulling address data from different systems, the likelihood is that the address data will be held in differing formats, e.g. Address1, Address2, etc. vs Building, Street, Locality, etc. These systems may have different rules governing the input of addresses – from no rules at all (free text) to validating against an external source. Even if two addresses are held in the same structure, there is no guarantee that the individual ‘tokens’ of the address will be in the same place or have the same level of completeness. This is where address validation tools come in. They take in an address in its raw form from a source and then, using sophisticated algorithms, standardise and match the address against an external reference source such as a PAF (Postcode Address File) – a file from the postal organisation of a country that contains all of its addresses, updated on a regular basis. The result is returned in a well-structured and, most importantly, consistent format, with additional information such as geospatial data and match scores. Take the example below:
Two addresses that are quite different from each other to a computer; however, to a human, we can see that they are the same address. Running the addresses through an address validation tool (in this case Loqate) we get the same, standardised address as an output. Now our matching problem is much simpler.
You might ask – can I not build something like this with Talend rather than buy another tool? I was once part of a project (not with Talend; it was a different tool) where this was attempted on quite poor-quality addresses. The issue is that addresses were designed to allow a human to deliver a letter to a given place, and there is a great deal of variation in how addresses can be represented. Six months of consultancy later, we had something that worked in most cases – but of course it was then realised that it would have been cheaper to buy a tool. Why is address validation not built into Talend, you might ask? There are a number of reasons:
– Not all customers require Address Validation – it would make the product more expensive
– Those customers that do may already be using one of the major vendors, and don’t want a different Talend proprietary system
– Different tools on the market suit different needs – e.g. MelissaData is centred on the US
– Why should Talend re-invent the wheel, when we could just allow you to utilise existing ‘best of breed’ solutions?
Other external validation / enrichment tools
There are many tools available on the market, most of which are easy to interface with from Talend (typically via a web service or API). For example, Dun & Bradstreet is a popular source of company data, and Experian provides credit information on individuals. All of this data could be useful to an organisation in general, and also potentially useful in matching processes.
Sometimes we will come across a data field that has been entered as free text, but could contain multiple bits of useful information that we could use to match, if only we could extract it consistently. Take for example a field that holds a product description:
34-9923 Monolithic Membrane 4′ x 8′ 26 lbs
4′ x 8′ Monolithic Membrane 26lbs Green
Now again, as a human, we can see that there is a high likelihood that these two descriptions refer to the same product. What we need to do is identify all of the different ‘tokens’ – product code, name, dimensions, weight and colour – and create a set of rules to ‘parse’ out these tokens, no matter the order or the variations in representation (e.g. the space in 26 lbs but not in 26lbs). Essentially, what we are defining is a set of simple rules for a language, or ‘grammar’. Talend includes a variety of parsing components which can help you, from simple regular expressions through to tStandardizeRow, which lets you construct an ANTLR grammar:
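To make the token idea concrete, here is a rough regex-based sketch in Python of parsing the two descriptions above. The token patterns are assumptions invented for this example; a real tStandardizeRow job would express the grammar as ANTLR rules rather than regexes:

```python
import re

# Token patterns for a hypothetical product-description "grammar".
PATTERNS = {
    "code":       r"\b\d{2}-\d{4}\b",
    "dimensions": r"\d+['′]\s*x\s*\d+['′]",
    "weight":     r"\d+\s*lbs\b",
    "colour":     r"\b(Green|Black|Blue|Chrome)\b",
}

def parse(description: str) -> dict:
    """Pull each recognised token out of the text; whatever remains is the name."""
    tokens = {}
    rest = description
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, rest, flags=re.IGNORECASE)
        if m:
            tokens[name] = m.group(0)
            rest = rest.replace(m.group(0), " ", 1)   # consume the token
    tokens["name"] = " ".join(rest.split())           # leftover text
    return tokens

print(parse("34-9923 Monolithic Membrane 4′ x 8′ 26 lbs"))
print(parse("4′ x 8′ Monolithic Membrane 26lbs Green"))
```

Both descriptions now yield the same name, dimensions and weight, so a field-by-field match becomes possible even though the raw strings look quite different.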
A warning though: this is a hard task for even experienced professionals to get right. We are looking to include some additional intelligence in the tool to help you with building these sorts of rules in the future.
Next time: This blog continues with part 2: matching and survivorship / stewardship
BigData Dimension is a leading provider of cloud and on-premise solutions for BigData Lake Analytics, Cloud Data Lake Analytics, Talend Custom Solutions, Data Replication, Data Quality, Master Data Management (MDM), Business Analytics, and custom mobile, application, and web solutions. BigData Dimension equips organizations with cutting-edge technology and analytics capabilities, all integrated by our market-leading professionals. Through our Data Analytics expertise, we enable our customers to see the right information to make the decisions they need to make on a daily basis. We excel in out-of-the-box thinking to answer your toughest business challenges.
You’ve already invested in a Talend project, or maybe you already have a Talend solution implemented but aren’t utilizing its full power. To get the full value of the product, you need the solution implemented by industry experts.
At BigData Dimension, we have experience spanning over a decade integrating technologies around Data Analytics. As far as Talend goes, we’re one of the few best-of-breed Talend-focused systems integrators in the entire world. So when it comes to your Talend deployment and getting the most out of it, we’re here for you with unmatched expertise.
Our work covers many different industries including Healthcare, Travel, Education, Telecommunications, Retail, Finance, and Human Resources.
We offer flexible delivery models to meet your needs and budget, including onshore and offshore resources. We can deploy and scale our talented experts within two weeks.
- Full requirements analysis of your infrastructure
- Implementation, deployment, training, and ongoing services both cloud-based and/or on-premise
- BigData Management by Talend: Leverage Talend Big Data and its built-in extensions for NoSQL, Hadoop, and MapReduce. This can be done either on-premise or in the cloud to meet your requirements around Data Quality, Data Integration, and Data Mastery
- Cloud Integration and Data Replication: We specialize in integrating and replicating data into Redshift, Azure, Vertica, and other data warehousing technologies through customized revolutionary products and processes.
- ETL / Data Integration and Conversion: Ask us about our groundbreaking product for ETL-DW! Our experience and custom products we’ve built for ETL-DI through Talend will give you a new level of speed and scalability
- Data Quality by Talend: From mapping, profiling, and establishing data quality rules, we’ll help you get the right support mechanisms setup for your enterprise
- Integrate Your Applications: Talend Enterprise Service Bus can be leveraged for your enterprise’s data integration strategy, allowing you to tie together many different data-related technologies, and get them to all talk and work together
- Master Data Management by Talend: We provide end-to-end capabilities and experience to master your data through architecting and deploying Talend MDM. We tailor the deployment to drive the best result for your specific industry – Retail, Financial, Healthcare, Insurance, Technology, Travel, Telecommunications, and others
- Business Process Management: Our expertise in Talend Open Studio will lead the way for your organization’s overall BPM strategy
As a leading Systems Integrator with years of expertise integrating the latest IT technologies, we help you work smarter, not harder, and at a better Total Cost of Ownership. Our resources are based throughout the United States and around the world. We have subject matter expertise across numerous industries, solving both IT and business challenges.
We blend all types of data and transform it into meaningful insights by creating high performance Big Data Lakes, MDM, BI, Cloud, and Mobility Solutions.
CloudCDC is equipped with an intuitive and user-friendly interface. Within a couple of clicks, you can load, transfer, and replicate data to any platform without any hassle. Don’t worry about code or scripts.
• Build Data Lakes on AWS, Azure and Hadoop
• Continuous Real-Time Data Sync
• Click-to-Replicate User Interface
• Automated Integration & Data Type Mapping
• Automated Schema Build
• Codeless Development Environment
CONTACT THE EXPERTS AT BIGDATA DIMENSION FOR YOUR CLOUDCDC, TALEND, DATA ANALYTICS, AND BIG DATA NEEDS. CONTACT US TODAY TO LEARN MORE!