How to Turn Text into Data Using Talend
In this example, we have a set of XML files, each representing the response to a stock update (a fabricated example). Each XML file has one critical field: a text-based description of the result of the transaction. We need to analyse the transaction results, and so need the data in a structured format.
The interesting part of the process was applying information from one row down to later rows, using a “tJavaFlex” component to hold onto some “state” as the process runs.
In this example, we have a field with the data we need to parse, which is the field “result_summary”. You can see below that this field tells us what action was taken during a transaction.
- The “Product Category” section of the product had a change to the “Classification”
- The “Nutritional Information” section of the product had an addition and a removal of an ingredient, as well as a change to an ingredient description
- The primary keys are provided in angle brackets 
We need to get this data into a structured format; the key features are that we have separated out the action being taken, the field the action was taken against, the primary key (if available), and the “section” the change occurred in.
The end result looks like this, the important fields are highlighted in yellow.
The hardest thing conceptually to resolve is how we apply the “section” value to the relevant rows of data. In effect, we have to extract the section name from one row, and then apply it to subsequent rows (and identify when a new section has been entered).
We will break this XML text field into individual rows of data using a “tNormalize” component, splitting on an “end of line”…then we will process each line of data.
We have to somehow take these “section” values:
…and apply them to the lines below, to end up like this:
Conceptually we have to somehow “know” when we are entering a section, and then hold onto the section name so we can apply it to the next row(s).
For this, we have to use a tJavaFlex to remember the section we are processing.
In this example, the XML field has a lot of structure we can use. It has semi-colons we can use to break things up, and each set of changes is introduced by a section name. Each transaction record is prefixed with the type of change (such as “Remove”, “Add”), and primary keys are always provided in angle brackets.
So the high-level process is:
- Break the XML field (result_summary) into rows based on “end of line” character using “tNormalize”
- Filter out empty rows
- Filter out rows not in the format “Text: More Text”
- Detect when we enter a “Section” and hold on to the section name
- Split the fields using substring and split functions (using “:” and “[“)
- Remove the “Section: Section Name” row from the final output
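Outside of Talend, the high-level process above can be sketched in plain Java. This is a minimal sketch based on the formats described in this article; the class and method names are illustrative and not part of any Talend API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the whole row-by-row process, outside Talend.
public class SummaryParser {

    // Returns rows of {section, action, field, code}.
    public static List<String[]> parse(String resultSummary) {
        List<String[]> rows = new ArrayList<>();
        String section = "";                               // state held across rows
        for (String line : resultSummary.split("\n")) {    // the tNormalize step
            line = line.trim();
            if (!line.matches(".+:\\s*.+")) continue;      // the tFilterRow step
            if (line.startsWith("Section")) {
                section = line.split(":")[1].trim();       // remember the section
                continue;                                  // drop the Section row itself
            }
            String action = line.split(":")[0].trim();     // e.g. "Remove"
            String field  = line.split(":")[1].trim();     // e.g. "Ingredient [ 1817567 ]"
            String code   = "";
            if (field.contains("[")) {                     // extract the primary key
                code  = field.substring(field.indexOf("[") + 1, field.indexOf("]")).trim();
                field = field.substring(0, field.indexOf("[")).trim();
            }
            rows.add(new String[] { section, action, field, code });
        }
        return rows;
    }
}
```

Each of the steps below maps onto one part of this sketch, with the section variable carrying state from the “Section” row down to the transaction rows beneath it.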
Step 1: Iterate over the XML files
We delete the output file using a “tFileDelete” if it already exists and iterate over the XML files using a “tFileList” component:
Step 2: Read the XML file
We use a “tFileInputXML” component to read in the XML file for the current iteration. This is a simple XML schema and easy to map. The important field we need to parse data from is the “result_summary” field:
Step 3: Break the result_summary field into rows
We need the “result_summary” field to generate an individual row of data for each line of text. For this, we use a “tNormalize” component to break the field into separate rows based on the “end of line” character (“\n” in Java on Windows in this case).
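The effect of the tNormalize split can be sketched in plain Java (illustrative only; tNormalize does this for us inside the job):

```java
public class NormalizeSketch {
    // tNormalize with "\n" as the item separator behaves like a split:
    // one input field becomes one output row per line of text.
    public static String[] toRows(String resultSummary) {
        return resultSummary.split("\n");
    }
}
```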
So now we are generating multiple records in the downstream flow based on the number of lines in the “result_summary” field:
Step 4: Filter out unwanted rows
We only want rows in the form “Text : More Text”. This will keep hold of the section headings, and remove noise like “Product Changes:” (as it has a colon at the end, not in the middle of the text). We also get rid of empty rows.
For this we use a “tFilterRow” component, with some regex in the Match formula to only allow values with a “:” in the middle to flow through:
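A regex along the following lines does this filtering; the exact pattern below is an illustration of the idea, not necessarily the one entered in the job:

```java
public class RowFilter {
    // Keep only rows shaped like "Text : More Text": a colon with
    // non-empty text on both sides. Rows like "Product Changes:" fail
    // because nothing follows the colon; empty rows fail outright.
    public static boolean keep(String line) {
        return line != null && line.matches(".+:\\s*.+");
    }
}
```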
Step 5: Add a tJavaFlex and update the schema
The “tJavaFlex” will perform the bulk of the work. We add fields to the schema here for the individual fields we plan to extract/calculate (“section”, “action”, “field”, “code”), as well as a field for the XML file name:
Step 6: Define the variables in the Start section of tJavaFlex
These variables will hold the calculated field values we are trying to extract; this includes the section name variable, which will hold on to the current section being processed.
Note that the scope of these variables (sub job level) means we can calculate the section name while processing the “Section: Section Name” row, and retain it for when we are processing later rows.
Step 7: Grab the file name and clear the variables
The main code is executed for every input row. We clear the variables, apart from the “section” variable.
Note: We only want to change the section variable when we enter a new section.
Step 8: Work out if the row being processed is a Section row or not
If the value being processed starts with “Section”, we split the row on the “:” character and store the section name in the variable so it can be “remembered” and applied to subsequent rows. After we “split” to an array, the section name will be the second item in the array (index 1).
If the value doesn’t start with “Section”, it must be a value indicating a transaction, e.g. “Remove:Ingredient [ 1817567 ]”. We can therefore split this again using the “:” character, taking the action (e.g. “Remove”) and the field (e.g. “Ingredient [ 1817567 ]”) into the “action” and “field” variables:
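In plain Java, both branches of this step look something like the following. This is a sketch of the tJavaFlex main-code logic; the variable names are illustrative, and the static fields stand in for the sub-job-level variables declared in the Start section:

```java
public class RowSplitter {
    // Stand-ins for the variables declared in the tJavaFlex Start section.
    static String section = "";   // retained across rows
    static String action;         // cleared for every row
    static String field;          // cleared for every row

    public static void process(String line) {
        action = null;
        field = null;
        if (line.startsWith("Section")) {
            // "Section: Nutritional Information" → index 1 after the split
            section = line.split(":")[1].trim();
        } else {
            String[] parts = line.split(":");
            action = parts[0].trim();   // e.g. "Remove"
            field  = parts[1].trim();   // e.g. "Ingredient [ 1817567 ]"
        }
    }
}
```

Because `section` is not cleared between rows, the value captured from the “Section” row carries down to the transaction rows that follow it.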
Step 9: Extract any primary keys
Check if we have any “primary keys”, which will be inside square brackets, and extract the value into the “code” variable.
We use a combination of standard String methods (“contains”, “substring” and “indexOf”) to extract the code from any square brackets.
We may as well remove the primary key from the “field” value we have already extracted, i.e. “Ingredient [ 1817567 ]” becomes “Ingredient”. Since we are extracting the primary key into a separate “code” field, we don’t need it as part of the field name:
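The “contains”, “substring” and “indexOf” combination amounts to something like this (a sketch; the method name is illustrative):

```java
public class KeyExtractor {
    // Pull the primary key out of the square brackets, then strip the
    // bracketed part from the field name, e.g.
    // "Ingredient [ 1817567 ]" → field "Ingredient", code "1817567".
    // Returns {field, code}; code is empty when no brackets are present.
    public static String[] extract(String field) {
        String code = "";
        if (field.contains("[")) {
            code  = field.substring(field.indexOf("[") + 1, field.indexOf("]")).trim();
            field = field.substring(0, field.indexOf("[")).trim();
        }
        return new String[] { field, code };
    }
}
```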
Step 10: Write the calculated fields into the output flow
Step 11: Discard the “Section” line
We only needed the section line (“Section: Nutritional Information”) so we could grab the section name and apply it to other rows. We can discard this row now using a “tFilterRow” component.
We simply keep only the values which don’t start with “Section” (note the “!” at the start of the condition):
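The filter condition amounts to a negated startsWith check (a sketch of the idea; in the job it is entered as a tFilterRow condition):

```java
public class SectionRowFilter {
    // Keep a row only if it does NOT start with "Section" —
    // the "!" negates the startsWith check, discarding the
    // "Section: Section Name" rows we no longer need.
    public static boolean keep(String line) {
        return !line.startsWith("Section");
    }
}
```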
Last Step: Write the value out to a file
We have extracted the data out into a structured format, and have used “tJavaFlex” to detect state changes (entering a new “Section”) – allowing us to apply data across rows.
We could likely do this with a “tMemorizeRows” component somehow, but that’s for another time. I am guessing it wouldn’t be straightforward as we wouldn’t know in advance how many rows to memorize…
BigData Dimension is a leading provider of cloud and on-premise solutions for BigData Lake Analytics, Cloud Data Lake Analytics, Talend Custom Solution, Data Replication, Data Quality, Master Data Management (MDM), Business Analytics, and custom mobile, application, and web solutions. BigData Dimension equips organizations with cutting edge technology and analytics capabilities, all integrated by our market-leading professionals. Through our Data Analytics expertise, we enable our customers to see the right information to make the decisions they need to make on a daily basis. We excel in out-of-the-box thinking to answer your toughest business challenges.
You’ve already invested in a Talend project, or maybe you already have a Talend solution implemented, but you may not be utilizing the full power of the solution. To get the full value of the product, you need the solution implemented by industry experts.
At BigData Dimension, we have experience spanning over a decade integrating technologies around Data Analytics. As far as Talend goes, we’re one of the few best-of-breed Talend-focused systems integrators in the entire world. So when it comes to your Talend deployment and getting the most out of it, we’re here for you with unmatched expertise.
Our work covers many different industries including Healthcare, Travel, Education, Telecommunications, Retail, Finance, and Human Resources.
We offer flexible delivery models to meet your needs and budget, including onshore and offshore resources. We can deploy and scale our talented experts within two weeks.
- Full requirements analysis of your infrastructure
- Implementation, deployment, training, and ongoing services both cloud-based and/or on-premise
- BigData Management by Talend: Leverage Talend Big Data and its built-in extensions for NoSQL, Hadoop, and MapReduce. This can be done either on-premise or in the cloud to meet your requirements around Data Quality, Data Integration, and Data Mastery
- Cloud Integration and Data Replication: We specialize in integrating and replicating data into Redshift, Azure, Vertica, and other data warehousing technologies through customized revolutionary products and processes.
- ETL / Data Integration and Conversion: Ask us about our groundbreaking product for ETL-DW! Our experience and custom products we’ve built for ETL-DI through Talend will give you a new level of speed and scalability
- Data Quality by Talend: From mapping, profiling, and establishing data quality rules, we’ll help you get the right support mechanisms setup for your enterprise
- Integrate Your Applications: Talend Enterprise Service Bus can be leveraged for your enterprise’s data integration strategy, allowing you to tie together many different data-related technologies, and get them to all talk and work together
- Master Data Management by Talend: We provide end-to-end capabilities and experience to master your data through architecting and deploying Talend MDM. We tailor the deployment to drive the best result for your specific industry – Retail, Financial, Healthcare, Insurance, Technology, Travel, Telecommunications, and others
- Business Process Management: Our expertise in Talend Open Studio will lead the way for your organization’s overall BPM strategy
As a leading Systems Integrator with years of expertise integrating numerous IT technologies, we help you work smarter, not harder, and at a better Total Cost of Ownership. Our resources are based throughout the United States and around the world. We have subject matter expertise in numerous industries and in solving IT and business challenges.
We blend all types of data and transform it into meaningful insights by creating high performance Big Data Lakes, MDM, BI, Cloud, and Mobility Solutions.
CloudCDC is equipped with an intuitive, user-friendly interface. Within a couple of clicks, you can load, transfer, and replicate data to any platform without any hassle. There is no need to worry about code or scripts.
• Build Data Lake on AWS, Azure and Hadoop
• Continuous Real Time Data Sync.
• Click-to-replicate user interface.
• Automated Integration & Data Type Mapping.
• Automated Schema Build.
• Codeless Development Environment.
CONTACT THE EXPERTS AT BIGDATA DIMENSION FOR YOUR CLOUDCDC, TALEND, DATA ANALYTICS, AND BIG DATA NEEDS. CONTACT US TODAY TO LEARN MORE!