
What are the Best Practices for Talend Job Designs?

Posted by BDD Talend Practice
Category: DesignPatterns

This blog explains best practices for Talend job designs.

Code Routines

On occasion, Talend components just don’t satisfy a particular programmatic need.  That’s OK; Talend is a Java code generator, right?  Sure it is, and there are even Java components you can place on your canvas to incorporate pure Java into the process and/or data flow.  But what happens if even that is not enough?  Let me introduce you to my little friend: Code Routines!  These are actual Java methods you can add to your project repository; essentially, user-defined Java functions that you code once and utilize in various places throughout your jobs.

Talend provides many Java functions you have probably already utilized, such as:

– getCurrentDate()

– sequence(String seqName, int startValue, int step)

– ISNULL(object variable)

There are many things you can do with code routines when you consider the big picture of your job, project, and use case.
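As a minimal sketch (the class and method names below are my own, not part of Talend), a code routine is simply a public Java class of public static methods saved in the ‘routines’ package; once created in the repository it can be called from a tMap expression, a tJava component, or anywhere else job code is written:

    package routines;

    /*
     * A hypothetical user-defined code routine.  After saving it in the project
     * repository, it can be called in job code, for example from a tMap
     * expression: StringHelper.safeTrim(row1.customerName)
     */
    public class StringHelper {

        // Trims a string, returning an empty string instead of throwing a
        // NullPointerException when the input is null.
        public static String safeTrim(String value) {
            return value == null ? "" : value.trim();
        }

        // Returns a default value when the input is null or blank.
        public static String defaultIfBlank(String value, String defaultValue) {
            return (value == null || value.trim().isEmpty()) ? defaultValue : value;
        }
    }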

 

Repository Schemas

The Metadata section of the project repository provides a valuable opportunity to create reusable objects, which is itself a significant development guideline (remember?).  Repository schemas present a powerful technique for creating reusable data structures:

 

– File Schemas – used for mapping a variety of flat file formats, including:

  • Delimited
  • Positional
  • Regex
  • XML
  • Excel
  • JSON

– Generic Schemas – used for mapping a variety of record structures

– WSDL Schemas – used for mapping Web Service method structures

– LDAP Schemas – used for mapping an LDAP structure (LDIF is also available)

– UN/EDIFACT – used for mapping a wide variety of EDI transaction structures

When you create a schema, you give it an object name, purpose, and description, plus a metadata object name that is referenced in job code.  By default this is called ‘metadata’; take the time to define a naming convention for these objects, or everything in your code will appear to have the same name.  Perhaps ‘md_{objectname}’ is sensible.  Take a look at the example.

Generic schemas are of particular importance, as this is where you create data structures that focus on particular needs.  Take as an example a DB connection (as seen in the same example) which has reverse-engineered table schemas from a physical database connection.  The ‘accounts’ table has 12 columns, yet a matching generic schema defined below has 16 columns.  The extra columns account for value-added elements of the ‘accounts’ table and are used in a job’s data flow to incorporate additional data.  Conversely, perhaps a database table has over 100 columns and a particular job data flow needs only ten of them.  A generic schema can be defined for those ten columns and used with a query against the table that selects the matching ten columns; a very useful capability.  My advice: use generic schemas a lot, except perhaps for single-column structures, which it makes sense to simply define as built-in.

Note that other connection types like SAP, Salesforce, NoSQL, and Hadoop clusters all have the ability to contain schema definitions too.

Log4J

Apache Log4j support has been available since Talend v6.0.1 and provides a robust Java logging framework.  All Talend components now fully support Log4j services, enhancing the error-handling methodology.

To utilize Log4j it must be enabled; do this in the project properties section.  There, you can also adapt your team’s logging guidelines to provide a consistent messaging paradigm for the Console (stderr/stdout) and LogStash appenders.  Having this single location to define these appenders provides a simple way to incorporate Log4j functionality into Talend jobs.  Notice that the level values incorporated in the Log4j syntax match up with the already familiar priorities of INFO/WARN/ERROR/FATAL.
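Once enabled, logging calls can be made from a tJava or tJavaRow component using the standard Log4j 1.x API.  The snippet below is only a sketch; the logger name and messages are assumptions to align with your own logging guidelines, and the fully qualified class name is used so no separate import is needed:

    // Minimal sketch of Log4j calls inside a tJava component (hypothetical
    // logger name and messages; adapt them to your team's logging guidelines).
    org.apache.log4j.Logger logger =
        org.apache.log4j.Logger.getLogger("talend.job.sample_job");

    logger.info("Subjob started: loading accounts extract");
    logger.warn("Row rejected: missing account id; default substituted");
    logger.error("Lookup failed for account id 12345");
    // FATAL is typically reserved for conditions that stop the job (often paired with tDie).
    logger.fatal("Unable to reach the target database; aborting job");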

In the Talend Administration Center (TAC), when you create a task to run a job, you can choose the priority level at which Log4j will log.  Ensure that you set this appropriately for the DEV/TEST and PROD environments.  The best practice is to set DEV/TEST to the INFO level, UAT to WARN, and PROD to ERROR; messages at that level and above will be included as well.

Working together with tWarn and tDie components and the new Log Server, Log4J can really enhance the monitoring and tracking of job executions.  Use this feature and establish a development guideline for your team.

Activity Monitoring Console (AMC)

Talend provides an integrated add-on tool for enhanced monitoring of job execution, which consolidates collected activity and detailed processing information into a database.  A graphical interface is included, accessible from the Studio and the TAC.  This facility helps developers and administrators understand component and job interactions, prevent unexpected faults, and support important system-management decisions.  Note that the AMC database and web app are an optional feature and must be installed.

The AMC database contains three tables, which are populated by the following components:

– tLogCatcher – captures data sent from Java exceptions or the tWarn/tDie components

– tStatCatcher – captures data sent when the tStatCatcher Statistics check box is selected on individual components

– tFlowMeterCatcher – captures data sent from the tFlowMeter component

These tables store the data behind the AMC UI, which provides a robust visualization of a job’s activity.  Make sure to choose the proper log priority settings on the project preferences tab, and consider carefully any data restrictions placed on job executions for each environment: DEV/TEST/PROD.  Use the Main Chart view to help identify and analyze bottlenecks in the job design before pushing a release into PROD environments.  Review the Error Report view to analyze the proportion of errors occurring over a specified timeframe.

While quite useful, the AMC UI is not the only consumer of these tables.  Since they are indeed tables in a database, SQL queries can be written to pull valuable information from them externally.  Combined with scripting tools, it is possible to craft automated queries and notifications when certain conditions occur and are logged in the AMC database.
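For example, a small external utility could query the AMC database directly.  The sketch below is only an illustration of the idea: it assumes a MySQL-hosted AMC database with a statistics table named ‘stats_info’ containing ‘job’, ‘moment’, and ‘duration’ columns, so adjust the connection details, table, and column names to match your own AMC configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Hypothetical example: report AMC-logged job runs that took longer than ten
    // minutes.  Table, column, and connection names are assumptions, and the
    // MySQL JDBC driver must be on the classpath.
    public class AmcLongRunningJobs {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://amc-host:3306/amc";
            try (Connection con = DriverManager.getConnection(url, "amc_user", "secret");
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT job, moment, duration FROM stats_info "
                       + "WHERE duration > ? ORDER BY moment DESC")) {
                ps.setLong(1, 600000L); // threshold in milliseconds
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s ran %d ms at %s%n",
                                rs.getString("job"), rs.getLong("duration"),
                                rs.getTimestamp("moment"));
                    }
                }
            }
        }
    }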

Recovery Checkpoints

So you have a long-running job?  Perhaps it involves several critical steps, and if any particular step fails, starting over can become very problematic.  It would certainly be nice to minimize the effort and time needed to restart the job at a specified point in the flow, just before the error occurred.  Well, the TAC provides a specialized execution-restoration facility for when a job encounters errors.  Placed strategically and with forethought, jobs designed with these ‘recovery checkpoints’ can pick up execution without starting over and continue processing.

When a failure occurs, use the TAC ‘Error Recovery Management’ tab to determine the error; from there you can relaunch the job to continue processing.  Great stuff, right?

Joblets

Joblets are reusable job code that can be ‘included’ in one job or many jobs as needed; but what are they, really?  In fact, there are not many use cases for Joblets; however, when you find one, use it, as it is likely a gem.  There are different ways you can construct and use Joblets.  Let’s take a look, shall we?

When you create a new Joblet, Input/Output components are automatically added to your canvas.  This jumpstart allows you to assign the schemas coming in from, and going out to, the job workflow utilizing the Joblet.  This typical use of Joblets provides for the passing of data through the Joblet; what you do inside it is up to you.  In the following example, a row is passed in, a database table is updated, the row is logged to stdout, and then the same row is passed out unchanged (in this case).

 

Component Test Cases

Well, if you are still using a release of Talend prior to v6.0.1, then you can ignore this.  LOL, or simply upgrade!  One of my favorite new features is the ability to create test cases in a job.  Now, these are not exactly ‘unit tests’; rather, they are component tests: actual jobs tied into the parent job, and specifically to the component being tested.  Not all components support test cases, but where a component takes a data flow in and pushes one out, a test case is possible.

To create a component test case, simply right-click the selected component and find the menu option at the bottom, ‘create test case’.  After selecting this option, a new job is generated and opens up, presenting a functional template for the test case.  The component under test is there, along with built-in INPUT and OUTPUT components wrapped up by a data flow that simply reads an ‘Input File’, processes its data, and passes the records into the component under test, which then does what it does and writes the result out to a new ‘Result File’.  Once completed, that file is compared with an expected result, or ‘Reference File’.  It either matches or it does not: pass or fail!  Simple, right?

Well let’s take a look, shall we?

The job below has a tJavaFlex component that manipulates the data flow, passing it downstream for further processing.

A test case job has been created, which looks like this: no modifications are required (but I did clean up the canvas a bit).
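Conceptually, the final step of the generated test case is just a comparison of the ‘Result File’ against the ‘Reference File’.  The plain-Java snippet below illustrates only that idea; it is not the code Talend generates, and the file names are hypothetical:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class CompareToReference {
        public static void main(String[] args) throws Exception {
            // Hypothetical file names; a real test case instance manages these for you.
            List<String> result = Files.readAllLines(Paths.get("result_file.csv"));
            List<String> reference = Files.readAllLines(Paths.get("reference_file.csv"));

            // The test passes only when every line of the result matches the reference.
            boolean pass = result.equals(reference);
            System.out.println(pass ? "PASS" : "FAIL");
        }
    }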

It is important to know that while you can modify the test case job code, changes to the component under test should only be made in the parent job.  Say, for instance, the schema needs to be changed: change it in the parent job.

Note that once a test case ‘instance’ is created, multiple ‘input’ and ‘reference’ files can be created to run through the same test case job.  This enables testing of good, bad, small, large, and/or specialized test data.  The recommendation here is to evaluate carefully not only what to test but also what test data to use.

Finally, when the Nexus Artifact Repository is utilized, and test case jobs are stored there along with their parent job, it is possible to use tools like Jenkins to automate the execution of these tests, and thus the determination of whether a job is ready to promote into the next environment.

Data Flow ‘Iterations’

If you have done any Talend code development, you have surely noticed that you link components together with either a ‘trigger’ process link or a ‘row’ data flow connector.  By right-clicking on the starting component and connecting the link ‘line’ to the next component, you establish this linkage.  Process flow links are either ‘OnSubJobOK/ERROR’, ‘OnComponentOK/ERROR’, or ‘RunIf’, and we covered these in my previous blog.  Data flow links, when connected, are dynamically named ‘row{x}’, where ‘x’ is a number assigned by Talend to create a unique object/row name.  These data flow links can of course be given custom names (a naming-convention best practice), and establishing this link essentially maps the data schema from one component to the other, representing the ‘pipeline’ through which data is passed.  At runtime, the data passed over this linkage is often referred to as a dataset.  Depending upon downstream components, the complete dataset is processed end-to-end within the encapsulated subjob.

Not all dataset processing can be done all at once like this; it is sometimes necessary to control the data flow directly.  This is done through ‘row-by-row’ processing, or ‘iterations’.  Review the following (admittedly nonsensical) example:

Notice the components tIterateToFlow and tFlowToIterate.  These specialized components allow you to place control over data flow processing by allowing datasets to be iterated over, row by row.  This ‘list-based’ processing can be quite useful when needed.  Be careful, however: in many cases, once you break a data flow into row-by-row iterations, you may have to re-collect it back into a full dataset before processing can continue (like the tMap shown).  This is because some components require a ‘row’ dataset flow and are unable to handle an ‘iterative’ flow.  Note also that t{DB}Input components offer both a ‘main’ and an ‘iterate’ data flow option on the row menu.
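Conceptually, the difference between a ‘row’ flow and an ‘iterate’ link is the difference between handing a downstream step the whole dataset at once versus invoking the downstream steps once per row.  The plain-Java sketch below is only an illustration of that idea (the class and method names are mine, not Talend-generated code):

    import java.util.Arrays;
    import java.util.List;

    public class FlowVsIterate {
        public static void main(String[] args) {
            List<String> dataset = Arrays.asList("row1", "row2", "row3");

            // 'row' (main) flow: the downstream step receives the complete dataset,
            // the way a component such as tMap expects a full data flow.
            processAsFlow(dataset);

            // 'iterate' link: the downstream steps run once per row, the way
            // tFlowToIterate drives one execution of the following subjob per record.
            for (String row : dataset) {
                processSingleRow(row);
            }
        }

        static void processAsFlow(List<String> rows) {
            System.out.println("Processing " + rows.size() + " rows as one dataset");
        }

        static void processSingleRow(String row) {
            System.out.println("Iteration for " + row);
        }
    }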

 

ABOUT BIG DATA DIMENSION

BigData Dimension is a leading provider of cloud and on-premise solutions for BigData Lake Analytics, Cloud Data Lake Analytics, Talend Custom Solutions, Data Replication, Data Quality, Master Data Management (MDM), Business Analytics, and custom mobile, application, and web solutions. BigData Dimension equips organizations with cutting-edge technology and analytics capabilities, all integrated by our market-leading professionals. Through our Data Analytics expertise, we enable our customers to see the right information to make the decisions they need to make on a daily basis. We excel in out-of-the-box thinking to answer your toughest business challenges.

Talend Unified Solution

You’ve already invested in a Talend project, or maybe you already have a Talend solution implemented but may not be utilizing its full power. To get the full value of the product, you need the solution implemented by industry experts.

At BigData Dimension, we have experience spanning over a decade integrating technologies around Data Analytics. As far as Talend goes, we’re one of the few best-of-breed Talend-focused systems integrators in the entire world. So when it comes to your Talend deployment and getting the most out of it, we’re here for you with unmatched expertise.

Our work covers many different industries including Healthcare, Travel, Education, Telecommunications, Retail, Finance, and Human Resources.

We offer flexible delivery models to meet your needs and budget, including onshore and offshore resources. We can deploy and scale our talented experts within two weeks.

GETTING STARTED

  • Full requirements analysis of your infrastructure
  • Implementation, deployment, training, and ongoing services both cloud-based and/or on-premise

MEETING YOUR VARIOUS NEEDS

    • BigData Management by Talend: Leverage Talend Big Data and its built-in extensions for NoSQL, Hadoop, and MapReduce. This can be done either on-premise or in the cloud to meet your requirements around Data Quality, Data Integration, and Data Mastery
    • Cloud Integration and Data Replication: We specialize in integrating and replicating data into Redshift, Azure, Vertica, and other data warehousing technologies through customized revolutionary products and processes.
    • ETL / Data Integration and Conversion: Ask us about our groundbreaking product for ETL-DW! Our experience and custom products we’ve built for ETL-DI through Talend will give you a new level of speed and scalability
    • Data Quality by Talend: From mapping, profiling, and establishing data quality rules, we’ll help you get the right support mechanisms setup for your enterprise
    • Integrate Your Applications: Talend Enterprise Service Bus can be leveraged for your enterprise’s data integration strategy, allowing you to tie together many different data-related technologies, and get them to all talk and work together
    • Master Data Management by Talend: We provide end-to-end capabilities and experience to master your data through architecting and deploying Talend MDM. We tailor the deployment to drive the best result for your specific industry – Retail, Financial, Healthcare, Insurance, Technology, Travel, Telecommunications, and others
    • Business Process Management: Our expertise in Talend Open Studio will lead the way for your organization’s overall BPM strategy

WHAT WE DO

As a leading systems integrator with years of expertise integrating numerous IT technologies, we help you work smarter, not harder, and at a better total cost of ownership. Our resources are based throughout the United States and around the world. We have subject matter expertise across numerous industries and in solving IT and business challenges.

We blend all types of data and transform it into meaningful insights by creating high performance Big Data Lakes, MDM, BI, Cloud, and Mobility Solutions.


OUR CLOUD DATA LAKE SOLUTION

CloudCDC Data Replication

CloudCDC is equipped with an intuitive and user-friendly interface. Within a couple of clicks, you can load, transfer, and replicate data to any platform without hassle, and without worrying about code or scripts.

FEATURES

• Build a Data Lake on AWS, Azure, and Hadoop

• Continuous Real-Time Data Sync

• Click-to-Replicate User Interface

• Automated Integration & Data Type Mapping

• Automated Schema Build

• Codeless Development Environment

OUR SOLUTION ENHANCES DATA MANAGEMENT ACROSS INDUSTRIES


CONTACT THE EXPERTS AT BIGDATA DIMENSION FOR YOUR CLOUDCDC, TALEND, DATA ANALYTICS, AND BIG DATA NEEDS. CONTACT US TODAY TO LEARN MORE!
