
How to Test Machine Learning Algorithms with K-Fold Cross Validation

Posted by BDD Talend Practice
Category: Machine Learning

Given a prediction scenario, the first question to ask is: which machine learning algorithm is appropriate? Taking the example of predicting a user’s activity from mobile phone accelerometer data, we need to classify each reading into a category (resting, walking, or running). As Talend leverages Spark MLlib out of the box, we evaluate some of the popular algorithms that fall under classification.

This classification exercise considers common algorithms such as Logistic Regression, Decision Tree, Random Forest, and Naïve Bayes. Logistic Regression is not a candidate, as it only supports binary (two-group) classification. Naïve Bayes can only work with non-negative feature values, so it is also ruled out because accelerometer data contains negative readings. This could be remedied by shifting the data so that every value is non-negative (for example, adding a constant offset to each reading). This leaves Decision Tree and Random Forest.
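
If we did want to keep Naïve Bayes in the running, a minimal sketch of that shift could look like the following. The class name, the offset value, and the assumption that readings stay above -20 m/s² are illustrative, not taken from the actual job.

public class ShiftForNaiveBayesSketch {

    // Assumed lower bound on accelerometer readings (illustrative only).
    private static final double OFFSET = 20.0;

    // Shift each reading by a constant so every feature Naïve Bayes sees is non-negative.
    static double[] shift(double aX, double aY, double aZ) {
        return new double[] { aX + OFFSET, aY + OFFSET, aZ + OFFSET };
    }

    public static void main(String[] args) {
        double[] shifted = shift(-4.1, 8.07, -16.36); // a "running" row from the sample data
        System.out.println(shifted[0] + ", " + shifted[1] + ", " + shifted[2]); // roughly 15.9, 28.07, 3.64
    }
}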

At this point, we take both algorithms and measure the accuracy of each model. If we evaluated each model against the same hand-classified training dataset it was built from, we would see an artificially high accuracy. But what would the accuracy look like on data the model has never seen before? To test the accuracy of both algorithms, we leverage a validation technique such as K-Fold Cross Validation.

Applying K-Fold Technique

In K-Fold Cross Validation, the training dataset is split into K equally sized folds; on each of K passes, one fold is held out as test data and the remaining folds are used for training. For example, if we have a training dataset with 450 events and we choose 10-Fold validation, the dataset is broken up into 10 folds:

Taking a training dataset with 450 events through 10-Fold Cross Validation produces, on each pass, a test dataset of 45 events and a training dataset of 405 events. This process is repeated K times (10 here), and the resulting accuracies are averaged to produce an overall, more realistic accuracy for the model being tested.

To help us determine which algorithm is the most accurate with our dataset, we can build out the validation technique graphically using Talend Studio, and apply it against the algorithms being tested.

Building K-Fold in Talend Studio

Leveraging the out-of-the-box machine learning algorithms, we will build a K-Fold Cross Validation job in Talend Studio and test it against a Decision Tree and a Random Forest. The training set used for this example can be downloaded from GitHub.

Before we can build the validation, we build a job to encode each model being tested. Each of these jobs consists of reading the training dataset, encoding the model vectors, and saving the model.

When each model is tested, every test dataset produces an accuracy value, which we store and output at the end of our K-Fold validation. After these jobs are built, we can begin work on the validation job.
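
Under the covers, the core of each create-and-test pair boils down to a couple of Spark MLlib calls. The sketch below is a hand-written approximation rather than the code Talend generates: the file paths, the semicolon delimiter, the label encoding, and the Random Forest parameters are all assumptions made for illustration.

import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;

public class FoldTrainAndTestSketch {

    // Parse one "aX;aY;aZ;label" line (delimiter and label encoding are assumptions).
    static LabeledPoint parse(String line) {
        String[] p = line.split(";");
        double label = "resting".equals(p[3]) ? 0.0 : "walking".equals(p[3]) ? 1.0 : 2.0;
        return new LabeledPoint(label,
                Vectors.dense(Double.parseDouble(p[0]), Double.parseDouble(p[1]), Double.parseDouble(p[2])));
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("kfold-train-test").setMaster("local[*]"));

        // Hypothetical per-fold files produced by the partitioning routine described below.
        JavaRDD<LabeledPoint> train = sc.textFile("fold_0_training.csv").map(FoldTrainAndTestSketch::parse);
        JavaRDD<LabeledPoint> test = sc.textFile("fold_0_test.csv").map(FoldTrainAndTestSketch::parse);

        // "Encode and save the model": train a Random Forest classifier on the training fold.
        RandomForestModel model = RandomForest.trainClassifier(
                train, 3, new HashMap<>(), 10, "auto", "gini", 5, 32, 12345);
        model.save(sc.sc(), "models/random_forest_fold_0");

        // "Test the model": accuracy for this fold = correctly classified test rows / total test rows.
        long correct = test.filter(lp -> model.predict(lp.features()) == lp.label()).count();
        System.out.println((double) correct / test.count());

        sc.stop();
    }
}

The Decision Tree job follows the same shape, swapping the RandomForest call for DecisionTree.trainClassifier with its own parameters.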

The overall validation job will have 3 main routines:

  1. Read the full training dataset and partition into a test dataset and training dataset for each K-Fold
  2. Pass the partitioned training dataset to create the model and pass the partitioned test dataset to test the model
  3. Calculate the accuracy for each K-Fold

Partitioning Datasets

For the first piece of the validation job, we need to read the full training dataset, capture the row count, and then configure some variables which will be used in processing the validation. The most obvious variable is the number of K-Folds to be processed. This is stored as a context parameter and is requested when the validation is run:

Based on the number of K-Folds, we will be able to calculate the following (a short sketch of the arithmetic follows the list):

  1. Row Number – used to filter the rows when creating the test bin and training data
  2. Fold Size – the size (number of rows) of each test bin, calculated from the total number of rows in the original training data and the number of parameterized folds
  3. K Value – the current iteration of the loop (0 to K−1), used to locate the test bin
  4. Bin Start – the row number where the current test dataset starts
  5. Bin End – the row number where the current test dataset ends
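
A minimal sketch of that arithmetic, using hypothetical variable names that mirror the console output shown below (the integer division is an assumption about how the job derives the fold size):

public class FoldBoundsSketch {
    public static void main(String[] args) {
        int totalRows = 450;               // rows in the full training dataset
        int kFolds = 10;                   // context parameter supplied when the job is run
        int foldSize = totalRows / kFolds; // 45 rows per test bin

        for (int kValue = 0; kValue < kFolds; kValue++) {
            int binStart = kValue * foldSize + 1; // first row of the current test bin
            int binEnd = binStart + foldSize - 1; // last row of the current test bin
            System.out.printf("loop %d: fold_size=%d bin_start=%d bin_end=%d%n",
                    kValue, foldSize, binStart, binEnd);
        }
    }
}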

For development purposes, the row numbers and variables can be sent to the console. When we specify 10 for K-Folds, we should see an output such as:

.----------+------+------+------+-------+-------+---------+---------+-------.
|                                   loop 0                                  |
|=---------+------+------+------+-------+-------+---------+---------+------=|
|row_number|aX    |aY    |aZ    |label  |k_value|fold_size|bin_start|bin_end|
|=---------+------+------+------+-------+-------+---------+---------+------=|
|1         |-4.1  |8.07  |-16.36|running|0      |45       |1        |45     |
|2         |-2.34 |9.69  |-0.33 |running|0      |45       |1        |45     |
|3         |0.0   |0.01  |-0.01 |resting|0      |45       |1        |45     |
|...       |...   |...   |...   |...    |...    |...      |...      |...    |
|450       |-0.01 |-0.02 |-0.07 |resting|0      |45       |1        |45     |
'----------+------+------+------+-------+-------+---------+---------+-------'
...
.----------+------+------+------+-------+-------+---------+---------+-------.
|                                   loop 9                                  |
|=---------+------+------+------+-------+-------+---------+---------+------=|
|row_number|aX    |aY    |aZ    |label  |k_value|fold_size|bin_start|bin_end|
|=---------+------+------+------+-------+-------+---------+---------+------=|
|1         |-4.1  |8.07  |-16.36|running|9      |45       |406      |450    |
|2         |-2.34 |9.69  |-0.33 |running|9      |45       |406      |450    |
|3         |0.0   |0.01  |-0.01 |resting|9      |45       |406      |450    |
|...       |...   |...   |...   |...    |...    |...      |...      |...    |
|450       |-0.01 |-0.02 |-0.07 |resting|9      |45       |406      |450    |
'----------+------+------+------+-------+-------+---------+---------+-------'

For each loop (starting at 0), we calculate the fold size and find the offset to create the test datasets and training datasets. Once we’ve validated the calculations, we can add a filter which separates the outputs:

The loop iterates K times, picking up the appropriate test or training dataset on each pass. This process is repeated for each model.
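
The filter itself reduces to a simple row-number check against the current bin. A sketch of that condition (not the exact expression configured in the Talend component):

public class FoldFilterSketch {

    // Rows inside the current bin form the test dataset; all other rows form the training dataset.
    static boolean isTestRow(int rowNumber, int binStart, int binEnd) {
        return rowNumber >= binStart && rowNumber <= binEnd;
    }

    public static void main(String[] args) {
        // Loop 9 from the console output above: the test bin covers rows 406-450.
        System.out.println(isTestRow(406, 406, 450)); // true  -> test dataset
        System.out.println(isTestRow(10, 406, 450));  // false -> training dataset
    }
}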

Calculate Accuracy of the Model

The last step is to aggregate the results from the validation. Again, the loop iterates K times, reading the output of each test to calculate the accuracy.

After running the validation job against a model, we receive an output of the accuracy of each K-Fold:

0|0.9777777777777777
1|0.9555555555555556
2|0.9333333333333333
3|0.9777777777777777
4|0.8888888888888888
5|0.9777777777777777
6|0.9777777777777777
7|0.9555555555555556
8|0.9777777777777777
9|0.9111111111111111
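
Averaging those per-fold values gives the overall accuracy reported for the model. A quick sketch of that final aggregation, using the numbers above:

public class AverageAccuracySketch {
    public static void main(String[] args) {
        // Per-fold accuracies captured by the validation job (values from the output above).
        double[] foldAccuracy = {
            0.9777777777777777, 0.9555555555555556, 0.9333333333333333, 0.9777777777777777,
            0.8888888888888888, 0.9777777777777777, 0.9777777777777777, 0.9555555555555556,
            0.9777777777777777, 0.9111111111111111
        };

        double sum = 0.0;
        for (double a : foldAccuracy) {
            sum += a;
        }
        // Overall accuracy is the mean of the K fold accuracies (about 0.953 here).
        System.out.println(sum / foldAccuracy.length);
    }
}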

All the Pieces

Putting it all together, the final K-Fold Cross Validation job will look like:

When a model is not being tested, its subjob can simply be deactivated. The first routine, which partitions the training data, can also be deactivated after it has run once. We also leveraged context parameters to pass information between this validation job and the create and test model jobs.

Conclusion

Machine Learning provides the ability to learn and make predictions on different types of data. In this example, we’ve taken two classification algorithms (Decision Tree & Random Forest) and used a K-Fold Cross Validation technique to determine which algorithm would have a higher accuracy for classifying the user activity based on accelerometer sensor data.

Leveraging Talend’s graphical design environment to build out the validation technique made choosing an appropriate algorithm simple. Using the provided training dataset, Random Forest had a slightly higher overall accuracy than Decision Tree under 10-Fold Cross Validation. In addition, each validation job ran in just under 3 minutes with Spark under the covers.

There were a few cases not handled in this validation job, such as row counts that do not divide evenly into K folds, or clean-up steps after each model has been tested. As with all development, refactoring could also be done to make the job more dynamic for other datasets. However, the goal was to leverage a graphical design environment and out-of-the-box machine learning algorithms to help us build and choose an appropriate model.
