Pentaho Data Integration 4 Cookbook
Over 70 recipes to solve ETL problems using Pentaho Kettle
Adrián Sergio Pulvirenti



Pentaho Data Integration (PDI, also called Kettle) is one of the leading open source data integration tools, broadly used for all kinds of data manipulation. It is a powerful extract, transform, and load (ETL) solution.

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and the development of BI solutions. I'd like to thank my lovely kids Camila and Nicolas, who understood that I couldn't share with them the usual videogame sessions during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.

She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently she works for Webdetails, one of the main Pentaho contributors.

Getting data from a database by running a query built at runtime

When you work with databases, most of the time you start by writing an SQL statement that gets the data you need. However, there are situations in which you don't know that statement exactly. Maybe the names of the columns to query are in a file, or the name of the column by which you will sort comes as a parameter from outside the transformation, or the name of the main table to query changes depending on the data stored in it (for example, sales tables). Assume the following situation: you have a database with data about books and their authors, and you want to generate a file with a list of titles.

Whether to retrieve the data ordered by title or by genre is a choice that you want to postpone until the moment you execute the transformation. The column that will define the order of the rows will be a named parameter. Remember that Named Parameters are defined in the Transformation setting window and their role is the same as the role of any Kettle variable.

If you prefer, you can skip this step and define a standard variable for this purpose. Now drag a Table Input step to the canvas, then create and select the connection to the book's database. Check the option Replace variables in script?. Use an output step, for example a Text file output step, to send the results to a file; save the transformation, and run it. Open the generated file and you will see the books ordered by title.

Now try again. Press F9 to run the transformation one more time, and press the Launch button. Open the generated file. This time you will see the titles ordered by genre. When the transformation is initialized, PDI replaces the variables by their values, provided that the Replace variables in script? option is checked.

You could even hold the full statement in a variable. Note however that you need to be cautious when implementing this.

A wrong assumption about the metadata generated by those predefined statements can make your transformation crash. You can also use the same variable more than once in the same statement. This is an advantage of using variables as an alternative to question marks when you need to execute parameterized SELECT statements.

See also: Getting data from a database by providing parameters. This recipe shows you an alternative way to parameterize a query.
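The difference between Kettle-style variable replacement and question-mark placeholders can be sketched outside Spoon in plain code. The following Python/SQLite sketch is illustrative: the `books` table, the `ORDER_COLUMN` variable name, and the substitution helper are assumptions for the example, not Kettle internals. Variable replacement is plain text substitution done before the statement is prepared, so it can inject any SQL fragment such as a column name; placeholders can only bind values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [("Carrie", "horror"), ("Dracula", "horror"), ("Emma", "romance")])

# Kettle-style variable replacement: the variable may hold any SQL fragment,
# here the ORDER BY column, and is substituted as text before preparation.
variables = {"ORDER_COLUMN": "genre"}   # hypothetical named parameter
query = "SELECT title FROM books ORDER BY ${ORDER_COLUMN}"
for name, value in variables.items():
    query = query.replace("${%s}" % name, value)
rows = [r[0] for r in conn.execute(query)]          # sorted by genre

# Question-mark placeholders bind only values, never identifiers:
# this sorts by the constant string 'genre', not by the genre column.
rows_wrong = [r[0] for r in conn.execute(
    "SELECT title FROM books ORDER BY ?", ("genre",))]
```

This is why the recipe uses variables for the sort column: a bound parameter could never change the ORDER BY clause.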

Pentaho Data Integration 4 Cookbook

Inserting or updating rows in a table

Two of the most common operations on databases, besides retrieving data, are inserting and updating rows in a table. PDI has several steps that allow you to perform these operations. Before inserting or updating rows by using this step, it is critical that you know which field or fields uniquely identify a row in the table. If you don't have a way to uniquely identify the records, you should consider other steps, as explained in the There's more section. Assume this situation: you have a file with new employees of Steel Wheels.

You have to insert those employees in the database. The file also contains old employees that have changed either the office where they work, or the extension number, or other basic information.

You will take the opportunity to update that information as well. Create a transformation and use a Text File input step to read the file employees. Provide the name and location of the file, specify comma as the separator, and fill in the Fields grid. Remember that you can quickly fill the grid by pressing the Get Fields button. As target table, type employees.

Fill the grids as shown. Save and run the transformation, then explore the employees table. For each row in your stream, Kettle looks for a row in the table that matches the condition you put in the upper grid, the grid labeled The key(s) to look up the value(s):. It doesn't find one.

Consequently, it inserts a row following the directions you put in the lower grid. If it does find a matching row, it updates that row according to what you put in the lower grid, and it only updates the columns where you put Y under the Update column. If you run the transformation with log level Detailed, you will be able to see in the log the real prepared statements that Kettle performs when inserting or updating rows in a table. Here are two alternative solutions to this use case.
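The lookup-then-insert-or-update behavior described above can be sketched in a few lines of Python over SQLite. The `employees` schema here is a simplified stand-in, not the actual Steel Wheels schema; the key column `emp_no` and the Y-flagged column `extension` are assumptions for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY,"
             " name TEXT, extension TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'Ann', 'x100')")

def insert_or_update(row):
    # Upper grid ("The key(s) to look up the value(s):"): match on emp_no.
    found = conn.execute("SELECT 1 FROM employees WHERE emp_no = ?",
                         (row["emp_no"],)).fetchone()
    if found is None:
        # No match: insert a row following the lower grid.
        conn.execute("INSERT INTO employees VALUES (?, ?, ?)",
                     (row["emp_no"], row["name"], row["extension"]))
    else:
        # Match: update only the columns flagged Y (here, extension).
        conn.execute("UPDATE employees SET extension = ? WHERE emp_no = ?",
                     (row["extension"], row["emp_no"]))

insert_or_update({"emp_no": 1, "name": "Ann", "extension": "x200"})  # updated
insert_or_update({"emp_no": 2, "name": "Bob", "extension": "x300"})  # inserted
```

Note the cost visible in the sketch: every incoming row pays for a lookup before anything is written, which is what the faster alternatives below avoid.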

This would be faster because you would be avoiding unnecessary lookup operations. The Table Output step is really simple to configure: just select the database connection and the table where you want to insert the records. If the fields coming to the Table Output step have the same names as the columns in the table, you are done. In order to handle the error when creating the hop from the Table Output step towards the Update step, select the Error handling of step option.

Alternatively, right-click the Table Output step and select Define error handling. Finally, fill the lower grid with those fields that you want to update, that is, those rows that had Y under the Update column. In this case, Kettle tries to insert all records coming to the Table Output step. The rows for which the insert fails go to the Update step, and there they are updated.
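This insert-first variant can be sketched the same way: attempt the insert for every row, and route only the rows that violate the primary key to an update. Again, the one-column `employees` schema is an illustrative assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, extension TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'x100')")

def load(row):
    try:
        # Table Output: try to insert every incoming row.
        conn.execute("INSERT INTO employees VALUES (?, ?)",
                     (row["emp_no"], row["extension"]))
    except sqlite3.IntegrityError:
        # Failed rows travel the error hop to the Update step.
        conn.execute("UPDATE employees SET extension = ? WHERE emp_no = ?",
                     (row["extension"], row["emp_no"]))

load({"emp_no": 1, "extension": "x200"})  # PK clash: updated instead
load({"emp_no": 2, "extension": "x300"})  # new row: inserted
```

The lookup only happens implicitly, inside the database's own primary key check, which is why this tends to be cheaper when most rows are new.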

The Table Output alone would insert all the rows; those that already existed would be duplicated instead of updated. In general, for best-practice reasons, this is not an advisable solution. If the table where you have to insert data defines a primary key, you should generate it. This recipe explains how to do it when the primary key is a simple sequence; a following recipe covers the case where the primary key is based on stored values.

Inserting new rows where a simple primary key has to be generated

It's very common to have tables in a database where the values for the primary key column can be generated by using a database sequence (in those DBMSs that have that feature, for example Oracle), or simply by adding 1 to the maximum value in the table. Loading data into these tables is very simple. This recipe teaches you how to do this through the following exercise: there are new offices at Steel Wheels. Getting ready: for this recipe you will use the Pentaho sample database.

If you don't have that database, you'll have to follow the instructions in the introduction of this chapter. As you will insert records into the offices table, it would be good to explore that table before doing any insert operations. Create a transformation and create a connection to the sampledata database. Use a Text file input step to read the offices file. Double-click the step, select the connection to the sampledata database, and type offices as the Target table.

Fill the Key fields grid as shown. For the Creation of technical key fields, leave the default values. From the Output category of steps, add an Update step. It's time to save the transformation and run it to see what happens. As you might guess, three new offices have been added, with primary keys 8, 9, and 10. In many situations, before inserting data into a table you have to generate the primary key.

Because the offices are new, there aren't offices in the table with the same combination of address, city, and country values, so the lookup fails. Then, the step inserts a row with the generated primary key and the fields you typed in the grid.

Finally, the step adds the generated primary key value to the stream. But as you could see, it can also be used in the particular situation where you have to generate a primary key. In the recipe you generated the PK as the maximum plus one, but as you can see in the settings window, a database sequence can also be used instead. Now suppose that you have a row that already existed in the table.
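The maximum-plus-one technique the step applies is simple enough to sketch directly in SQL. This Python/SQLite sketch uses a simplified `offices` table (the real sample schema has more columns); COALESCE is there only to cover the empty-table case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offices (office_id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO offices VALUES (?, ?)",
                 [(1, "Boston"), (2, "Paris"), (3, "Tokyo")])

def next_key():
    # Maximum existing key plus one, as the step does without a sequence.
    return conn.execute(
        "SELECT COALESCE(MAX(office_id), 0) + 1 FROM offices").fetchone()[0]

# Three new offices get consecutive generated keys.
for city in ("Sydney", "Lima", "Oslo"):
    conn.execute("INSERT INTO offices VALUES (?, ?)", (next_key(), city))
```

On a DBMS with sequences, `next_key` would instead read the sequence's next value, which is also concurrency-safe.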


In that case the lookup would have succeeded and the step wouldn't have inserted a new row. Instead, the existing primary key value would have been added to the stream, ready to be used further in the transformation, for example for updating other fields as you did in the recipe, or for inserting data in a related table. Note that this is a potentially slow step, as it uses all the values for the comparison.

See also: Inserting new rows when the primary key has to be generated based on stored values. This recipe explains the case where the primary key to be generated is not as simple as adding one to the last primary key in the table.

Inserting new rows where the primary key has to be generated based on stored values There are tables where the primary key is not a database sequence, nor a consecutive integer, but a column which is built based on a rule or pattern that depends on the keys already inserted.

For example, imagine a table where the primary key values are the letter A followed by a sequence number. In this case, you can guess the rule: the next key is an A followed by the next number in the sequence. This seems too simple, but doing it in PDI is not trivial.

This recipe will teach you how to load a table where the primary key has to be generated based on existing rows, as in that example. Suppose that you have to load author data into the book's database. You have the main data for the authors, and you have to generate the primary key as in the example above. Getting ready: run the script that creates and loads data into the book's database.

Create a transformation and create a connection to the book's database. Use a Text file input step to read the authors file. To generate the next primary key, you need to know the current maximum, so use a Table Input step to get it.

You will have a simple, clear transformation, but it will take several Kettle steps to do it. By using a Join Rows (cartesian product) step, join both streams. Your transformation should look like this. Add an Add sequence step. For the rest of the fields in the settings window, leave the default values. Add a Calculator step to build the keys.

You do it by filling the settings window as shown. In order to insert the rows, add a Table output step, double-click it, and select the connection to the book's database. As Target table, type authors.

Check the option Specify database fields. Select the Database fields tab and fill the grid as follows. Then run the transformation and explore the authors table. When you have to generate a primary key based on the existing primary keys, unless the new primary key is simple to generate by adding one to the maximum, there is no direct way to do it in Kettle.

One possible solution is the one shown in the recipe: Getting the last primary key in the table, combining it with your main stream, and using those two sources for generating the new primary keys. This is how it worked in this example.

First, by using a Table Input step, you found out the last primary key in the table; in fact, you got only the numeric part needed to build the new key. In this exercise, the value was 9. With the Join Rows (cartesian product) step, you added that value as a new column in your main stream. Taking that number as a starting point, you needed to build the new primary keys. The Calculator step adds the sequence number to that starting value, then converts the result to a String, giving it a zero-padded format mask. This leads to the properly formatted numeric parts of the new keys.

It concatenates the literal A with the previously calculated ID. Note that this approach works only in a single-user scenario.

If you run multiple instances of the transformation, they can select the same maximum value and try to insert rows with the same PK, leading to a primary key constraint violation. The key in this exercise is to get the last or maximum primary key in the table, join it to your main stream, and use that data to build the new key. After the join, the mechanism for building the final key depends on your particular case.

See also: Inserting new rows when a simple primary key has to be generated.
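The whole pipeline, get the numeric part of the last key, add the sequence number, apply the mask, prepend the literal A, can be sketched as follows. This is a single-user sketch with assumed specifics: a five-digit zero-padded mask and a starting key of A00009 stand in for the book's example values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO authors VALUES ('A00009', 'Existing Author')")

# Table Input: extract the numeric part of the last primary key.
last = conn.execute(
    "SELECT MAX(CAST(SUBSTR(id, 2) AS INTEGER)) FROM authors").fetchone()[0]

# Join Rows + Add sequence + Calculator: start from the maximum, add the
# sequence number, format with the zero-padded mask, prepend the literal A.
new_authors = ["Ann", "Bob"]
for seq, name in enumerate(new_authors, start=1):
    key = "A%05d" % (last + seq)   # assumed five-digit mask
    conn.execute("INSERT INTO authors VALUES (?, ?)", (key, name))
```

As the text warns, two concurrent runs of this logic can read the same maximum and collide on the primary key; a database sequence avoids that.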

If the primary key to be generated is simply a sequence, it is recommended to examine this recipe. If you have to delete all the rows in a table, you can even use a Truncate table job entry. For more complex situations you should use the Delete step.

Let's suppose the following situation: You have a database with outdoor products. Each product belongs to a category: tools, tents, sleeping bags, and so on.

Getting ready In order to follow the recipe, you should download the material for this chapter: a script for creating and loading the database, and an Excel file with the list of categories involved. After creating the outdoor database and loading data by running the script provided, and before following the recipe you can explore the database.

The value to which you will compare the price before deleting will be stored as a named parameter. Drag an Excel Input step to the canvas to read the Excel file with the list of categories. After that, add a Database lookup step. So far, the transformation looks like this. For higher volumes, it's better to get the variable just once in a separate stream and join the two streams with a Join Rows (cartesian product) step.

Select the Database lookup step and do a preview. Finally, add a Delete step.

You will find it under the Output category of steps. Double-click the Delete step, select the outdoor connection, and fill in the key grid as follows. Then run the transformation and explore the database. The Delete step allows you to delete rows in a table based on certain conditions.

In this case, you intended to delete rows from the products table where the price was less than or equal to 50 and the category was in a list of categories, so the Delete step is the right choice. This is how it works: the step builds a DELETE prepared statement based on the conditions you put in the grid. Then, for each row in your stream, PDI binds the values of the row to the variables in the prepared statement. Let's see it by example. In the transformation you built a stream where each row had a single category and the value for the price.
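The prepared statement plus per-row binding can be sketched concretely. The schema below is a guess at the outdoor database (table and column names such as `cat_id` are illustrative), and the Database lookup that translates a category description into its ID is shown inline:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categories (cat_id INTEGER PRIMARY KEY,"
             " description TEXT)")
conn.execute("CREATE TABLE products (name TEXT, price REAL, cat_id INTEGER)")
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(1, "tents"), (2, "tools")])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [("small tent", 40, 1), ("big tent", 120, 1), ("hammer", 15, 2)])

# One prepared statement; values are bound per incoming row.
stmt = "DELETE FROM products WHERE price <= ? AND cat_id = ?"
max_price = 50                        # the named parameter
for (description,) in [("tents",)]:   # rows read from the Excel file
    # Database lookup: translate the description into its ID.
    cat_id = conn.execute(
        "SELECT cat_id FROM categories WHERE description = ?",
        (description,)).fetchone()[0]
    conn.execute(stmt, (max_price, cat_id))
```

Only the cheap tent is deleted: the expensive tent fails the price condition and the hammer belongs to a category not in the list.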

Note that the conditions in the Delete step must be based on columns of the table itself. In this case, as you were provided with category descriptions, and the products table does not store the descriptions but the IDs of the categories, you had to use an extra step to get that ID: a Database lookup. Suppose that the first row in the Excel file had the value tents; the lookup translates it into the corresponding category ID. Refer to that recipe if you need to understand how the Database lookup step works.

These are some use cases: you receive a flat file and have to load the full content in a temporary table, or you have to create and load a dimension table with data coming from another database. You could write a CREATE TABLE statement from scratch and then create the transformation that loads the table, or you could do all that in an easier way from Spoon.

In this case, suppose that you received a file with data about countries and the languages spoken in those countries. You need to load the full content into a temporary table. The table doesn't exist and you have to create it based on the content of the file. Getting ready: in order to follow the instructions, you will need the countries file. Create a transformation and create a connection to the database where you will save the data. In order to read the countries file, use a Get data from XML step. Fill the Fields grid as follows. The symbol preceding the isofficial field is optional.

By selecting Attribute as Element, Kettle automatically understands that this is an attribute. From the Output category, drag and drop a Table Output step. Create a hop from the Get data from XML step to this new step. Double-click the Table Output step and select the connection you just created. Click on the SQL button.

After clicking on Execute, a window will show up telling you that the statement has been executed, that is, the table has been created. All the information coming from the XML file is saved into the table just created. PDI allows you to create or alter tables in your databases depending on the tasks implemented in your transformations or jobs.

To understand what this is about, let's explain the previous example. The insert is made based on the data coming to the Table Output and the data you put in the Table Output configuration window, for example the name of the table or the mapping of the fields. When you click on the SQL button in the Table Output setting window, this is what happens: Kettle builds the statements needed to execute that insert successfully.

When the window with the generated statement appeared, you executed it. This causes the table to be created, so you could safely run the transformation and insert into the new table the data coming from the file to the step. The SQL button is present in several database-related steps.

In all cases its purpose is the same: determine the statements to be executed in order to run the transformation successfully. Note that in this case the execution of the statement is not mandatory but recommended. You can execute the SQL as it is generated, you can modify it before executing it (as you did in the recipe), or you can just ignore it.
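What the SQL button does, deriving DDL from the field layout of the step, can be sketched roughly. The field list, SQL types, and the `countries_stage` table name below are assumptions for the example; real Kettle maps its own field types to DBMS-specific column types:

```python
import sqlite3

# Field layout as a (name, SQL type) list, standing in for the step's fields.
fields = [("country", "TEXT"), ("language", "TEXT"), ("isofficial", "TEXT")]

def create_table_ddl(table, fields):
    # Build a CREATE TABLE statement from the field layout,
    # in the spirit of the SQL button.
    cols = ", ".join("%s %s" % (name, sqltype) for name, sqltype in fields)
    return "CREATE TABLE %s (%s)" % (table, cols)

ddl = create_table_ddl("countries_stage", fields)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)   # executing the generated statement creates the table
```

As the text notes, you can edit the generated statement before running it, or skip it entirely if the table already matches.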

Sometimes the SQL generated includes dropping a column just because the column exists in the table but is not used in the transformation. In that case you shouldn't execute it. Read the generated statement carefully, before executing it.


Finally, you must know that if you run the statement from outside Spoon, in order to see the changes inside the tool you either have to clear the cache, by right-clicking the database connection and selecting the Clear DB Cache option, or restart Spoon.

See also: Creating or altering a database table from PDI runtime. Instead of doing these operations from Spoon during design time, you can do them at runtime.

This recipe explains the details.

Creating or altering a database table from PDI runtime

When you are developing with PDI, you know, or have the means to find out, whether the tables you need exist, and whether they have all the columns you will read or update.

If they don't exist or don't meet your requirements, you can create or modify them, and then proceed.

Assume the following scenarios: you need to load some data into a temporary table, but the task is part of a new requirement, so the table doesn't exist; or the table exists, but you need to add some new columns to it before proceeding. While you are creating the transformations and jobs, you have the chance to create or modify those tables. But if these transformations and jobs are to be run in batch mode in a different environment, nobody will be there to do these verifications or to create or modify the tables.

You need to adapt your work so these things are done automatically. Suppose that you need to do some calculations and store the results in a temporary table that will be used later in another process.

As this is a new requirement, it is likely that the table doesn't exist in the target database. You can create a job that takes care of this. Create a job, add a Start job entry, a Table exists entry, and an SQL entry, and link all the entries as shown. Review the CREATE TABLE statement in the SQL entry and fix it if needed, because you may be using a different DBMS.

Save the job and run it. Run the job again. Nothing should happen. The Table exists entry, as implied by its name, verifies if a table exists in your database. As with any job entry, this entry either succeeds or fails. If it fails, the job creates the table with an SQL entry.

If it succeeds, the job does nothing. The SQL entry is very useful not only for creating tables, as you did in the recipe, but also for executing very simple statements, for example, setting a flag before or after running a transformation.
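The Table exists / SQL job-entry pattern amounts to a check-then-create, which can be sketched as follows (SQLite, with a hypothetical `tmp_results` table; `sqlite_master` plays the role of the DBMS catalog the Table exists entry consults):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_exists(name):
    # The Table exists entry: succeeds or fails on the table's presence.
    return conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (name,)).fetchone() is not None

def ensure_table(name):
    # On failure, the SQL entry runs the CREATE TABLE statement;
    # on success, the job does nothing.
    if not table_exists(name):
        conn.execute("CREATE TABLE %s (id INTEGER PRIMARY KEY, total REAL)" % name)

ensure_table("tmp_results")   # first run: creates the table
ensure_table("tmp_results")   # second run: nothing happens
```

Running the job twice is therefore safe, matching the recipe's observation that the second run does nothing.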

The recipes cover a broad range of topics including processing files, working with databases, understanding XML structures, integrating with Pentaho BI Suite, and more. Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to find quick solutions to your needs.

The initial chapters explain the details about working with databases, files, and XML structures. Then you will see different ways for searching data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools. Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks.

There are examples and code that are ready for adaptation to individual needs.
