Aug 15, 2014

I’m proud to introduce my free mini course to help you learn how to use Pentaho PDI (Kettle).

Are you tired of trying to move data or clean your data using a programming or scripting language?

Are you tired of rushing to meet deadlines to get information moved from one system to another?

Pentaho PDI (also known as Kettle) is the tool for you.

Although PDI (Kettle) is easy to use once you understand it, it can take a little while to climb the learning curve.  To speed up that process, I’ve created a short, free tutorial to get you started.


Jul 24, 2014

Pentaho PDI (also known as Kettle) is written in Java, and therefore runs on the Java Virtual Machine. This typically works great until you try to load too much data into memory. PDI is usually pretty good about swapping memory out to disk when it runs low, but there are situations where you will get a Java “OutOfMemory” exception.

When you get hit with an out of memory exception, you need to tell the Java Virtual Machine to allocate more memory to PDI. It’s not tough to do, but the details depend on which operating system you are using.
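(One note before the per-OS instructions: as the comments in the launch scripts hint, spoon.bat and spoon.sh only fall back to their default when the PENTAHO_DI_JAVA_OPTIONS environment variable is empty. So as an alternative to editing the scripts, you can set that variable yourself before launching, e.g. set PENTAHO_DI_JAVA_OPTIONS=-Xmx2g on Windows or export PENTAHO_DI_JAVA_OPTIONS=-Xmx2g on Linux.)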

Windows

If you are using Windows, you probably launch PDI by running spoon.bat. Inside spoon.bat you need to find the following code:

REM ******************************************************************
REM ** Set java runtime options                                    **
REM ** Change 512m to higher values in case you run out of memory  **
REM ** or set the PENTAHO_DI_JAVA_OPTIONS environment variable     **
REM ******************************************************************

if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xmx512m" "-XX:MaxPermSize=256m"

We’re interested in the -Xmx parameter. As shown above, it allocates 512 MB. Put a larger number here to allocate more memory, such as -Xmx2048m. You can also use a g (for gigabytes) instead of m (for megabytes). For example, -Xmx2g would allocate 2 gigabytes of memory.

(Just in case you are interested, MaxPermSize is the amount of memory allocated to the Java classes themselves (the program, rather than your data). You shouldn’t need to tweak it.)

If you are using kitchen.bat to run jobs or pan.bat to run transformations, search for this line:
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS=-Xmx512m

Again, change the 512 to a larger number.

Linux

If you are using Linux, you probably launch PDI by running spoon.sh. Inside spoon.sh you need to find the following code:

# ******************************************************************
# ** Set java runtime options                                    **
# ** Change 512m to higher values in case you run out of memory  **
# ** or set the PENTAHO_DI_JAVA_OPTIONS environment variable     **
# ******************************************************************

if [ -z "$PENTAHO_DI_JAVA_OPTIONS" ]; then
    PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m"
fi

We’re interested in the -Xmx parameter. As shown above, it allocates 512 MB. Just put a larger number here to allocate more memory, such as -Xmx2048m. You can also use a g (for gigabytes) instead of m (for megabytes). For example, -Xmx2g would allocate 2 gigabytes of memory.

If you are using kitchen.sh to run jobs or pan.sh to run transformations, search for this line:
JAVAMAXMEM="512"

Change the 512 to a larger number. Be careful here: the way the script is written, this value MUST be in megabytes.

Mac OS X

To increase the memory allocated to Spoon (the graphical design tool) on OS X, you need to find the “Info.plist” file. If you installed PDI in a “pentaho” directory under “/Applications”, it should be at “/Applications/pentaho/data-integration/Data Integration 64-bit.app/Contents/Info.plist”. If you are using the 32-bit version, just replace the 64 with 32.

Inside the Info.plist file, find the following:

<key>VMOptions</key>
<string> -Xms512m -Xmx512m</string>
Change the -Xmx512m value to something bigger (-Xms is the initial heap size; -Xmx is the maximum, which is what matters here). For example, to allocate 2048 megabytes, you would change it to:

<key>VMOptions</key>
<string> -Xms512m -Xmx2048m</string>

You could also change it to -Xmx2g for 2 gigabytes.

If you are using kitchen.sh to run jobs or pan.sh to run transformations on your Mac, the same JAVAMAXMEM="512" change described in the Linux section applies here too (and again, the value MUST be in megabytes).

Good luck with your OutOfMemory exception.

Apr 05, 2014

I frequently find that I need to use CSV (comma separated value) files to pull data into Pentaho Kettle (PDI).  The CSV File Input step is just the tool to do it.

The Pentaho Kettle (PDI) CSV File Input Step is pretty straightforward.  There are a few little things you might want to watch out for.

Delimiter – this is the character that separates the fields in your input file.  If you’re using a CSV file, this will be a comma.  If you are using tabs, you can use the “Insert TAB” button.  Be careful: you won’t see it in the field (tabs don’t show up).  If you highlight the contents of the field, you’ll see what looks like several spaces (which is really the tab).

Enclosure – often double quotes, this is the character used to wrap a field when the delimiter appears inside the field, which happens more often than you might expect.  For instance, if you have a value of 4,345 in a comma separated file, the comma in your number will be mistaken for the delimiter, and 4,345 will be read as two values – 4 and 345.  Enclosing the number in quotes prevents this problem.
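For example, in this little file (the field names are just for illustration), the last line would split into four fields instead of three without the quotes:

id,name,amount
1,Widget,"4,345"

With the enclosure set to a double quote, 4,345 comes through as a single field.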

Field formatting – this might be the trickiest part of the CSV File Input Step.  There are a couple of good sources that explain the formatting.  http://wiki.pentaho.com/display/EAI/Text+File+Input (go down to the Fields section) has some good general information.  For a deep dive into the date format, check out http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
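That second link matters because Kettle’s date masks use Java’s SimpleDateFormat pattern syntax.  Here is a minimal Java sketch of how a mask behaves (the mask and sample date are just illustrations):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateMaskDemo {
    public static void main(String[] args) throws ParseException {
        // The same kind of mask you would type into the step's Format column
        SimpleDateFormat mask = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");

        // Parsing: text from the file becomes a real Date value
        Date parsed = mask.parse("2014/04/05 13:30:00");

        // Formatting: the Date value rendered back into text
        System.out.println(mask.format(parsed));  // prints 2014/04/05 13:30:00
    }
}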

I like using the “Get Fields” button to pull in my fields for me.  Be careful: always double-check that it scanned the data correctly.

If you find that you need more flexibility, the Pentaho Kettle (PDI) Text File Input step can also be used to import CSV files.  It also allows you to read in multiple files.

If you enjoyed this episode, please leave a comment below.

Don’t forget to sign up for the BI Dimensions newsletter to find out about more great resources to learn about Business Intelligence.

Thanks for your time and attention…

Mar 16, 2014

Don’t Believe Your First Impressions

When I first received my free review copy of “Pentaho for Big Data Analytics” by Manoj R. Patil and Feris Thia (Packt Publishing, ASIN: B00GX9FFEI), I was excited to pop open the file and take a peek.

My first impression, and my biggest complaint with the book, is that it uses a now outdated version of Pentaho. The book is based on Pentaho version 4.8, but Pentaho released version 5.0 in the Fall of 2013. (Skip the entire section of the book on the Enterprise Console; it no longer exists in version 5.0.)

So, I ignored the book for a while, having lost some of my initial enthusiasm.

However, having cracked it back open, there are a LOT of great things about this book.

The book is relatively short in these days of “books so big they could be used as a weapon”. (Thank goodness for electronic books.) It comes in at 118 pages.

Here is a list of the chapters of the book:

1 – The Rise of Pentaho Analytics along with Big Data
2 – Setting Up the Ground
3 – Churning Big Data with Pentaho
4 – Pentaho Business Analytics Tools
5 – Visualization of Big Data

My thoughts about the book…

It focuses on the community edition of Pentaho. I like this. There isn’t enough good documentation for the CE version of Pentaho tools. This book helps.

The “Setting Up the Ground” chapter does a thorough job of helping you get the Pentaho server up and running and shows how to add plugins. Again, it’s helpful that they used the community edition: installation of Pentaho’s tools is considerably easier for those who can afford the Enterprise Edition, which makes this section especially valuable for Community Edition users.

Even with all the discussion about Big Data and Hadoop, many people still don’t know the basics of Hadoop. The authors have done a nice job describing what Hadoop is and how it works. They go into depth using the HortonWorks sandbox, a tool that is quickly becoming one of the best ways to learn about Hadoop.

Their tutorials are in depth, showing you exactly what you need to do. I especially liked the tutorial showing how to pull data from and send data to HDFS using PDI (Pentaho Data Integration, aka Kettle). They also show how to load data from HDFS to Hive using Pentaho tools.

Having spent most of my time with the Enterprise Edition of Pentaho, I’ve always wanted to work more with the community edition visualization tools. This book helped me with that. The authors go in depth with CTools, a set of dashboarding and visualization tools. They step through making a dashboard, using PDI as a source for the dashboard, and doing visualizations using CTools (including how to tinker with the CSS files to make it look nice).

There are two sections in the appendix. The first provides a rich collection of free big data sources, so you can start tinkering with big data on your own. The second is an in-depth tutorial on setting up the HortonWorks Hadoop sandbox.

In Conclusion

My initial reaction to this book was wrong. Although written using an outdated version of the Pentaho tools, this book is filled with useful information that is still very relevant to the new version. The authors have done a nice job of providing detailed examples of using Pentaho with Big Data. They get extra credit points for using the community edition of Pentaho.

Worth buying? Yes.  “Pentaho for Big Data Analytics” is worth picking up.

Feb 23, 2014

Joining data is one of the basic things that you’ll need to do with your data.  Sometimes it’s best done in your database, but you often need to do it in Kettle because you are bringing together data from different sources.

The Merge Join step allows you to do four different kinds of joins:

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join

(Don’t miss my Merge Step Tutorial to help understand what joins do)

Gotta Know:  “First Step” is the “Left” part of the join and the “Second Step” is the “Right” part of the join

IMPORTANT:  don’t forget that you need to sort your data based on the fields that you are joining BEFORE you do the merge join
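For example, suppose the two steps produce these rows, each already sorted on id (the field names are just for illustration):

First Step (left):
id  name
1   Alice
2   Bob

Second Step (right):
id  amount
2   100
3   250

Joining on id, an Inner Join returns only id 2 (the id found on both sides).  A Left Outer Join returns ids 1 and 2, with a null amount for id 1.  A Right Outer Join returns ids 2 and 3, with a null name for id 3.  A Full Outer Join returns ids 1, 2, and 3.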

Feb 18, 2014

Joins can be confusing.

But they don’t have to be.

It just takes a few minutes to figure out what’s going on.  (And I’ll admit, there are still times they seem a little goofy to me.)

Check out this video for an explanation of the joins you will find in the Pentaho Kettle Merge Join Step:

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join

Here is a pdf copy of the Pentaho Kettle Merge Join Tutorial for you to download.

Feb 06, 2014

Ever need to delete an entire directory?  It’s as simple as can be with the Pentaho Kettle Delete Folders entry.

This entry is used with Kettle jobs (not transforms).  You can specify the folders directly, or pass them in through the results of a previous entry.

Copy previous results to args? – check this box if you want to pass the folder paths into the entry from the results of a previous entry.  When you check this, the Folders section is disabled.

Success condition – this is used to determine what success looks like.

  • When no error reported – success means there were no errors

  • When at least x folders are deleted – success means that at least some number of folders was deleted.  The number can be specified in the Limit field.

  • When number of errors less than – success means that there were fewer than a certain number of errors.  That number can be specified in the Limit field.

Folder – browse for the folder and then add it to the list of folders to be deleted.


There is no support for regular expression searching of folders.  However, you can do that with a separate entry and pass the results in using the “Copy previous results to args” feature.

Feb 06, 2014

I’ve had several occasions where I need to store some data between steps or jobs while running a BI job.

In one particular instance, I found myself pulling the same data from a database several times across different transforms.  To save my database from the extra work, I pulled the data the first time, and saved it to a local file.  However, instead of saving it as a text file, I used the Pentaho Kettle Serialize to File and Serialize from File steps.

These steps are easy to use.  You specify the file location and name, and everything in the stream is saved with the Serialize to File step.  EVERYTHING!  Metadata and all.  It’s as if you never left your data.  The Serialize from File step can then be used to bring the data back into a different transform.

There are two ways this is more efficient:

  1. It takes up much less disk space because the data is saved in its native binary format, not as text.

  2. Reading the data back from disk using Serialize from File is significantly faster than reading text data, because Kettle doesn’t need to parse anything (see the sketch below).
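The speed difference is the same one you see in plain Java: reading a value in its binary form versus formatting it as text and parsing it back.  This is just an illustration of the idea, not Kettle’s actual file format:

import java.io.*;

public class BinaryVsText {
    public static void main(String[] args) throws IOException {
        // Binary round trip: the double survives as bytes, no parsing needed
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("row.bin"))) {
            out.writeDouble(4345.0);
        }
        try (DataInputStream in = new DataInputStream(new FileInputStream("row.bin"))) {
            System.out.println(in.readDouble());  // direct read, type preserved
        }

        // Text round trip: the same value must be formatted and re-parsed
        String asText = Double.toString(4345.0);
        System.out.println(Double.parseDouble(asText));  // parsing cost on every field
    }
}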

These steps are only meant for short-term storage, however.  You cannot count on the file format staying compatible across versions of Kettle, so don’t use them for data that you will need for a long time.

Jan 28, 2014

When using Pentaho Kettle (PDI), there are times that you need to create your own data.  Once in a while you need to introduce rows of data that will be used by your transform.  I’ve found that it’s more common to need data because I don’t have real data to work with yet.  In that case, I hack together some data that will look like what I’ll be getting.  The Data Grid works great for this.

Watch the video above to see the Data Grid in action.

There are two tabs on the data grid.

The first tab is the Meta tab.  This is used to specify the name of each field, its data type, and formatting information.

The Data tab allows you to enter your data row by row, with each row having a spot for each field you specified on the Meta tab.
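For example, to hack together two rows of test data, you might define these fields on the Meta tab (names and values here are just for illustration):

Meta tab:
id    Integer
name  String

Data tab:
id  name
1   Alice
2   Bob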

Jan 26, 2014

Ever need to do some statistics or manipulations on groups of rows in Pentaho PDI (Kettle)?

If so, the “Group By” step is the place to turn.

Group By allows you to specify your groups, and then the operations you’d like to perform on each group.

Check out the video above to see it in action.

Here is a summary of the fields in the step:

Step Name – the name that shows on the transform for the step

Include all rows? – check this box and all of the input data will still be in the output, with the aggregated field added to each row of the group

Temporary files directory – where Kettle stores its temp files while performing this step.

TMP-file prefix – the prefix Kettle will put on each of the temp files – it makes them easier to find

Add line number, restart in each group – check this box to have each row in the group numbered, restarting with each group

Line number field name – if you checked the box on the line above, this is the field where the number is kept

Always give back a result row – check this box if you want a result row (probably all zeros) even when there is no input data

The fields that make up the group – these fields create the group

Aggregates – this is what you want done with the group.

  • Name – the field to put the result
  • Subject – the field to operate on
  • Type – what you want done (sum, average, concatenate strings, etc.)
  • Value – used with certain “Types” of aggregation

SPECIAL NOTE:

You MUST sort the data before it reaches this step, and the sort must use exactly the fields you list in the “The fields that make up the group” section.
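For example, suppose these rows arrive already sorted on region (field names are just for illustration), you put region in the group fields, and you define one aggregate with Name = total_sales, Subject = sales, and Type = Sum:

region  sales
East    10
East    20
West    5

The step outputs one row per group:

region  total_sales
East    30
West    5

With “Include all rows?” checked, you would instead get all three input rows back, each with a total_sales column added for its group.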