The Big Guide for Deploying IBM Information Server onto a Linux Grid
Vincent McBurney (Manager) posted 6/20/2008
A few years ago you would hear complaints about DataStage and QualityStage documentation. There were no books, and the manuals and help files had a lot of syntax but no examples. These days things are better. You don’t get a trickle of Information Server documentation, or a steady rain, you get floods. Big floods. IBM have moved the Information Server out of the Gobi Desert and put it into the Mississippi Delta. The latest flood is a big RedBook on
Deploying a Grid Solution with IBM InfoSphere Information Server. It’s been released as a draft, IBM are looking for feedback, and this one is another beauty.
A history of InfoSphere Information Server floods:
First in June 2007 there was the SOA flood in
SOA Solutions Using IBM Information Server that had over 500 pages of a financial services scenario showing how to use DataStage and QualityStage SOA with Federated queries, DB2 stored procedures, WebSphere Integration Developer, Rational Application Developer and Microsoft Visual Studio.
Late in 2007 there was a profiling flood with
IBM WebSphere Information Analyzer and Data Quality Assessment. A look at profiling, rule building, deployment and monitoring using Information Analyzer and good old AuditStage.
Then came the massive QualityStage flood with over 900 pages in
IBM WebSphere QualityStage Methodologies, Standardization, and Matching. This came with heaps of example standardization and matching jobs using a merger/acquisition financial services scenario.
As if DataStage SOA wasn’t enough along came the DataStage of ’08 with
IBM InfoSphere DataStage Data Flow and Job Design. Over 600 pages of DataStage 8.1 design work using a retail scenario and the new Distributed Transaction stage. It had something missing from boring old manuals – it had standards and guidelines!
Grid Flood
As with the previous RedBooks this one refers to the products as InfoSphere products confirming the complete rebranding of the suite:
This IBM Redbook describes a scenario for migrating an existing IBM Information Server parallel framework implementation on a Red Hat Enterprise Linux AS 4 platform to a high availability grid environment involving four machines comprising one conductor node and three compute nodes. The steps involved in migrating the existing infrastructure to the high availability grid infrastructure and enabling existing IBM InfoSphere DataStage, IBM InfoSphere QualityStage, and IBM InfoSphere Information Analyzer jobs to exploit the grid environment are described here.
The authors of the RedBook were kidnapped from various parts of the world and locked into a cell at an IBM San Jose Center and told that for every day they didn't work on an Information Server RedBook a hamster would be killed. After a few weeks of research and documentation the team has released a draft, and if we take the time to review the draft they can finish the RedBook and be released back to their families. The authors of the RedBook:
Nagraj Alur is a Project Leader with the IBM ITSO, San Jose Center.
Anthony Corrente is a Senior Technical Support Engineer in Australia. The best DataStage people come from Australia.
Robert D Johnston directs the High Availability (HA) and Grid solutions within the IBM Information Platforms Solutions group.
Sachiko Toratani is an IT Specialist providing technical support on IBM Information Platforms products to customers in Japan.
Other contributors: John Whyman, IBM Australia (who once helped me get Information Server running on a misbehaving Windows server after everyone else had given up). Tony Curcio and Tim Davis from IBM Westboro.
This RedBook starts with a grid computing overview covering the benefits of a grid, grids versus clusters, the types of grids, grid environments (NAS versus SAN) and grid high availability - just the type of stuff you hear around the water cooler each day:
A simple way to understand grid computing is a scenario where all of the disparate computers and systems in an organization — or among organizations — become one large, integrated computing system. That single system can then be turned loose on problems and processes too large and intensive for any single computer to easily handle alone in an efficient manner.
It goes on to describe why a grid works well with the Information Server (although they don't go into why Informatica runs on a cloud and Information Server doesn't):
The Parallel Framework engine of IBM Infosphere Information Server enables IBM InfoSphere DataStage/QualityStage/Information Analyzer jobs to run in parallel on a single SMP server, or on multiple servers in a clustered environment. In both these cases, the degree of parallelism used by a job is determined by the nodes (and correspondingly the servers/machines) specified in the configuration file associated with it. To change the degree of parallelism or the servers on which a job should run, you must modify the configuration file with the new number of nodes and its associated servers. This makes the association of nodes used by a job static. With a grid implementation, the static configuration files normally defined for a cluster are replaced with dynamic configuration files created at runtime. This is accomplished with the Grid Enablement toolkit.
The RedBook launches two free downloadable grid toolkits with one minor glitch – the download materials FTP folder wasn’t available on the launch day.
* The Build Your Own Grid (BYOG) toolkit is a set of scripts and templates designed to help configure a RedHat or SuSe Linux Grid environment. Its goal is to provide the tools necessary to build the frontend node as well as the compute nodes without system administration on each compute node.
* The Grid Enablement toolkit is a set of scripts and templates whose goal is to create a dynamic configuration file (APT_CONFIG_FILE) using any one of several supported resource managers which identifies nodes that are available for processing.
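To make the "dynamic configuration file" idea a bit more concrete, here is a minimal Python sketch of what generating one at runtime might look like. This is not the Grid Enablement toolkit's code (the toolkit is a set of scripts and templates and I haven't reproduced any of it here). The function name, the host names and the disk paths are all made up for illustration, and the file layout is the usual parallel configuration file shape (node, fastname, pools, resource disk, resource scratchdisk) as I understand it. In a real grid the list of hosts would come back from the resource manager.

```python
from typing import Iterable


def build_apt_config(hosts: Iterable[str],
                     partitions_per_host: int = 1,
                     disk: str = "/data/datasets",      # hypothetical path
                     scratch: str = "/data/scratch") -> str:  # hypothetical path
    """Render a parallel configuration file as text.

    One logical node is written per partition on each host, so the degree
    of parallelism is len(hosts) * partitions_per_host.
    """
    blocks = []
    node_number = 0
    for host in hosts:
        for _ in range(partitions_per_host):
            node_number += 1
            blocks.append(
                f'  node "node{node_number}"\n'
                "  {\n"
                f'    fastname "{host}"\n'
                '    pools ""\n'
                f'    resource disk "{disk}" {{pools ""}}\n'
                f'    resource scratchdisk "{scratch}" {{pools ""}}\n'
                "  }\n"
            )
    return "{\n" + "".join(blocks) + "}\n"


if __name__ == "__main__":
    # Hypothetical host names standing in for whatever the resource
    # manager says is free at the moment the job starts.
    allocated = ["compute1", "compute2", "compute3"]
    print(build_apt_config(allocated, partitions_per_host=2))
```

The point of the sketch is the contrast with a cluster: nobody edits the file by hand to change the node list, the text is rebuilt for each run from whatever machines were handed out, and the job is then pointed at it through APT_CONFIG_FILE.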
The RedBook then takes you through the steps of a grid deployment. It talks about how to modify your DataStage, QualityStage and Information Analyzer jobs so they can run on the grid, covering environment parameters, job parameters and stage property changes for new and existing jobs.
Under grid management it describes a resource manager using Tivoli LoadLeveler as an example:
IBM Tivoli Workload Scheduler LoadLeveler is a parallel job scheduling system that allows users to run more jobs in less time by matching each job's processing needs and priority with the available resources, thereby maximizing resource utilization. LoadLeveler dispatches jobs based on their priority, resource requirements and special instructions; for example, administrators can specify that long-running jobs run only on off-hours, that short-running jobs be scheduled around long-running jobs or that certain users or groups get priority.
For grid monitoring it uses Ganglia as the example tool, showing how to install, configure and use it to track the utilization of the servers on the grid. It also goes into detail on troubleshooting the grid.
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. Ganglia is an open-source project that grew out of the University of California, Berkeley Millennium Project. Ganglia packages are provided as part of the BYOG toolkit.
A useful section is “Successful practices for the IBM Infosphere Information Server grid environment” and kudos to the documentation team for avoiding the term “best practices”. It’s got configuration guidelines and tuning guidelines. The grid setup section covers designing the grid, building it and testing it, installing a toolkit, tailoring jobs etc.
The document then moves into the grid migration scenario:
In this chapter we describe a step-by-step approach to migrating an existing single server IBM Information Server environment on a Red Hat Enterprise Linux Advanced Server 4 platform to a grid environment involving one frontend (head) node, three compute nodes, and a dedicated node is designated as the “standby” frontend node for high availability. The topics covered include:
* Business requirements
* Environment configuration
* General approach
* Install and configure the grid environment
* Tailor existing DS/IA jobs to operate in grid environment
* Manage the grid environment
* Failover to the standby node NILE
* Failback to the original frontend node ORION
This is a highly technical step-by-step scenario with instructions, screenshots, scripts and download materials. You can see config files, listq command results, DataStage example jobs, the output from toolkit scripts, LoadLeveler commands, virtual IP addresses, node failover and recovery etc.
No hamsters were harmed in the creation of this RedBook (so far).