Glue Pig Module Configuration

Overview

Pig is configured using a properties file, the properties file contains the location to the NameNode and JobTracker. This file can be found at /etc/pig/conf/pig.properties (depending on the pig install) As a requirement from the HDFS Api the core-site.xml file must be on the classpath, but the file can be empty with only <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
</configuration>

The Pig Module allows you to setup and configure Pig for multiple clusters by pointing a logical name to a different properties file, i.e. you can run pig on two different clusters if you wanted to.

Configuration file

All modules are configured in the /opt/glue/conf/workflow_modules.groovy file

Note for CHD4 Users

Use the below classpath in your pig module configuration

classpath = ['/usr/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.1.0.jar',
                '/usr/lib/pig', '/usr/lib/pig/lib', '/opt/glue/lib/', '/usr/lib/hadoop', '/usr/lib/hadoop/lib', '/opt/glue/conf']

Its important to have the mr1 jar first in the classpath

Note hadoop 2.2.0 Users

Glue comes with hadoop 2.2.0 jars in the /opt/glue/pig/lib-hadoop-2.2.0 directory, and a hadoopless version of pig in /opt/glue/lib-pig

By default use the hadoop installed locally on the client, but if non is available use the glue provided hadoop.

classpath = ['/opt/glue/lib', '/opt/glue/conf', '/opt/hadoop', '/opt/hadoop/conf', '/opt/glue/lib-pig']

Configuration for pig is different with yarn because there is no job tracker. A pig configuration should look like:

fs.defaultFS=hdfs://localhost:8020
yarn.resourcemanager.address=localhost
yarn.resourcemanager.scheduler.address=localhost
yarn.resoucemanager.resource-tracker.address=localhost

Configuration Example

pig{
    className='org.glue.modules.hadoop.PigModule'
    isSingleton=false
    config{
    jars = []
    classpath = ['/usr/lib/pig', '/usr/lib/pig/lib', '/opt/glue/lib/', '/usr/lib/hadoop', '/usr/lib/hadoop/lib', '/opt/glue/conf']
       clusters{
               mycluster1{
                       isDefault=true
                       pigProperties="/opt/glue/conf/mycluster1.properties"
               }
               mycluster2{
                       pigProperties="/opt/glue/conf/mycluster2.properties"
               }
        }

    }

}

Properties explained: