Big Data Workflow Engine for Hadoop, HBase, Netezza, Pig, Hive, Cascalog ...
Glue is a job execution engine written in Java and Groovy. Workflows are written in a Groovy DSL (simple statements), Jython, or JRuby, and use pre-developed modules to interact with external resources, e.g. databases, Hadoop, Netezza, FTP, etc.
Glue helps to 'glue' together a series of interactions with external systems.
Examples:
- Load data from N MySQL tables
- Push the data to Hadoop HDFS
- Run a Pig job
- Download the output from HDFS
- Push the output to MySQL/Netezza
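In the Groovy DSL, a pipeline like this could be sketched roughly as below. This is an illustrative sketch only: the `hdfs` and `pig` module names and their method signatures are assumptions made for this example, not Glue's documented API (only `ctx.sql.eachSqlResult` appears in the examples later in this README).

```groovy
tasks{
  loadFromMysql {
    tasks = { ctx ->
      // read rows from the source tables; eachSqlResult is shown later in this README
      ctx.sql.eachSqlResult('glue', 'select unit_id from units', { rs ->
        // write each row to a local staging file (details omitted)
      })
    }
  }
  pushToHdfs {
    tasks = { ctx ->
      // hypothetical hdfs module call -- name and signature are assumptions
      ctx.hdfs.put('/tmp/units.csv', '/data/units/units.csv')
    }
  }
  runPigJob {
    tasks = { ctx ->
      // hypothetical pig module call -- name and signature are assumptions
      ctx.pig.run('/scripts/aggregate-units.pig')
    }
  }
}
```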
One big headache with big data on HDFS is running scripts when data becomes available, not just on a timed frequency: when data arrives, we want our workflow(s) to start.
Glue, via GlueCron, gives the ability to register one or more workflows against one or more HDFS directories.
Groovy is supported as a DSL.
tasks{
  myprocess1 {
    tasks = { ctx ->
      // query the 'glue' database and print each result row
      ctx.sql.eachSqlResult('glue', 'select unit_id from units', { rs -> println rs })
    }
  }
}
Clojure scripts can be written using the Groovy and Java libraries provided by Glue.
e.g.
(.exec (.ctx cascalog)
  (def input (hfs-textline "/data/a.log"))
  (?<- (stdout) [?line] (input ?line)))
Jython scripts can be written using the Groovy and Java libraries provided by Glue.
e.g.
def f2(res):
    print(str(res))

ctx.sql().eachSqlResult('glue', 'select unit_id from units', f2)
JRuby scripts can be written using the Groovy and Java libraries provided by Glue.
e.g.
$ctx.sql().eachSqlResult('glue', 'select unit_id from units',
  Closure.new(lambda { |res|
    puts "Hi #{res}"
  }))
XML is a terrible language for humans to write in, especially when writing workflows and process-oriented scripts.