How do I get started with Hadoop Oozie

Running Apache Oozie on Azure HDInsight clusters with enterprise security package


Apache Oozie is a workflow and coordination system for managing Apache Hadoop jobs. Oozie is built into the Hadoop stack and supports the following jobs:

  • Apache MapReduce
  • Apache Pig
  • Apache Hive
  • Apache Sqoop

You can also use Oozie to schedule jobs that are specific to a system, such as Java programs or shell scripts.

Prerequisite

An Azure HDInsight Hadoop cluster with Enterprise Security Package (ESP). For more information, see Configuring HDInsight Clusters With Enterprise Security Package.

Connect to an ESP cluster

For more information on Secure Shell (SSH), see Connect to HDInsight (Hadoop) using SSH.

  1. Connect to the HDInsight cluster using SSH:

  2. Use the klist command to verify that Kerberos authentication was successful. If it was not, use kinit to initiate Kerberos authentication.

  3. Sign in to the HDInsight gateway to register the OAuth token required to access Azure Data Lake Storage (ADLS):

    A 200 OK status response code indicates successful registration. If you receive an unauthorized response, such as 401, check the user name and password.
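
The two checks in the steps above can be sketched in a few lines. This is a minimal illustration, not part of the cluster tooling: it assumes the MIT Kerberos klist tool, whose -s flag suppresses output and exits with status 0 only when a valid ticket is cached, and the function names are hypothetical.

```python
import subprocess
from http import HTTPStatus


def has_valid_kerberos_ticket(run=subprocess.run) -> bool:
    """Return True when `klist -s` exits with status 0 (a valid ticket is cached)."""
    try:
        result = run(["klist", "-s"], check=False)
    except FileNotFoundError:
        # klist is not installed on this machine.
        return False
    return result.returncode == 0


def describe_registration(status_code: int) -> str:
    """Map the gateway's HTTP status code to the guidance given in the steps above."""
    if status_code == HTTPStatus.OK:
        return "registered"
    if status_code == HTTPStatus.UNAUTHORIZED:
        return "unauthorized: check the user name and password"
    return f"unexpected status {status_code}"
```

If has_valid_kerberos_ticket() returns False, run kinit before continuing. The run parameter exists only so that the check can be exercised without a Kerberos installation.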

Define the workflow

Oozie workflow definitions are written in the Hadoop Process Definition Language (hPDL), an XML-based process definition language. To define the workflow, do the following:

  1. Set up the workspace of a domain user:

    In the command, replace the domain user placeholder with the name of the domain user, the path placeholder with the path of the domain user's home directory, and the version placeholder with the data platform version of your cluster.

  2. Use the following command to create and edit a new file:

  3. After the nano editor opens, use the following XML code as the content of the file:

  4. Replace the cluster name placeholder with the name of your cluster.

  5. To save the file, press Ctrl+X, enter Y, and then press Enter.

    The workflow is divided into two parts:

    • Credentials: This section defines the credentials that are used to authenticate Oozie actions:

      This example uses authentication for Hive actions. For more information, see Action Authentication.

      The credentials service enables Oozie actions to impersonate the user in order to access Hadoop services.

    • Action: This section contains three actions - "map-reduce", the Hive Server 2 action, and the Hive Server 1 action:

      • The "map-reduce" action runs a MapReduce example from an Oozie package that outputs the aggregated word count.

      • The Hive Server 2 and Hive Server 1 actions query a sample Hive table provided with HDInsight.

      The Hive actions authenticate with the credentials defined in the credentials section by referencing them through the cred keyword in the action element.

  6. Copy the file to:

  7. Replace the user name placeholder in the path with your domain user name.
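
To make the link between the two parts of the workflow concrete, here is a small sketch that checks whether every cred reference on an action points to a credential defined in the credentials section. The embedded XML is a hypothetical skeleton, not the workflow from this article: the names "example-wf", "hs2-creds", and "hive2-query" and the ${jdbcURL} parameter are assumptions, and the action body is omitted, so the skeleton is not submittable as-is.

```python
import xml.etree.ElementTree as ET

# Namespace of the Oozie workflow schema used by the skeleton below.
NS = {"wf": "uri:oozie:workflow:0.4"}

# Hypothetical skeleton: one hive2 credential and one action that references it.
SAMPLE_WORKFLOW = """\
<workflow-app xmlns="uri:oozie:workflow:0.4" name="example-wf">
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="hive2-query"/>
  <action name="hive2-query" cred="hs2-creds">
    <ok to="end"/>
    <error to="end"/>
  </action>
  <end name="end"/>
</workflow-app>
"""


def undefined_credentials(workflow_xml: str) -> dict:
    """Return {action name: [credential names that are referenced but not defined]}."""
    root = ET.fromstring(workflow_xml)
    defined = {
        cred.get("name")
        for cred in root.findall("wf:credentials/wf:credential", NS)
    }
    missing = {}
    for action in root.findall("wf:action", NS):
        # The cred attribute is treated here as a comma-separated list of names.
        refs = [r.strip() for r in (action.get("cred") or "").split(",") if r.strip()]
        unknown = [r for r in refs if r not in defined]
        if unknown:
            missing[action.get("name")] = unknown
    return missing
```

An empty result from undefined_credentials(SAMPLE_WORKFLOW) means that each action's cred value matches a defined credential. A check like this catches a mistyped credential name before the job is submitted.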

Define the properties file for the Oozie job

  1. Use the following command to create and edit a new job properties file:

  2. After the nano editor opens, use the following as the content of the file:

    • Use an adl:// URI for the name node property if Azure Data Lake Storage Gen1 is your primary cluster storage. If you're using Azure Blob Storage, change it to a wasb:// URI. If you're using Azure Data Lake Storage Gen2, change it to an abfs:// URI.
    • Replace the domain user placeholder with your domain user name.
    • Replace the cluster short name placeholder with the short name of the cluster. The short name is the first six characters of the cluster name: for a cluster at "https://sixadoopcontoso.azurehdinsight.net", it is "sixado".
    • Replace the JDBC URL placeholder with the JDBC URL from the Hive configuration, for example "jdbc:hive2://headnodehost:10001/;transportMode=http".
    • To save the file, press Ctrl+X, enter Y, and then press Enter.

    This properties file must be available locally when running Oozie jobs.
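
As an illustration of the substitutions above, the following sketch derives a cluster short name (the first six characters of the cluster name) and renders key=value lines in the format that a job properties file uses. The cluster URL, the user name, and the jdbcURL key are hypothetical; the full set of keys depends on the parameters your workflow references.

```python
from urllib.parse import urlparse


def cluster_short_name(cluster_url: str) -> str:
    """Return the first six characters of the cluster name taken from its URL."""
    host = urlparse(cluster_url).hostname or cluster_url
    return host.split(".")[0][:6]


def render_properties(props: dict) -> str:
    """Render a mapping as key=value lines, one property per line."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


# Hypothetical values, shown only to illustrate the file format.
example = render_properties({
    "user.name": "exampleuser",
    "jdbcURL": "jdbc:hive2://headnodehost:10001/;transportMode=http",
    "oozie.use.system.libpath": "true",
})
short_name = cluster_short_name("https://examplecluster.azurehdinsight.net")
```

Generating the file this way keeps the per-cluster values in one place, which helps when the same workflow is submitted to more than one cluster.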

Create custom Hive scripts for Oozie jobs

You can create the two Hive scripts for Hive Server 1 and Hive Server 2 as shown in the following sections.

File for Hive Server 1

  1. Create and edit a file for Hive Server 1 actions:

  2. Create the script:

  3. Save the file in the Apache Hadoop Distributed File System (HDFS):

File for Hive Server 2

  1. Create and edit a file for Hive Server 2 actions:

  2. Create the script:

  3. Save the file in HDFS:

Submit Oozie jobs

Submitting Oozie jobs for ESP clusters is the same as submitting Oozie jobs in clusters without ESP.

For more information, see Use Apache Oozie with Apache Hadoop to define and run a workflow in Linux-based Azure HDInsight.

Results of an Oozie job submission

Oozie jobs run on behalf of the user. As a result, both the Apache Hadoop YARN and the Apache Ranger audit logs show the jobs as run under the user's identity. The output from the command line interface of an Oozie job looks like the following code example:

The Ranger audit logs for Hive Server 2 actions show that Oozie runs the action on behalf of the user. The Ranger and YARN views are available only to the cluster administrator.

Configure user authorization in Oozie

Oozie has a user authorization configuration that can prevent users from stopping or deleting other users' jobs. To enable this configuration, set oozie.service.AuthorizationService.security.enabled to true.

For more information, see Apache Oozie Installation and Configuration.
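
For reference, Hadoop-style site files express a setting as a property element with name and value children. The sketch below builds that element for the authorization setting. It assumes the standard Oozie property oozie.service.AuthorizationService.security.enabled, and the helper name is hypothetical. How the setting is applied on your cluster (for example, through the cluster management UI) is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET


def site_property(name: str, value: str) -> str:
    """Build a Hadoop-style <property> element and return it as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


snippet = site_property(
    "oozie.service.AuthorizationService.security.enabled", "true"
)
```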

For components such as Hive Server 1, for which the Ranger plug-in is not available or not supported, only coarse-grained HDFS authorization is possible. Fine-grained authorization is available only through Ranger plug-ins.

Access the Oozie web user interface

The Oozie web user interface provides a web-based display of the status of Oozie jobs in the cluster. To access the web user interface, do the following in ESP clusters:

  1. Add an edge node and enable SSH Kerberos authentication.

  2. Follow the steps outlined in Oozie Web UI to enable SSH tunneling to the edge node and access the web UI.
