Contents:
Running a Map-Reduce/Spark job using a pseudo Hadoop environment on PICO
This guide will walk you through the steps of submitting a Hadoop job on our cluster for data analytics.
Prerequisites
Before learning to submit your first Hadoop job, please read the documentation for:
- How to ask for PICO access (link)
- How to use your username on CINECA systems (link)
- How to use modules for loading environments on PICO cluster (link)
- How to submit a job to PICO scheduler (link)
- General guide for PICO (link)
IMPORTANT: This approach requires a good knowledge of the Apache Hadoop platform, including its underlying components. If you lack this background, please use the IBM BigInsights platform instead (see below).
Hadoop Job template
This is a script template for submitting a job on PICO scheduler:
#!/bin/bash
#PBS -A <MY_ACCOUNT>
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mem=96GB
#PBS -q parallel
## Environment configuration
module load profile/advanced hadoop/1.2.1
# Configure a new HADOOP instance using PBS job information
$MYHADOOP_HOME/bin/myhadoop-configure.sh -c $HADOOP_CONF_DIR
# Start the Datanode, Namenode, and the Job Scheduler
$HADOOP_HOME/bin/start-all.sh
### ...your job goes here...
# Stop HADOOP services
$MYHADOOP_HOME/bin/myhadoop-shutdown.sh
You can copy/paste this template into a file inside your scratch area on the PICO cluster.
Submitting your first Hadoop PICO job
Log in to the PICO cluster:
$ ssh login.pico.cineca.it -l myusername
Then download your source code into $HOME or $CINECA_SCRATCH.
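For example, the code can be transferred from your local machine with scp (the myusername login and the mapreduce-example/ directory below are placeholders for your own username and project):

# From your local machine: copy the project directory to your home on PICO
$ scp -r mapreduce-example/ myusername@login.pico.cineca.it:
# Then, on the cluster: move it into the scratch area
$ mv mapreduce-example $CINECA_SCRATCH/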
Once your code is in place, you can use the template:
- Copy the Hadoop script template above into a file (e.g. $HOME/myhadoop.sh).
- Edit the PBS directives (account, resources) and the paths in the script to match your account and destination directory.
- Add your Hadoop operations in the section marked "...your job goes here...".
- Save the file and submit your job:
$ qsub $HOME/myhadoop.sh
To check the job status:
$ qstat -u $USER
At job completion you will find the output data in the directory specified in the script.
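As a complete sketch, the template can be filled in with a classic wordcount job. This is an illustrative example, not a tested recipe: wordcount.jar, the WordCount class and the input/ directory are placeholders for your own code and data, and are assumed to live under $CINECA_SCRATCH/mapreduce-example.

#!/bin/bash
#PBS -A <MY_ACCOUNT>
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mem=96GB
#PBS -q parallel
## Environment configuration
module load profile/advanced hadoop/1.2.1
# Configure and start a new Hadoop instance using PBS job information
$MYHADOOP_HOME/bin/myhadoop-configure.sh -c $HADOOP_CONF_DIR
$HADOOP_HOME/bin/start-all.sh
# Load the input text files into HDFS, run the job, retrieve the result
# (wordcount.jar, WordCount and input/ are placeholders for your own code/data)
cd $CINECA_SCRATCH/mapreduce-example
$HADOOP_HOME/bin/hadoop fs -put input/ input
$HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount input output
$HADOOP_HOME/bin/hadoop fs -getmerge output/ output.txt
# Stop Hadoop services
$MYHADOOP_HOME/bin/myhadoop-shutdown.sh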
Running a Map-Reduce/Spark job using the IBM BigInsights platform
Log in to the BigInsights cluster:
$ ssh biginsights.pico.cineca.it
Then download your source code into $HOME or $CINECA_SCRATCH.
After setting up your code, you can run it on the cluster.
An example submission of a MapReduce job (wordcount) is reported below:
# Load the local "input" directory containing text files into the Hadoop file system
hadoop fs -put input/ input
# View the files in the Hadoop file system
hadoop fs -ls input
# Run the Java "wordcount" job
hadoop jar wordcount.jar WordCount input output
# View the output
hadoop fs -cat output/* | more
# Get the result from Hadoop
hadoop fs -getmerge output/* output.txt
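As the section title mentions, Spark jobs can also be run on the platform. If a Spark installation is available in your environment, an application is submitted with spark-submit in much the same way; the command below is a sketch, where wordcount.jar, the WordCount class and the input/output paths are placeholders for your own application:

# Submit a Spark application to the Hadoop cluster via YARN
spark-submit --master yarn --class WordCount wordcount.jar input output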
To access BigInsights services from web click on:
https://biginsights.pico.cineca.it:8443/gateway/default/BigInsightsWeb/
Suggested browser: Mozilla Firefox
For more information