Contents:
Running a Map-Reduce/Spark job using a pseudo Hadoop environment on PICO
This guide will walk you through the steps of submitting a Hadoop job on our cluster for data analytics.
Prerequisites
Before learning to submit your first Hadoop job, please read the documentation for:
- How to ask for PICO access (link)
- How to use your username on CINECA systems (link)
- How to use modules for loading environments on PICO cluster (link)
- How to submit a job to PICO scheduler (link)
- General guide for PICO (link)
IMPORTANT: This approach requires a good knowledge of the Apache Hadoop platform, including its underlying components. If you lack this background, please use the IBM BigInsights platform instead (see below).
Hadoop Job template
This is a script template for submitting a job on PICO scheduler:
#!/bin/bash
#PBS -A <MY_ACCOUNT>
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mem=96GB
#PBS -q parallel
## Environment configuration
module load profile/advanced hadoop/1.2.1
# Configure a new HADOOP instance using PBS job information
$MYHADOOP_HOME/bin/myhadoop-configure.sh -c $HADOOP_CONF_DIR
# Start the Datanode, Namenode, and the Job Scheduler
$HADOOP_HOME/bin/start-all.sh
### ...your job goes here...
# Stop HADOOP services
$MYHADOOP_HOME/bin/myhadoop-shutdown.sh
You can copy/paste this template into a file inside your scratch area on the PICO cluster.
Submitting your first Hadoop PICO job
Log in to the PICO cluster:
$ ssh login.pico.cineca.it -l myusername
Then download your source code into $HOME or $CINECA_SCRATCH.
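For example, the code can be transferred from your local machine with scp (the myusername login and the mapreduce-example/ directory below are placeholders for your own username and project):

# From your local machine: copy the project directory to your home on PICO
$ scp -r mapreduce-example/ myusername@login.pico.cineca.it:
# Then, on the cluster: move it into the scratch area
$ mv mapreduce-example $CINECA_SCRATCH/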
Once your code is in place, you can use the template:
- Copy the Hadoop script template above into a file (e.g. $HOME/myhadoop.sh).
- Edit the PBS directives (account, resources) and the paths in the script to match your account and destination directory.
- Add your Hadoop operations in the section marked "...your job goes here...".
- Save the file and submit your job:
$ qsub $HOME/myhadoop.sh
To check the job status:
$ qstat -u $USER
At job completion you will find the output data in the directory specified in the script.
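As a complete sketch, the template can be filled in with a classic wordcount job. This is an illustrative example, not a tested recipe: wordcount.jar, the WordCount class and the input/ directory are placeholders for your own code and data, and are assumed to live under $CINECA_SCRATCH/mapreduce-example.

#!/bin/bash
#PBS -A <MY_ACCOUNT>
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mem=96GB
#PBS -q parallel
## Environment configuration
module load profile/advanced hadoop/1.2.1
# Configure and start a new Hadoop instance using PBS job information
$MYHADOOP_HOME/bin/myhadoop-configure.sh -c $HADOOP_CONF_DIR
$HADOOP_HOME/bin/start-all.sh
# Load the input text files into HDFS, run the job, retrieve the result
# (wordcount.jar, WordCount and input/ are placeholders for your own code/data)
cd $CINECA_SCRATCH/mapreduce-example
$HADOOP_HOME/bin/hadoop fs -put input/ input
$HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount input output
$HADOOP_HOME/bin/hadoop fs -getmerge output/ output.txt
# Stop Hadoop services
$MYHADOOP_HOME/bin/myhadoop-shutdown.sh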
Running a Map-Reduce/Spark job using the IBM BigInsights platform
Log in to the BigInsights cluster:
$ ssh biginsights.pico.cineca.it
Then download your source code into $HOME or $CINECA_SCRATCH.
After setting up your code, you can run it on the cluster.
An example submission of a MapReduce job (wordcount) is reported below:
# Load the local "input" directory containing text files into the Hadoop file system
hadoop fs -put input/ input
# View the files in the Hadoop file system
hadoop fs -ls input
# Run the Java "wordcount" job
hadoop jar wordcount.jar WordCount input output
# View the output
hadoop fs -cat output/* | more
# Get the result from Hadoop
hadoop fs -getmerge output/* output.txt
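As the section title mentions, Spark jobs can also be run on the platform. If a Spark installation is available in your environment, an application is submitted with spark-submit in much the same way; the command below is a sketch, where wordcount.jar, the WordCount class and the input/output paths are placeholders for your own application:

# Submit a Spark application to the Hadoop cluster via YARN
spark-submit --master yarn --class WordCount wordcount.jar input output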
To access BigInsights services from web click on:
https://biginsights.pico.cineca.it:8443/gateway/default/BigInsightsWeb/
Suggested browser: Mozilla Firefox
For more information