
The following guide describes how to load, configure, and use MATLAB on CINECA's clusters.

Prerequisites

To use MATLAB in CINECA's HPC environment, please check the following prerequisites:

  1. You have a valid account defined on the HPC cluster; see "Become a User".
  2. You have a valid MATLAB license file, and the license server (port@hostname) is reachable and readable from the selected HPC cluster. Please ask your IT department to follow this guide and contact CINECA's staff.

When the aforementioned steps have been completed, please inform CINECA's staff of your account and of the license server's details. Your username and account will be authorized to use:

      1. Your license server for the "standard" MATLAB features (MATLAB, SIMULINK, Distrib_Computing_Toolbox, etc.)
      2. CINECA's license server for the MATLAB_Distrib_Comp_Engine (256 licenses available)

Configuration

Configure MATLAB so that, by default, jobs are submitted to the cluster rather than to the local node or local machine.

Cluster Configuration

Load the MATLAB module on the Marconi/Galileo cluster

Log in to the Marconi/Galileo cluster and type

[user@node166 ~]$ module load profile/eng autoload matlab/R2018a

Start MATLAB without the desktop.

[user@node166 ~]$ matlab -nodisplay

Configure MATLAB to run parallel jobs on CINECA's clusters by calling configCluster. For each cluster, configCluster needs to be called only once per version of MATLAB and once per user.

>> configCluster

A new profile will be created. In this case: cineca local R2018a

Then you can check the list of available profiles.

>> ALLPROFILES = parallel.clusterProfiles
ALLPROFILES =
  1x3 cell array
    {'cineca local R2018a'}    {'local'}    {'MATLAB Parallel Cloud'}

Calling [ALLPROFILES, DEFAULTPROFILE] = parallel.clusterProfiles returns a cell array containing the names of all available profiles and, separately, the name of the default profile.

Do not use the local profile!

Please set the default profile to

DEFAULTPROFILE = 'cineca local R2018a'
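A minimal way to make this persistent, assuming the profile name created by configCluster on your system matches the one above, is:

>> % Set the CINECA profile as the default for parallel commands
>> parallel.defaultClusterProfile('cineca local R2018a')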

Local Configuration

After having installed MATLAB on your workstation, copy the integration scripts into your installation path $MATLAB and add the folder and its subfolders to your MATLAB search path, as follows:

  • Unpack and copy the integration scripts into a directory of your installation path $MATLAB. In the following example, for a Windows local machine, the scripts are placed in the directory conf_files in the installation path $MATLAB.

 

  • Add the MATLAB integration scripts to your MATLAB path by clicking the Set Path button. Add the following folder and subfolders:
    • $MATLAB\Cineca.remote.R2018a
    • $MATLAB\Cineca.remote.R2018a\IntegrationScripts
    • $MATLAB\Cineca.remote.R2018a\IntegrationScripts\galileo
    • $MATLAB\Cineca.remote.R2018a\IntegrationScripts\marconi

 

  • Configure MATLAB to run parallel jobs on CINECA's clusters by calling configCluster. For each cluster, configCluster needs to be called only once per version of MATLAB and once per user.

  • Remember to set the permissions on your $MATLAB directory properly, so that jobs and tasks can be stored there.

 

  • Submission to the remote cluster requires SSH credentials. You will be prompted for your SSH username and password or identity file (private key). The username and the location of the private key will be cached in MATLAB for future sessions.

Configuring Jobs

Prior to submitting a job, various parameters have to be specified so that they can be passed to the job, such as the queue, username, e-mail address, etc.

NOTE: Any parameters specified using the workflow below will persist between MATLAB sessions.

Before specifying any parameters, you will need to obtain a handle to the cluster object.

>> % Get a handle to the cluster
>> c = parcluster;

 

You are required to specify an account name prior to submitting a job. You can retrieve your account name / budget info by using the saldo command.

>> % Specify an account to use for MATLAB jobs
>> c.AdditionalProperties.AccountName = 'account_name'

>> % Specify the queue to submit to
>> c.AdditionalProperties.QueueName = 'queue_name'

For the Galileo cluster, the queue name is 'gll_usr_prod'.

For Marconi A1, the queue name is 'bdw_usr_dbg' or 'bdw_usr_prod', with a wall time limit of 2 or 24 hours, respectively.

 

You can specify other additional parameters along with your job.

>> % Specify the e-mail address to receive notifications about your job
>> c.AdditionalProperties.EmailAddress = 'test@foo.com'
>> % Specify the walltime
>> c.AdditionalProperties.WallTime = '00:10:00'

>> % Specify processor cores per node.  Default is 18 for Marconi and 36 for Galileo.
>> c.AdditionalProperties.ProcsPerNode = 18

>> % Turn on debug messages.  Default is off (logical true/false).
>> c.AdditionalProperties.DebugMessagesTurnedOn = true

 

To see the values of the current configuration options, call the specific AdditionalProperties name.

>> % To view current configurations
>> c.AdditionalProperties.QueueName

 

Or to see the entire configuration

>> c.AdditionalProperties

 

To clear a value, assign the property an empty value ('', [], or false).

>> % To clear a configuration that takes a string as input
>> c.AdditionalProperties.EmailAddress = ''

 

To save the profile with your configuration:

>> c.saveProfile
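Putting the pieces together, a typical one-time setup session might look like the following sketch (the account name and e-mail address are placeholders to be replaced with your own values):

>> c = parcluster;
>> c.AdditionalProperties.AccountName = 'my_account';        % placeholder: your saldo account
>> c.AdditionalProperties.QueueName = 'gll_usr_prod';        % Galileo production queue
>> c.AdditionalProperties.WallTime = '00:30:00';
>> c.AdditionalProperties.EmailAddress = 'user@example.com'; % placeholder
>> c.saveProfile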

 

Serial Jobs

Use the batch command to submit asynchronous jobs to the cluster.  The batch command will return a job object which is used to access the output of the submitted job.  See the MATLAB documentation for more help on batch.

>> % Get a handle to the cluster
>> c = parcluster;

 

>> % Submit job to query where MATLAB is running on the cluster

>> j = c.batch(@pwd, 1, {});

 

>> % Query job for state:  queued | running | finished 

>> j.State

 

>> % If state is finished, fetch results

>> j.fetchOutputs{:}

 

>> % Display the diary
>> diary(j)

 

>> % Delete the job after results are no longer needed

>> j.delete

 

To retrieve a list of currently running or completed jobs, call parcluster to retrieve the cluster object.  The cluster object stores an array of jobs that were run, are running, or are queued to run.  This allows us to fetch the results of completed jobs.  Retrieve and view the list of jobs as shown below.

>> c = parcluster;
>> jobs = c.Jobs

 

Once we’ve identified the job we want, we can retrieve the results as we’ve done previously.

fetchOutputs is used to retrieve function output arguments; if using batch with a script, use load instead. Data that has been written to files on the cluster needs to be retrieved directly from the file system.
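For example, a minimal sketch of the script case, where 'myscript' is a hypothetical script name on the MATLAB path:

>> % Submit a script-based batch job and load its workspace when done
>> j = c.batch('myscript');
>> j.wait
>> load(j)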

 

To view results of a previously completed job:

>> % Get a handle to the job with job array index 2
>> j2 = c.Jobs(2);

 

NOTE: You can view a list of your jobs, as well as their IDs, using the above c.Jobs command. 

>> % Fetch results for job with ID 2
>> j2.fetchOutputs{:}

 

>> % If the job produced an error, view the error log file
>> c.getDebugLog(j2.Tasks(1))

 

NOTE: When submitting independent jobs with multiple tasks, you will have to specify the task number, as sketched below.
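As an illustration, a minimal sketch of an independent job with multiple tasks, where each task's log is inspected by its index (the @rand calls are placeholders for your own task functions):

>> c = parcluster;
>> job = createJob(c);              % independent job: tasks run separately
>> createTask(job, @rand, 1, {3});  % task 1 (placeholder work)
>> createTask(job, @rand, 1, {4});  % task 2 (placeholder work)
>> job.submit
>> job.wait
>> % Inspect the log of a specific task, e.g. the second one
>> c.getDebugLog(job.Tasks(2))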

Parallel Jobs

Users can also submit parallel workflows with batch.

The following examples are available in the directory pointed to by the environment variable CIN_EXAMPLE, which is defined after the module has been loaded:

CIN_EXAMPLE=/cineca/prod/opt/tools/matlab/CINECA_example
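To call the examples from MATLAB, you can, for instance, add that directory to your search path (a sketch, assuming the module has been loaded so that CIN_EXAMPLE is set in the environment):

>> % Make the CINECA examples callable from MATLAB
>> addpath(getenv('CIN_EXAMPLE'))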

Example 1: parallel_example.m

Example 2: hpccLinpack.m

Example 1 

parallel_example.m

 Let’s use the following example for a parallel job. 

We will use the batch command again, but since we’re running a parallel job, we’ll also specify a MATLAB Pool.     
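The actual source is in $CIN_EXAMPLE/parallel_example.m; as a rough idea of its shape, a sketch of such a function might look like the following (the loop body and pause length are illustrative, not the real file contents):

function t = parallel_example(iter)
% Time a parfor loop in which each iteration simulates some work
if nargin == 0
    iter = 16;          % default: 16 iterations
end
t0 = tic;
parfor i = 1:iter
    pause(2);           % placeholder for real computation
end
t = toc(t0);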

>> % Get a handle to the cluster
>> c = parcluster;

 

>> % Submit a batch pool job using 4 workers for 16 iterations
>> j = c.batch(@parallel_example, 1, {}, 'Pool', 4);

 

For more information on the batch command, please see the MATLAB online documentation.

>> % View current job status
>> j.State

 

>> % Fetch the results after a finished state is retrieved

>> j.fetchOutputs{:}
ans = 15.5328

 

>> % Display the diary
>> diary(j)

 

The job ran in 15.53 seconds using 4 workers.

Note that these jobs will always request N+1 cores, since one worker is required to manage the batch job and the pool of workers.

For example, a job that needs eight workers will consume nine CPU cores.

We’ll run the same simulation, but increase the Pool size.  This time, to retrieve the results at a later time, we’ll keep track of the job ID.

 

NOTE: For some applications, there will be a diminishing return when allocating too many workers, as the overhead may exceed computation time.   

 

>> % Get a handle to the cluster
>> c = parcluster;

 

>> % Submit a batch pool job using 8 workers for 16 simulations
>> j = c.batch(@parallel_example, 1, {}, 'Pool', 8);

 

>> % Get the job ID
>> id = j.ID
id = 4
>> % Clear workspace, as though we quit MATLAB
>> clear j

 

Once we have a handle to the cluster, we’ll call the findJob method to search for the job with the specified job ID.  

>> % Get a handle to the cluster
>> c = parcluster;

 

>> % Find the old job
>> j = c.findJob('ID', 4);

 

>> % Retrieve the state of the job
>> j.State
ans =
finished
>> % Fetch the results
>> j.fetchOutputs{:}
ans =
6.4488
>> % If necessary, retrieve the output/error log file
>> c.getDebugLog(j)

 

This time the job ran in 6.45 seconds using 8 workers. Run the code with different numbers of workers to determine the ideal number to use, as sketched below.
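For instance, a hypothetical sweep over a few arbitrary pool sizes:

% Sweep over pool sizes to find a good worker count (sizes are arbitrary)
c = parcluster;
for nw = [4 8 16]
    j = c.batch(@parallel_example, 1, {}, 'Pool', nw);
    j.wait                                     % block until the job finishes
    fprintf('%2d workers: %.2f s\n', nw, j.fetchOutputs{1});
end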

Example 2

hpccLinpack.m

This example is taken from $MATLAB_HOME/toolbox/distcomp/examples/benchmark/hpcchallenge/.

It is an implementation of the HPCC Global HPL benchmark:

function perf = hpccLinpack( m )

The function input is the size of the real m-by-m matrix to be inverted. The output is perf, the performance in gigaflops.

Start by submitting on 1 core, with m=1024:

j = c.batch(@hpccLinpack, 1, {1024}, 'Pool', 1)
Data size: 0.007812 GB
Performance: 1.576476 GFlops

Repeat on one full node:

j = c.batch(@hpccLinpack, 1, {1024}, 'Pool', 35)
Data size: 0.007812 GB
Performance: 0.311111 GFlops

Increase the size of the matrix:

 j = c.batch(@hpccLinpack, 1, {2048}, 'Pool', 35)
Data size: 0.031250 GB
Performance: 2.466961 GFlops
j = c.batch(@hpccLinpack, 1, {4096}, 'Pool', 35)
Data size: 0.125000 GB
Performance: 47.951919 GFlops

Use two full nodes:

Data size: 0.125000 GB
Performance: 14.709730 GFlops

and double the matrix size:

j = c.batch(@hpccLinpack, 1, {8192}, 'Pool', 71)

Data size: 0.500000 GB
Performance: 86.003520 GFlops

...

j = c.batch(@hpccLinpack, 1, {16384}, 'Pool', 35)

Data size: 2.000000 GB
Performance: 356.687648 GFlops

MdcsDataLocation

Please take into account that the information and metadata on your jobs are stored in your $HOME directory under:

$HOME/MdcsDataLocation/cineca/R2018a/local

Please check the disk quota of this directory and remove old/unused metadata.
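One way to clean up from within MATLAB, assuming you no longer need any of the stored jobs, is to delete them through the cluster object (this permanently removes their data and results):

>> c = parcluster;
>> % Deleting jobs also removes their metadata under MdcsDataLocation
>> delete(c.Jobs)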

Debugging

 

If a serial job produces an error, we can call the getDebugLog method to view the error log file.

>> j.Parent.getDebugLog(j.Tasks(1))

 

When submitting independent jobs with multiple tasks, you will have to specify the task number. For Pool jobs, do not dereference into the job object.

>> j.Parent.getDebugLog(j)

 

The scheduler job ID can be derived by calling schedID:

>> schedID(j)
ans =
25539

To learn more

To learn more about the MATLAB Parallel Computing Toolbox, check out these resources:

 

Parallel Computing Benchmark and Performance

 


Checking the license server

To check the CINECA's license server status, please type: 

[user@node165]$ module load profile/eng autoload matlab/R2018a
[user@node165]$ lmutil lmstat -a -c $MLM_LICENSE_FILE

Features enabled:

Users of MATLAB:  5 licenses issued

Users of Distrib_Computing_Toolbox:  5 licenses issued

Users of MATLAB_Distrib_Comp_Engine:  256 licenses issued

Please take into account that when opening MATLAB you will check out the following features from your license server:

  • 1 license of MATLAB
  • 1 license of Distrib_Computing_Toolbox

To check the use of these features, after having loaded MATLAB, type:

lmutil lmstat -c port@your_license_server -f MATLAB
lmutil lmstat -c port@your_license_server -f Distrib_Computing_Toolbox

To check the use of the MATLAB_Distrib_Comp_Engine feature on CINECA's license server, please type:

lmutil lmstat -c 27200@license02-a.cineca.it:/homelic/licmath1/license.lic -f MATLAB_Distrib_Comp_Engine

Troubleshooting

If you see the following error when launching a parallel job:

About to evaluate task with DistcompEvaluateFileTask
About to evaluate task with DistcompEvaluateFileTask
About to evaluate task with DistcompEvaluateFileTask
Enter distcomp_evaluate_filetask_core
Enter distcomp_evaluate_filetask_core
This process will exit on any fault.
...
Error initializing MPI: Undefined function or variable 'mpiInitSigFix'.

Please type the following at the MATLAB command line:

>> rehash toolboxcache

This updates the cache file and makes permanent the patch for InfiniBand QDR that is applied on top of the standard MATLAB installation.

 

 

