Parallel Guide

Concepts

zCFD uses Intel MPI (Message Passing Interface) to run across multiple machines, or across multiple GPUs either in a single machine or spread over several. A paid license is required for this functionality. The domain is decomposed into parallel regions that are distributed to the MPI ranks. MPI can take advantage of high-bandwidth network interconnects such as InfiniBand or AWS's Elastic Fabric Adapter.

There are two main strategies for running the code in parallel:

Hybrid MPI/OpenMP runs one MPI process per CPU socket, reducing the number of MPI ranks at larger core counts. Each MPI rank is assigned multiple cores, which reduces the communication overhead associated with running in parallel.

Full MPI runs one MPI process per core and is typically the best strategy for low core count runs, or for GPU runs where one MPI rank is launched per GPU available on the node.

To run in parallel, add the -n option to the run_zcfd script; its argument is the number of MPI processes to launch.

Hybrid MPI/OpenMP

This is the default mode for zCFD. It will auto-detect the number of sockets on each node and set the number of OpenMP threads appropriately. The --tpn option to run_zcfd can be used to set the number of MPI processes to launch on each node. The number of OpenMP threads is then set automatically by dividing the total number of cores between the processes.

For example, if you had two hosts, each with 2 x 12 core CPUs, and you wanted to run Hybrid MPI/OpenMP, you would request one MPI rank per CPU socket and zCFD would automatically set the number of OpenMP threads per rank to match the number of cores per socket (in this case 12):

run_zcfd -m <mesh_name> -c <case_name> -n 4 --tpn 2

The number of OpenMP threads assigned to each rank can be overridden by setting OMP_NUM_THREADS in the calling environment.
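
For example, to use six OpenMP threads per rank instead of the automatically chosen value (the thread count here is purely illustrative), export the variable before launching:

export OMP_NUM_THREADS=6
run_zcfd -m <mesh_name> -c <case_name> -n 4 --tpn 2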

Full MPI

This is achieved by running zCFD with OMP_NUM_THREADS set to 1 and passing -n <tasks>, where <tasks> is the number of MPI ranks.

For example, if you had two hosts, each with 2 x 12 core CPUs, and wanted to run full MPI with one MPI rank per core, you could do the following:

export OMP_NUM_THREADS=1
run_zcfd -m <mesh_name> -c <case_name> -n 48 --tpn 24

Running across multiple machines with a cluster scheduler

There are many cluster scheduling systems; commonly used ones include Slurm, PBS Pro and GridEngine. Please refer to the scheduler's documentation or your local HPC system's documentation for how to submit jobs to the appropriate nodes, as configuration varies between machines.

Intel MPI integrates tightly with most commonly used scheduling systems and so will typically auto-detect the nodes to run on.
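
For illustration only, a minimal Slurm submission script for the hybrid example above might look like the following; the job name, partition and any environment setup (such as activating zCFD) are placeholders that will differ on your system:

#!/bin/bash
#SBATCH --job-name=zcfd
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --partition=compute

# Hybrid MPI/OpenMP: one rank per socket on two dual-socket nodes.
# Intel MPI picks the node list up from Slurm automatically.
run_zcfd -m <mesh_name> -c <case_name> -n 4 --tpn 2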

Running across multiple machines without a cluster scheduler

If your cluster does not have a scheduling system, then I_MPI_HYDRA_HOST_FILE can be set in the environment to point at a file containing the list of hostnames that zCFD will launch on. For example:

export I_MPI_HYDRA_HOST_FILE=~/hosts.file

Note

For this to work you need to be able to SSH without a password to each of the machines listed.
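
If passwordless SSH is not already configured, one common approach (assuming OpenSSH) is to generate a key and copy it to each machine in the hosts file, for example:

ssh-keygen -t ed25519
ssh-copy-id compute01
ssh-copy-id compute02
ssh-copy-id compute03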

Example contents of a hosts file are shown below:

compute01:2
compute02:2
compute03:1

This file specifies three hosts: compute01, compute02 and compute03. Two ranks will be launched on each of compute01 and compute02, and one rank will be launched on compute03.

Alternatively the file could simply list the hostnames, as below, which would launch processes round-robin through the machines listed:

compute01
compute02
compute03

Libfabric provider selection

zCFD will attempt to autodetect the correct provider to use for MPI communication. It tries the following providers in order:

efa
psm2
mlx
verbs
tcp
sockets

If FI_PROVIDER is set then the autodetection is skipped.
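
For example, to skip autodetection and force the verbs provider:

export FI_PROVIDER=verbs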

Using cluster provided libfabric

By default zCFD will use the version of libfabric that ships with Intel MPI. For some advanced use cases, such as a custom network driver or a specific configuration on your cluster, you may wish to use a preinstalled libfabric instead. To enable this, set the I_MPI_OFI_LIBRARY_INTERNAL environment variable to 0:

export I_MPI_OFI_LIBRARY_INTERNAL=0

Pre launch script

On some systems the defaults set up by zCFD may not be optimal, so a custom script to modify the environment can be injected by setting the environment variable ZCFD_PRE_LAUNCH to the path of a script. This script will be sourced from within a bash environment just before the solver is launched.

An example use case for this is to bind a GPU to a specific network interface on a multi-homed network system.
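
As a minimal sketch of the mechanism only: the pre-launch script is an ordinary bash fragment that exports whatever site-specific settings you need. In the example below, MY_NETWORK_INTERFACE is a purely hypothetical variable standing in for whatever your interconnect or GPU driver actually expects.

export ZCFD_PRE_LAUNCH=~/pre_launch.sh

where ~/pre_launch.sh might contain:

# Sourced by zCFD just before the solver is launched.
# MY_NETWORK_INTERFACE is hypothetical and stands in for a real
# site-specific interconnect setting.
case "$(hostname -s)" in
  compute01) export MY_NETWORK_INTERFACE=eth0 ;;
  *)         export MY_NETWORK_INTERFACE=eth1 ;;
esac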

OpenMP affinity

By default zCFD binds its OpenMP threads to cores using the OMP_PROC_BIND and OMP_PLACES environment variables.
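
If the default binding is not appropriate for your system, it may be possible to override these standard OpenMP variables in the calling environment, in the same way as OMP_NUM_THREADS above, for example:

export OMP_PLACES=cores
export OMP_PROC_BIND=close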

Other MPI implementations

On platforms where Intel MPI does not work, we can build zCFD against a specific MPI implementation. This would be a custom deployment for your facility, so get in touch with us if you need it.

Running at large scale

As zCFD is partly written in Python, the installation consists of a large number of small files, which on a large-scale cluster can have a negative impact on the shared filesystem during startup. To mitigate this, it is possible to stage the code into local storage or memory on the compute nodes, using MPI to broadcast the data.

Contact us if you want to run in this mode.