Parallel Guide
==============

Concepts
--------

zCFD uses Intel MPI (Message Passing Interface) to enable running over multiple machines or multiple GPUs, either in a single machine or across several. A paid license is required for this functionality. This is achieved by decomposing the domain into parallel regions that are distributed to each MPI rank. MPI can leverage high-bandwidth network interconnects such as InfiniBand or AWS' Elastic Fabric Adapter.

There are two main strategies for running the code in parallel:

**Hybrid MPI/OpenMP** runs one MPI process per socket, reducing the number of MPI ranks at larger core counts. Running in hybrid mode assigns multiple cores to each MPI rank, which reduces the communication overhead associated with running in parallel.

**Full MPI** runs one MPI process per core and is typically the best strategy for low core count runs, or for GPU-based runs where you would run one MPI rank per GPU available on the node.

To run in parallel, add the -n option to the run_zcfd script. The argument provided is the number of MPI processes to launch.

Hybrid MPI/OpenMP
-----------------

This is the default mode for zCFD. It will auto-detect the number of sockets on each node and set the number of OpenMP threads appropriately. The --tpn option to run_zcfd can be used to set the number of MPI processes to launch on each node; the number of OpenMP threads will then be set automatically by dividing the total number of cores between the processes.

For example, if you had two hosts with 2 x 12 core CPUs per host and you wanted to run hybrid MPI/OpenMP, you would request 1 MPI rank per CPU socket and zCFD would automatically set the number of OpenMP threads per rank to match the number of cores per socket (in this case 12).

.. code-block:: bash

    run_zcfd -m -c -n 4 --tpn 2

The number of OpenMP threads assigned to each rank can be overridden by setting **OMP_NUM_THREADS** in the calling environment.

Full MPI
--------

This is achieved by running zCFD with **OMP_NUM_THREADS** set to 1 and passing the total number of MPI ranks to the -n option.

For example, if you had two hosts with 2 x 12 core CPUs per host and wanted to run full MPI with 1 MPI rank per core, you could do the following:

.. code-block:: bash

    export OMP_NUM_THREADS=1
    run_zcfd -p -c -n 48 --tpn 24

Running across multiple machines with a cluster scheduler
----------------------------------------------------------

There are many cluster scheduling systems; commonly used ones are Slurm, PBS Pro and GridEngine. Please refer to the documentation for your scheduling software or your local HPC system for how to submit jobs to the appropriate nodes, as configuration varies between machines. Intel MPI has tight integration with the most commonly used scheduling systems and so will typically auto-detect the nodes to run on.
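As an illustration, a submission script for the two-node hybrid example above might look like the following sketch on a Slurm cluster. The job name, walltime and the path used to activate the zCFD environment are placeholders for your own site's values, and the exact scheduler options may differ on your system.

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=zcfd-hybrid    # illustrative job name
    #SBATCH --nodes=2                 # two hosts, as in the example above
    #SBATCH --ntasks-per-node=2       # one MPI rank per CPU socket
    #SBATCH --exclusive               # reserve whole nodes
    #SBATCH --time=02:00:00           # placeholder walltime

    # Activate the zCFD environment - the install location is site-specific
    source /path/to/zCFD/bin/activate

    # Intel MPI picks up the node list from the Slurm allocation:
    # 4 ranks in total, 2 per node, matching the hybrid example above
    run_zcfd -m -c -n 4 --tpn 2

The script would then be submitted with sbatch; because Intel MPI integrates with the scheduler, no host file is needed.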
Running across multiple machines without a cluster scheduler
-------------------------------------------------------------

If your cluster does not have a scheduling system then **I_MPI_HYDRA_HOST_FILE** can be set in the environment to point at a file containing the list of hostnames that zCFD will launch on. For example:

.. code-block:: bash

    export I_MPI_HYDRA_HOST_FILE=~/hosts.file

.. note::
    For this to run you need to be able to ssh without a password to each of the machines listed.

An example of the contents of a hosts file is shown below:

.. code-block:: bash

    compute01:2
    compute02:2
    compute03:1

This file specifies three hosts, compute01, compute02 and compute03, where 2 ranks will be launched on each of compute01 and compute02 and 1 rank will be launched on compute03. Alternatively, the file can simply list the hostnames, in which case processes are launched round robin through the machines listed:

.. code-block:: bash

    compute01
    compute02
    compute03

Libfabric provider selection
----------------------------

zCFD will attempt to auto-detect the correct libfabric provider to use for MPI communication. It tries the following providers in order:

.. code-block:: bash

    efa
    psm2
    mlx
    verbs
    tcp
    sockets

If **FI_PROVIDER** is set then the autodetection is skipped.

Using cluster provided libfabric
--------------------------------

By default zCFD will use the version of libfabric that ships with Intel MPI. For some advanced use cases you may wish to make use of a preinstalled libfabric, for example if you have a custom network driver or a specific configuration on your cluster. To enable this, set the **I_MPI_OFI_LIBRARY_INTERNAL** environment variable to 0:

.. code-block:: bash

    export I_MPI_OFI_LIBRARY_INTERNAL=0

Pre launch script
-----------------

On some systems the defaults set up by zCFD may not be optimal, so a custom script to modify the environment can be injected by setting the environment variable **ZCFD_PRE_LAUNCH** to the path of a script. This script will be sourced from within a bash environment just before the solver is launched. An example use case is binding a GPU to a specific network interface on a multi-homed network system.

OpenMP affinity
---------------

By default zCFD binds its OpenMP threads to cores using the **OMP_PROC_BIND** and **OMP_PLACES** environment variables.

Other MPI implementations
-------------------------

On platforms where Intel MPI doesn't work, we can build zCFD against a specific MPI implementation. This would be a custom deployment for your facility, so get in touch with us if you need this.

Running at large scale
----------------------

As zCFD is partly written in Python, the install consists of a large number of small files, which on a large-scale cluster can have a negative impact on the cluster filesystem during startup. To mitigate this it is possible to stage the code into either local storage or memory on the compute nodes, using MPI to broadcast the data. Contact us if you want to run in this mode.