Parallelization

TBPLaS supports hybrid MPI+OpenMP parallelization. Both features can be enabled or disabled separately during the compilation of tbplas-cpp, with OpenMP enabled by default. For more details, refer to the Install section.

Controlling the parallelization

The number of OpenMP threads at run time is controlled by the OMP_NUM_THREADS environment variable. If tbplas-cpp has been compiled with MKL support, the MKL_NUM_THREADS environment variable also takes effect. If neither variable is set, OpenMP will make use of all the CPU cores on the computing node. To switch off OpenMP at run time, set both variables to 1.
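For example, on a workstation the thread count can be set in the shell before launching the job script (a minimal sketch; test_mpi.py stands in for your own script):

# Run with 8 OpenMP (and MKL) threads
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
python ./test_mpi.py

# Switch off OpenMP at run time
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
python ./test_mpi.py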

The number of MPI processes at run time is controlled by the MPI launcher, which receives arguments from the command line, environment variables or a configuration file. You are recommended to check the manual of the job queuing system on your computer for properly setting the environment variables and invoking the MPI launcher. For computers without a queuing system, e.g., laptops, desktops and workstations, the MPI launcher is typically mpirun or mpiexec, and the number of processes is controlled by the -np command-line option. Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously.
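For example, a run with 4 MPI processes on a workstation could be launched as follows (a sketch; test_mpi.py is only a placeholder for your own script):

# Launch 4 MPI processes; mpiexec accepts the same -np option
mpirun -np 4 python ./test_mpi.py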

Optimal parallelization setup

The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size and the type of calculation. Generally speaking, matrix diagonalization for a single \(\mathbf{k}\)-point parallelizes poorly over threads, whereas diagonalization for multiple \(\mathbf{k}\)-points parallelizes efficiently over processes. Therefore, for exact diagonalization it is recommended to run in pure MPI mode by setting the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM because data is duplicated on each process. So, if the available RAM imposes a limit, try fewer processes and more threads. In any case, the product of the numbers of processes and threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes \(\times\) 1 thread, 8 processes \(\times\) 2 threads, 4 processes \(\times\) 4 threads, etc.
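For the 16-core example, the three combinations could be launched as follows (a sketch, again with test_mpi.py as a placeholder):

# 16 processes x 1 thread: pure MPI, preferred for exact diagonalization
export OMP_NUM_THREADS=1
mpirun -np 16 python ./test_mpi.py

# 8 processes x 2 threads
export OMP_NUM_THREADS=2
mpirun -np 8 python ./test_mpi.py

# 4 processes x 4 threads: fewer processes, less duplicated data in RAM
export OMP_NUM_THREADS=4
mpirun -np 4 python ./test_mpi.py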

For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to consider 16 initial wave functions, the number of processes should be 1, 2, 4, 8 or 16. The number of threads should then be chosen such that processes \(\times\) threads equals the number of allocated cores. Again, if the RAM size is a problem, try to decrease the number of processes and increase the number of threads.
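As a concrete illustration, a TBPM run with 16 initial wave functions on a 32-core node could use 8 processes \(\times\) 4 threads (a sketch; test_tbpm.py is a hypothetical script name):

# 16 initial wave functions are evenly distributed over 8 processes;
# 8 processes x 4 threads = 32 allocated cores
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
mpirun -np 8 python ./test_tbpm.py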

Parallel efficiency notice

If your computer has HyperThreading enabled in the BIOS or UEFI, the number of available cores will be twice the number of physical cores. DO NOT use the virtual cores from HyperThreading, since there will be a significant performance loss. Check the handbook of your CPU for the number of physical cores.
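On Linux, the number of physical cores can also be checked with the lscpu command, for example:

# Socket(s) x Core(s) per socket = number of physical cores;
# "Thread(s) per core: 2" indicates that HyperThreading is enabled
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'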

Visualization and I/O

If MPI-based parallelization is enabled, either in pure MPI or hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be garbled or files will get corrupted, since all processes will try to modify the same file or plot the same data. The built-in IO operations of tbplas-cpp are MPI-aware, i.e., they are safe when MPI is enabled. But for user-defined IO and visualization, remember to check the rank of the MPI process before acting. The solver, analyzer and visualizer classes offer an is_master attribute to detect the master process, whose usage is demonstrated in tbplas-cpp/samples/speedtest. Taking the test_diag_dyn_pol function as an example:

def test_diag_dyn_pol():
    # ... ...

    if lind.is_master:
        timer.report_total_time()
        vis = tb.Visualizer()
        vis.plot_xy(omegas/t, -dyn_pol[0].imag*t*a**2, color="b")
        vis.plot_xy(omegas, epsilon[1].real, color="r")

        # calc_epsilon DOES NOT save data, unlike its C++ counterpart. We have
        # to save the data manually.
        save_xy(f"{lind.config.prefix}_epsilon.dat", omegas, epsilon.T)

The time usage report, visualization and user-defined IO are restricted to the master process by checking the is_master attribute of the Lindhard class.

Example scripts for SLURM

If you are using a supercomputer with a queuing system such as SLURM, PBS or LSF, you will also need a batch script for submitting the job. Contact the administrator of the supercomputer for help with preparing the script.

Here we provide two batch scripts for the SLURM queuing system as examples. SLURM has the following options for specifying the parallelization details:

  • nodes: number of nodes for the job

  • ntasks-per-node: number of MPI processes to spawn on each node

  • cpus-per-task: number of OpenMP threads for each MPI process

Suppose that we are going to use 4 initial conditions and 1 node, and that the node has 2 CPUs with 16 cores per CPU. The number of MPI processes should then be 1, 2 or 4, with 32, 16 or 8 OpenMP threads per process, respectively. Here we use 2 processes \(\times\) 16 threads. The batch script is as follows:

#! /bin/bash
#SBATCH --partition=cpu_2d
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --job-name=test_mpi
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Load modules
module load tbplas

# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py

Here we assume submitting to the cpu_2d partition. Since we are going to use 1 node, nodes is set to 1. Two MPI processes will be spawned on each node, so ntasks-per-node is set to 2. There are 32 physical cores on the node, so cpus-per-task is set to 16.

If you want pure OpenMP parallelization, here is another example:

#! /bin/bash
#SBATCH --partition=cpu_2d
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --job-name=test_omp
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Load modules
module load tbplas

# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py

In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores.