Parallelization
TBPLaS supports hybrid MPI+OpenMP parallelization. Both features can be enabled or disabled separately when compiling tbplas-cpp; OpenMP is enabled by default. For more details, refer to the Install section.
Controlling the parallelization
The number of OpenMP threads at run time is controlled by the OMP_NUM_THREADS environment variable. If tbplas-cpp has been compiled with MKL support, the MKL_NUM_THREADS environment variable will also take effect. If neither variable is set, OpenMP makes use of all CPU cores on the computing node. To switch off OpenMP at run time, set both variables to 1.
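As an illustration, the variables can be exported in the shell before launching the calculation; the script name your_script.py below is only a placeholder:
# Run with 8 OpenMP threads (MKL_NUM_THREADS only matters for MKL builds)
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
python ./your_script.py
# Switch off threading at run time by limiting both variables to 1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
python ./your_script.py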
The number of MPI processes at run time is controlled by the MPI launcher, which accepts arguments from the command line, environment variables, or a configuration file. Users are advised to check the manual of the job queuing system on their computer for how to set the environment variables and invoke the MPI launcher properly. On computers without a queuing system, e.g., laptops, desktops, and workstations, the MPI launcher is typically mpirun or mpiexec, and the number of processes is controlled by the -np command-line option.
Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously.
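As a sketch, assuming the mpirun launcher and the placeholder script name from above, a hybrid run with 4 processes and 2 threads per process on 8 cores could be started as:
# Hybrid MPI+OpenMP: 4 processes x 2 threads = 8 cores
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
mpirun -np 4 python ./your_script.py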
Optimal parallelization setup
The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size, and the type of calculation. Generally speaking, matrix diagonalization for a single \(\mathbf{k}\)-point parallelizes poorly over threads, whereas diagonalization for multiple \(\mathbf{k}\)-points parallelizes efficiently over processes. Therefore, for exact diagonalization, it is recommended to run in pure MPI mode by setting the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM because data is duplicated on each process, so if the available RAM imposes a limit, use fewer processes and more threads. In any case, the product of the number of processes and the number of threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes \(\times\) 1 thread, 8 processes \(\times\) 2 threads, 4 processes \(\times\) 4 threads, etc.
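For instance, a minimal sketch of such a pure-MPI run on the 16 allocated cores (the script name is again a placeholder):
# Pure MPI mode for exact diagonalization: 16 processes x 1 thread
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
mpirun -np 16 python ./your_script.py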
For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to use 16 initial wave functions, the number of processes should be 1, 2, 4, 8, or 16, with the number of threads chosen accordingly so that the product still matches the number of allocated cores. Again, if the RAM size is a problem, decrease the number of processes and increase the number of threads.
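For example, a sketch for a TBPM run with 16 initial wave functions on 16 allocated cores, choosing 8 processes so that the number of wave functions is divisible by the process count:
# 16 wave functions / 8 processes = 2 wave functions per process
# 8 processes x 2 threads = 16 cores
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
mpirun -np 8 python ./your_script.py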
Parallel efficiency
Notice: If your computer has HyperThreading enabled in the BIOS or UEFI, the number of available logical cores will be twice the number of physical cores. DO NOT use the extra logical cores provided by HyperThreading, since this causes a significant performance loss. Check the handbook of your CPU for the number of physical cores.
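On Linux, one way to check the number of physical cores is the lscpu command:
# Physical cores = "Core(s) per socket" x "Socket(s)";
# "Thread(s) per core" greater than 1 indicates HyperThreading is active
lscpu | grep -E "^(Socket|Core|Thread)"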
Visualization and I/O
If MPI-based parallelization is enabled, either in pure MPI or hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be garbled or files will be corrupted, since all processes will try to modify the same file or plot the same data. The built-in I/O operations of tbplas-cpp are MPI-aware, i.e., they are safe when MPI is enabled. For user-defined I/O and visualization, however, remember to check the rank of the MPI process before acting. The solver, analyzer, and visualizer classes offer an is_master attribute for detecting the master process, whose usage is demonstrated in tbplas-cpp/samples/speedtest. Taking the test_diag_dyn_pol function as an example:
def test_diag_dyn_pol():
    # ... ...

    if lind.is_master:
        timer.report_total_time()
        vis = tb.Visualizer()
        vis.plot_xy(omegas/t, -dyn_pol[0].imag*t*a**2, color="b")
        vis.plot_xy(omegas, epsilon[1].real, color="r")

        # Unlike its C++ counterpart, calc_epsilon does not save the data,
        # so we have to save it manually.
        save_xy(f"{lind.config.prefix}_epsilon.dat", omegas, epsilon.T)
The time usage report, visualization, and user-defined I/O are restricted to the master process by checking the is_master attribute of the Lindhard class.
Example scripts for SLURM
If you are using a supercomputer with a queuing system such as SLURM, PBS, or LSF, you will also need a batch script for submitting the job. Contact the administrator of the supercomputer for help on preparing the script.
Here we provide two batch scripts for the SLURM queuing system as examples. SLURM has the following options for specifying the parallelization details:
nodes: number of nodes for the job
ntasks-per-node: number of MPI processes to spawn on each node
cpus-per-task: number of OpenMP threads for each MPI process
Suppose that we are going to use 4 initial conditions and 1 node, and that the node has 2 CPUs with 16 cores per CPU. The number of MPI processes should then be 1, 2, or 4, with 32, 16, or 8 OpenMP threads, respectively. We will use 2 processes \(\times\) 16 threads. The batch script is as follows:
#! /bin/bash
#SBATCH --partition=cpu_2d
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --job-name=test_mpi
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
# Load modules
module load tbplas
# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py
Here we assume submitting to the cpu_2d partition. Since we are going to use 1 node, nodes is set to 1. Two MPI processes will be spawned on the node, so ntasks-per-node is set to 2. There are 32 physical cores on the node, so cpus-per-task is set to 16.
If you want pure OpenMP parallelization, here is another example:
#! /bin/bash
#SBATCH --partition=cpu_2d
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --job-name=test_omp
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
# Load modules
module load tbplas
# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py
In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores.