Parallelization
=================

TBPLaS supports hybrid MPI+OpenMP parallelization. Both features can be enabled or disabled separately when compiling ``tbplas-cpp``; OpenMP is enabled by default. For more details, refer to the :ref:`install` section.

Controlling the parallelization
---------------------------------

The number of OpenMP threads at run time is controlled by the ``OMP_NUM_THREADS`` environment variable. If ``tbplas-cpp`` has been compiled with MKL support, the ``MKL_NUM_THREADS`` environment variable also takes effect. If neither variable is set, OpenMP will use all the CPU cores on the computing node. To switch off OpenMP at run time, set both variables to 1.

The number of MPI processes at run time is controlled by the MPI launcher, which receives its arguments from the command line, environment variables or a configuration file. You are recommended to check the manual of the job queuing system on your computer for how to set the environment variables and invoke the MPI launcher properly. On computers without a queuing system, e.g., laptops, desktops and workstations, the MPI launcher is typically ``mpirun`` or ``mpiexec``, and the number of processes is controlled by the ``-np`` command-line option. Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously.
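As a quick illustration on a workstation, the commands below sketch a pure OpenMP run and a pure MPI run with OpenMP switched off. The script name ``test_mpi.py`` and the core counts are only placeholders; adapt them to your own job.

.. code-block:: bash

    # Pure OpenMP: one process with 4 threads (no MPI launcher needed);
    # MKL_NUM_THREADS only matters if tbplas-cpp was compiled with MKL
    export OMP_NUM_THREADS=4
    export MKL_NUM_THREADS=4
    python ./test_mpi.py

    # Pure MPI: switch OpenMP off and spawn 4 processes instead
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    mpirun -np 4 python ./test_mpi.py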
Optimal parallelization setup
-------------------------------

The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size and the type of calculation. Generally speaking, matrix diagonalization for a single :math:`\mathbf{k}`-point parallelizes poorly over threads, whereas diagonalization for multiple :math:`\mathbf{k}`-points parallelizes efficiently over processes. Therefore, for exact diagonalization it is recommended to run in pure MPI mode, i.e., to set the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM because data are duplicated on each process. If the available RAM imposes a limit, use fewer processes and more threads. In any case, the product of the numbers of processes and threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes :math:`\times` 1 thread, 8 processes :math:`\times` 2 threads, 4 processes :math:`\times` 4 threads, etc.

For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to use 16 initial wave functions, the number of processes should be 1, 2, 4, 8, or 16, with the number of threads set accordingly. Again, if the RAM size is a problem, decrease the number of processes and increase the number of threads.

.. admonition:: Parallel efficiency notice
    :class: important

    If your computer has HyperThreading enabled in the BIOS or UEFI, the number of available cores will be double the number of physical cores. DO NOT use the virtual cores provided by HyperThreading, since this causes a significant performance loss. Check the handbook of your CPU for the number of physical cores.
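To make the product rule concrete, the commands below sketch the three combinations mentioned above for 16 allocated cores, assuming a machine without a queuing system where the job is launched through ``mpirun``; on a cluster the same numbers would go into the batch script instead. The script name is again only a placeholder.

.. code-block:: bash

    # 16 processes x 1 thread: pure MPI, preferred for exact diagonalization
    export OMP_NUM_THREADS=1
    mpirun -np 16 python ./test_mpi.py

    # 8 processes x 2 threads
    export OMP_NUM_THREADS=2
    mpirun -np 8 python ./test_mpi.py

    # 4 processes x 4 threads: fewer processes, less duplicated RAM
    export OMP_NUM_THREADS=4
    mpirun -np 4 python ./test_mpi.py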
Visualization and I/O
-----------------------

If MPI-based parallelization is enabled, either in pure MPI or hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be garbled or files may get corrupted, since all processes will try to modify the same file or plot the same data. The built-in IO operations of ``tbplas-cpp`` are MPI-aware, i.e., they are safe when MPI is enabled. For user-defined IO and visualization, however, remember to check the rank of the MPI process before acting. The solver, analyzer and visualizer classes offer an ``is_master`` attribute for detecting the master process, whose usage is demonstrated in ``tbplas-cpp/samples/speedtest``. Taking the ``test_diag_dyn_pol`` function as an example:

.. code-block:: python
    :linenos:

    def test_diag_dyn_pol():
        # ... ...
        if lind.is_master:
            timer.report_total_time()
            vis = tb.Visualizer()
            vis.plot_xy(omegas/t, -dyn_pol[0].imag*t*a**2, color="b")
            vis.plot_xy(omegas, epsilon[1].real, color="r")
            # calc_epsilon does NOT save data, unlike its C++ counterpart,
            # so we have to save the data manually.
            save_xy(f"{lind.config.prefix}_epsilon.dat", omegas, epsilon.T)

The time usage report, visualization and user-defined IO are restricted to the master process by checking the ``is_master`` attribute of the :class:`.Lindhard` class.

Example scripts for SLURM
---------------------------

If you are using a supercomputer with a queuing system like ``SLURM``, ``PBS`` or ``LSF``, you will need an additional batch script for submitting the job. Contact the administrator of the supercomputer for help on preparing the script. Here we provide two batch scripts for the ``SLURM`` queuing system as examples.

``SLURM`` has the following options for specifying parallelization details:

* ``nodes``: number of nodes for the job
* ``ntasks-per-node``: number of MPI processes to spawn on each node
* ``cpus-per-task``: number of OpenMP threads for each MPI process

Suppose that we are going to use 4 initial conditions and 1 node, where the node has 2 CPUs with 16 cores per CPU. The number of MPI processes should then be 1, 2, or 4, with 32, 16, or 8 OpenMP threads, respectively. Here we use 2 processes :math:`\times` 16 threads. The batch script is as follows:

.. code-block:: bash

    #! /bin/bash
    #SBATCH --partition=cpu_2d
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=16
    #SBATCH --job-name=test_mpi
    #SBATCH --time=24:00:00
    #SBATCH --output=slurm-%j.out
    #SBATCH --error=slurm-%j.err

    # Load modules
    module load tbplas

    # Set number of threads
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # Change to working directory and run the job
    cd $SLURM_SUBMIT_DIR
    mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py

Here we assume submitting to the ``cpu_2d`` partition. Since we are going to use 1 node, we set ``nodes`` to 1. Two MPI processes will be spawned on the node, so ``ntasks-per-node`` is set to 2. There are 32 physical cores on the node, so ``cpus-per-task`` is set to 16.

If you want pure OpenMP parallelization, here is another example:

.. code-block:: bash

    #! /bin/bash
    #SBATCH --partition=cpu_2d
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32
    #SBATCH --job-name=test_omp
    #SBATCH --time=24:00:00
    #SBATCH --output=slurm-%j.out
    #SBATCH --error=slurm-%j.err

    # Load modules
    module load tbplas

    # Set number of threads
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # Change to working directory and run the job
    cd $SLURM_SUBMIT_DIR
    mpirun -genv FI_PROVIDER=mlx python ./test_mpi.py

In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores.
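Once a batch script has been prepared, it is submitted and monitored with the standard ``SLURM`` utilities. A minimal sketch, assuming the script above has been saved as ``job_omp.sh`` (the file name is arbitrary):

.. code-block:: bash

    # Submit the batch script to the queue
    sbatch job_omp.sh

    # Check the status of your jobs
    squeue -u $USER

    # Cancel a job if necessary (replace <jobid> with the actual job ID)
    scancel <jobid>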