Parallelization
===============

TBPLaS features hybrid MPI+OpenMP parallelization for the evaluation of band structures and DOS from exact diagonalization, response properties from the Lindhard function, the topological invariant :math:`\mathbb{Z}_2` and TBPM calculations. MPI and OpenMP can be switched on and off separately on demand, with pure OpenMP mode enabled by default. The number of OpenMP threads is controlled by the ``OMP_NUM_THREADS`` environment variable. If TBPLaS has been compiled with MKL support, the ``MKL_NUM_THREADS`` environment variable will also take effect. If neither variable has been set, OpenMP will make use of all CPU cores on the computing node. To switch off OpenMP, set these environment variables to 1.

MPI-based parallelization, on the other hand, is disabled by default, but can be easily enabled with a single option. The ``calc_bands`` and ``calc_dos`` functions of the :class:`.PrimitiveCell` and :class:`.Sample` classes, as well as the initialization functions of the :class:`.Lindhard`, :class:`.Z2`, :class:`.Solver` and :class:`.Analyzer` classes, all accept an argument named ``enable_mpi`` whose default value is ``False``. If it is set to ``True``, MPI-based parallelization is turned on, provided that the ``MPI4PY`` package has been installed. Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously. The number of processes is controlled by the MPI launcher, which receives arguments from the command line, environment variables or a configuration file. Users are recommended to check the manual of the job queuing system on their computer for properly setting the environment variables and invoking the MPI launcher. For computers without a queuing system, e.g., laptops, desktops and standalone workstations, the MPI launcher should be ``mpirun`` or ``mpiexec``, and the number of processes is controlled by the ``-np`` command-line option.

The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size and the type of calculation. Generally speaking, matrix diagonalization for a single :math:`\mathbf{k}`-point is poorly parallelized over threads, whereas the diagonalization for multiple :math:`\mathbf{k}`-points can be efficiently parallelized over processes. Therefore, for band structure and DOS calculations, as well as response properties from the Lindhard function and the topological invariant from Z2, it is recommended to run in pure MPI mode by setting the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM, since every process has to keep a copy of the wave functions and energies. So, if the available RAM imposes a limit, try fewer processes and more threads. In any case, the product of the numbers of processes and threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes :math:`\times` 1 thread, 8 processes :math:`\times` 2 threads, 4 processes :math:`\times` 4 threads, etc.

For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to consider 16 initial wave functions, then the number of processes should be 1, 2, 4, 8, or 16. The number of threads should then be set according to the number of processes.
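As a concrete sketch, the pure MPI and hybrid configurations discussed above could be launched on a standalone 16-core workstation as follows. Here ``my_job.py`` is just a placeholder for your own script, and the two configurations are alternatives, not meant to be run in sequence:

.. code-block:: bash

    # Pure MPI: 16 processes x 1 thread
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    mpirun -np 16 ./my_job.py

    # Hybrid MPI+OpenMP: 4 processes x 4 threads (uses less RAM per node)
    export OMP_NUM_THREADS=4
    export MKL_NUM_THREADS=4
    mpirun -np 4 ./my_job.py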
Again, if the RAM size is a problem, try to decrease the number of processes and increase the number of threads. Note that if your computer has HyperThreading enabled in the BIOS or UEFI, the number of available cores will be double the number of physical cores. DO NOT use the virtual cores from HyperThreading, since there will be a significant performance loss. Check the handbook of your CPU for the number of physical cores.

If MPI-based parallelization is enabled, either in pure MPI or hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be messed up or files will get corrupted, since all processes will try to modify the same file or plot the same data. This situation is avoided by checking the rank of the process before taking action. The :class:`.Lindhard`, :class:`.Z2`, :class:`.Solver`, :class:`.Analyzer` and :class:`.Visualizer` classes all offer an ``is_master`` attribute to detect the master process, whose usage will be demonstrated in the following sections.
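As a minimal sketch of this pattern, using the :class:`.Visualizer` class (any of the classes listed above exposes the same attribute), output can be guarded like this:

.. code-block:: python

    import tbplas as tb

    # Create a visualizer with MPI support enabled.
    vis = tb.Visualizer(enable_mpi=True)

    # Perform output on the master process only,
    # to avoid clashes between processes.
    if vis.is_master:
        print("This message is printed by the master process only.")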
Last but not least, we have to mention that all the calculations in previous tutorials can be run in either interactive or batch mode. You can input the script line-by-line in the terminal, or save it to a file and pass the file to the Python interpreter. However, MPI-based parallelization supports only the batch mode, since there is no way to type input into the terminals of multiple processes at the same time. In the following sections, we assume the script file to be ``test_mpi.py``. A common head block of the script is given in :ref:`para_bands` and will not be explicitly repeated in subsequent sections.

.. _para_bands:

Band structure and DOS
----------------------

We demonstrate the usage of ``calc_bands`` and ``calc_dos`` in parallel mode by calculating the band structure and DOS of a :math:`12\times12\times1` graphene sample. The procedure shown here is also valid for the primitive cell. To enable MPI-based parallelization, we need to save the script to a file, for instance, ``test_mpi.py``. The head block of this file should be:

.. code-block:: python

    #! /usr/bin/env python

    import numpy as np
    import tbplas as tb


    timer = tb.Timer()
    vis = tb.Visualizer(enable_mpi=True)

where the first line is a magic line declaring that the script should be interpreted by the Python program. In the following lines we import the necessary packages. To record and report the time usage, we create a timer from the :class:`.Timer` class. We also need a visualizer for plotting the results, where the ``enable_mpi`` argument is set to ``True`` during initialization. This head block is also essential for the other examples in subsequent sections.

For convenience, we will not build the primitive cell from scratch, but import it from the material repository with the :func:`.make_graphene_diamond` function:

.. code-block:: python

    cell = tb.make_graphene_diamond()

Then we build the sample by:

.. code-block:: python

    sample = tb.Sample(tb.SuperCell(cell, dim=(12, 12, 1), pbc=(True, True, False)))

The evaluation of band structure in parallel mode is similar to that in serial mode, which also involves generating the :math:`\mathbf{k}`-path and calling ``calc_bands``. The only difference is that we need to set the ``enable_mpi`` argument to ``True`` when calling ``calc_bands``:

.. code-block:: python

    k_points = np.array([
        [0.0, 0.0, 0.0],
        [2./3, 1./3, 0.0],
        [1./2, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ])
    k_label = ["G", "K", "M", "G"]  # labels of the high-symmetry k-points
    k_path, k_idx = tb.gen_kpath(k_points, [40, 40, 40])

    timer.tic("band")
    k_len, bands = sample.calc_bands(k_path, enable_mpi=True)
    timer.toc("band")
    vis.plot_bands(k_len, bands, k_idx, k_label)
    if vis.is_master:
        timer.report_total_time()

The ``tic`` and ``toc`` functions begin and end the recording of time usage, and receive a string as the argument for tagging the record. Note the definition of ``k_label``, which holds the labels of the high-symmetry :math:`\mathbf{k}`-points for plotting. The visualizer is aware of the parallel environment, so no special treatment is needed when plotting the results. Finally, the time usage is reported with the ``report_total_time`` function on the master process only, by checking the ``is_master`` attribute of the visualizer.

We run ``test_mpi.py`` by:

.. code-block:: bash

    $ export OMP_NUM_THREADS=1
    $ mpirun -np 1 ./test_mpi.py

With the environment variable ``OMP_NUM_THREADS`` set to 1, the script will run in pure MPI mode. We invoke 1 MPI process via the ``-np`` option of the MPI launcher ``mpirun``. The output should look like:

.. code-block:: bash

    band : 11.03s

So the evaluation of bands takes 11.03 seconds on 1 process. We try with more processes:

.. code-block:: bash

    $ mpirun -np 2 ./test_mpi.py
    band : 5.71s
    $ mpirun -np 4 ./test_mpi.py
    band : 2.93s

Obviously, the time usage scales inversely with the number of processes. A detailed discussion of the time usage and speedup under different parallelization configurations can be found in ref. 4 of :ref:`background`.

The evaluation of DOS can be parallelized in the same way, by setting the ``enable_mpi`` argument to ``True``:

.. code-block:: python

    k_mesh = tb.gen_kmesh((20, 20, 1))

    timer.tic("dos")
    energies, dos = sample.calc_dos(k_mesh, enable_mpi=True)
    timer.toc("dos")
    vis.plot_dos(energies, dos)
    if vis.is_master:
        timer.report_total_time()

The script can be run in the same way as for evaluating the band structure.

Response properties from Lindhard function
------------------------------------------

To evaluate response properties in parallel mode, simply set the ``enable_mpi`` argument to ``True`` when creating the Lindhard calculator:

.. code-block:: python

    lind = tb.Lindhard(cell=cell, energy_max=10.0, energy_step=2048,
                       kmesh_size=(600, 600, 1), mu=0.0, temperature=300.0,
                       g_s=2, back_epsilon=1.0, dimension=2, enable_mpi=True)

Subsequent calls to the functions of the :class:`.Lindhard` class do not need further special treatment. For example, the optical conductivity can be evaluated in the same way as in serial mode:

.. code-block:: python

    timer.tic("ac_cond")
    omegas, ac_cond = lind.calc_ac_cond(component="xx")
    timer.toc("ac_cond")
    vis.plot_xy(omegas, ac_cond)
    if vis.is_master:
        timer.report_total_time()

Topological invariant from Z2
-----------------------------

The evaluation of the phases :math:`\theta_m^D` can be parallelized in the same way as the response functions:

.. code-block:: python

    # ka_array, kb_array and kc define the k-grid on which the phases are
    # evaluated; they are set up in the same way as in the corresponding
    # serial calculation.
    z2 = tb.Z2(cell, num_occ=10, enable_mpi=True)

    timer.tic("z2")
    kb_array, phases = z2.calc_phases(ka_array, kb_array, kc)
    timer.toc("z2")
    vis.plot_phases(kb_array, phases / np.pi)
    if vis.is_master:
        timer.report_total_time()

where we only need to set the ``enable_mpi`` argument to ``True`` when creating the :class:`.Z2` instance.

Properties from TBPM
--------------------

TBPM calculations in parallel mode are similar to the evaluation of response functions. The user only needs to set the ``enable_mpi`` argument to ``True``. To make the time usage noticeable, we build a larger sample first:
.. code-block:: python

    sample = tb.Sample(tb.SuperCell(cell, dim=(240, 240, 1), pbc=(True, True, False)))

Then we create the configuration, solver and analyzer, with the argument ``enable_mpi=True``:

.. code-block:: python

    sample.rescale_ham(9.0)

    config = tb.Config()
    config.generic["nr_random_samples"] = 4
    config.generic["nr_time_steps"] = 256

    solver = tb.Solver(sample, config, enable_mpi=True)
    analyzer = tb.Analyzer(sample, config, enable_mpi=True)

The correlation function can be obtained and analyzed in the same way as in serial mode:

.. code-block:: python

    timer.tic("corr_dos")
    corr_dos = solver.calc_corr_dos()
    timer.toc("corr_dos")
    energies, dos = analyzer.calc_dos(corr_dos)
    vis.plot_dos(energies, dos)
    if vis.is_master:
        timer.report_total_time()

Example scripts for SLURM
-------------------------

If you are using a supercomputer with a queuing system like ``SLURM``, ``PBS`` or ``LSF``, then you need another batch script for submitting the job. Contact the administrator of the supercomputer for help on preparing the script. Here we provide two batch scripts for the ``SLURM`` queuing system as examples. ``SLURM`` has the following options for specifying parallelization details:

* ``nodes``: number of nodes for the job
* ``ntasks-per-node``: number of MPI processes to spawn on each node
* ``cpus-per-task``: number of OpenMP threads for each MPI process

Suppose that we are going to use 4 initial conditions and 1 node. The node has 2 CPUs with 8 cores per CPU. The number of MPI processes should then be 1, 2 or 4, with the number of OpenMP threads being 16, 8 or 4, respectively. Here we will use 2 processes :math:`\times` 8 threads. The batch script is as follows:

.. code-block:: bash

    #! /bin/bash
    #SBATCH --account=alice
    #SBATCH --partition=hpib
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=8
    #SBATCH --job-name=test_mpi
    #SBATCH --time=24:00:00
    #SBATCH --output=slurm-%j.out
    #SBATCH --error=slurm-%j.err

    # Load modules
    module load mpi4py tbplas

    # Set number of threads
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # Change to working directory and run the job
    cd $SLURM_SUBMIT_DIR
    srun --mpi=pmi2 python ./test_mpi.py

Here we assume the user name to be ``alice``, and we are submitting to the ``hpib`` partition. Since we are going to use 1 node, we set ``nodes`` to 1. For each node 2 MPI processes will be spawned, so ``ntasks-per-node`` is set to 2. There are 16 physical cores on the node, so ``cpus-per-task`` is set to 8.

If you want pure OpenMP parallelization, here is another example:

.. code-block:: bash

    #! /bin/bash
    #SBATCH --account=alice
    #SBATCH --partition=hpib
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=16
    #SBATCH --job-name=test_omp
    #SBATCH --time=24:00:00
    #SBATCH --output=slurm-%j.out
    #SBATCH --error=slurm-%j.err

    # Load modules
    module load tbplas

    # Set number of threads
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # Change to working directory and run the job
    cd $SLURM_SUBMIT_DIR
    srun python ./test_omp.py

In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores. Do not forget to remove ``enable_mpi=True`` when creating the solver and analyzer, in order to skip unnecessary MPI initialization.
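As a minimal sketch of that change (assuming ``test_omp.py`` is otherwise identical to ``test_mpi.py``), only the constructor calls of the solver and analyzer differ:

.. code-block:: python

    # 'sample' and 'config' are constructed exactly as in the TBPM example above.
    solver = tb.Solver(sample, config)      # enable_mpi defaults to False
    analyzer = tb.Analyzer(sample, config)  # pure OpenMP parallelization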