Parallelization

TBPLaS features hybrid MPI+OpenMP parallelization for the evaluation of band structures and DOS from exact diagonalization, response properties from the Lindhard function, the topological invariant \(\mathbb{Z}_2\), and TBPM calculations. MPI and OpenMP can be switched on and off separately on demand, while pure OpenMP mode is enabled by default.

The number of OpenMP threads is controlled by the OMP_NUM_THREADS environment variable. If TBPLaS has been compiled with MKL support, the MKL_NUM_THREADS environment variable also takes effect. If neither variable has been set, OpenMP uses all the CPU cores on the computing node. To switch off OpenMP, set both variables to 1.

MPI-based parallelization, in contrast, is disabled by default but can be enabled with a single option. The calc_bands and calc_dos functions of the PrimitiveCell and Sample classes, as well as the initialization functions of the Lindhard, Z2, Solver and Analyzer classes, all accept an argument named enable_mpi, whose default value is False. If it is set to True, MPI-based parallelization is turned on, provided that the MPI4PY package has been installed. Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously. The number of processes is controlled by the MPI launcher, which receives arguments from the command line, environment variables or a configuration file. Users are recommended to check the manual of the job queuing system on their computer for properly setting the environment variables and invoking the MPI launcher. For computers without a queuing system, e.g., laptops, desktops and standalone workstations, the MPI launcher is typically mpirun or mpiexec, and the number of processes is controlled by its -np command-line option.
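Before running production jobs, it can be helpful to verify that the launcher and the environment variables behave as expected. The following minimal sketch is not part of TBPLaS; it only uses MPI4PY and the standard library to print the parallel environment from the master process (the file name check_env.py is our own choice):

#! /usr/bin/env python

import os
from mpi4py import MPI

# Query the global communicator and print from the master process (rank 0)
# only, so that the output appears once instead of once per process.
comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    print("MPI processes  :", comm.Get_size())
    print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS", "unset"))

Saving it as check_env.py and running it with, e.g., OMP_NUM_THREADS set to 2 and mpirun -np 4 should report 4 processes and 2 threads, i.e., a hybrid configuration occupying 8 cores in total.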

The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size and the type of calculation. Generally speaking, matrix diagonalization for a single \(\mathbf{k}\)-point is poorly parallelized over threads, whereas diagonalization for multiple \(\mathbf{k}\)-points is efficiently parallelized over processes. Therefore, for band structure and DOS calculations, as well as response properties from the Lindhard function and the topological invariant from Z2, it is recommended to run in pure MPI mode by setting the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM, since every process has to keep a copy of the wave functions and energies. So, if the available RAM imposes a limit, try fewer processes and more threads. In any case, the product of the numbers of processes and threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes \(\times\) 1 thread, 8 processes \(\times\) 2 threads, 4 processes \(\times\) 4 threads, etc. For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to use 16 initial wave functions, the number of processes should be 1, 2, 4, 8 or 16, and the number of threads should be set accordingly. Again, if the RAM size is a problem, decrease the number of processes and increase the number of threads. Note that if your computer has HyperThreading enabled in the BIOS or UEFI, the number of available cores will be twice the number of physical cores. DO NOT use the virtual cores from HyperThreading, since this causes significant performance loss. Check the handbook of your CPU for the number of physical cores.
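As a quick illustration of these rules, the following standalone snippet (plain Python, not part of TBPLaS) enumerates the process/thread combinations that fill the allocated cores exactly and, for TBPM, keep the number of processes a divisor of the number of random initial wave functions:

def parallel_configs(num_cores, num_samples):
    """List (processes, threads) pairs with processes * threads == num_cores
    and num_samples divisible by the number of processes."""
    configs = []
    for num_proc in range(1, num_cores + 1):
        if num_cores % num_proc == 0 and num_samples % num_proc == 0:
            configs.append((num_proc, num_cores // num_proc))
    return configs

# 16 allocated physical cores and 16 random initial wave functions:
print(parallel_configs(16, 16))
# [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]

For calculations that do not involve random initial wave functions, only the first condition (processes \(\times\) threads = cores) applies.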

If MPI-based parallelization is enabled, either in pure MPI or in hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be garbled or files will get corrupted, since all the processes would try to modify the same file or plot the same data. This situation is avoided by checking the rank of the process before acting. The Lindhard, Z2, Solver, Analyzer and Visualizer classes all offer an is_master attribute for detecting the master process, whose usage is sketched below and demonstrated in the following sections.
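A minimal sketch of this pattern, assuming only the is_master attribute described above, looks like:

import tbplas as tb

vis = tb.Visualizer(enable_mpi=True)
# Write output on the master process only, to avoid duplicated or
# corrupted output when several processes run the same script.
if vis.is_master:
    print("This line is printed once, not once per process.")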

Last but not least, we mention that all the calculations in the previous tutorials can be run in either interactive or batch mode: you can type the script line by line in the terminal, or save it to a file and pass the file to the Python interpreter. MPI-based parallelization, however, supports only the batch mode, since there is no way to feed terminal input to multiple processes at the same time. In the following sections we assume the script file to be test_mpi.py. A common head block of the script is given in Band structure and DOS and will not be explicitly repeated in subsequent sections.

Band structure and DOS

We demonstrate the usage of calc_bands and calc_dos in parallel mode by calculating the band structure and DOS of a \(12\times12\times1\) graphene sample. The procedure shown here is also valid for the primitive cell. To enable MPI-based parallelization, we need to save the script to a file, for instance test_mpi.py. The head block of this file should be:

#! /usr/bin/env python

import numpy as np
import tbplas as tb


timer = tb.Timer()
vis = tb.Visualizer(enable_mpi=True)

where the first line is a shebang line declaring that the script should be interpreted by Python. In the following lines we import the necessary packages. To record and report the time usage, we create a timer from the Timer class. We also need a visualizer for plotting the results, with the enable_mpi argument set to True during initialization. This head block is also required by the examples in subsequent sections.

For convenience, we will not build the primitive cell from scratch, but import it from the material repository with the make_graphene_diamond() function:

cell = tb.make_graphene_diamond()

Then we build the sample by:

sample = tb.Sample(tb.SuperCell(cell, dim=(12, 12, 1), pbc=(True, True, False)))

The evaluation of the band structure in parallel mode is similar to that in serial mode, and also involves generating the \(\mathbf{k}\)-path and calling calc_bands. The only difference is that we need to set the enable_mpi argument to True when calling calc_bands:

k_points = np.array([
    [0.0, 0.0, 0.0],
    [2./3, 1./3, 0.0],
    [1./2, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])
k_label = ["G", "K", "M", "G"]  # labels of the high-symmetry points above
k_path, k_idx = tb.gen_kpath(k_points, [40, 40, 40])
timer.tic("band")
k_len, bands = sample.calc_bands(k_path, enable_mpi=True)
timer.toc("band")
vis.plot_bands(k_len, bands, k_idx, k_label)
if vis.is_master:
    timer.report_total_time()

The tic and toc functions begin and end the recording of time usage; both receive a string argument that tags the record. The visualizer is aware of the parallel environment, so no special treatment is needed when plotting the results. Finally, the time usage is reported with the report_total_time function on the master process only, which is selected by checking the is_master attribute of the visualizer.

We run test_mpi.py by:

$ export OMP_NUM_THREADS=1
$ mpirun -np 1 ./test_mpi.py

With the environment variable OMP_NUM_THREADS set to 1, the script runs in pure MPI mode. We invoke 1 MPI process via the -np option of the MPI launcher mpirun. The output should look like:

band :      11.03s

So the evaluation of the bands takes 11.03 seconds with 1 process. Now we try more processes:

$ mpirun -np 2 ./test_mpi.py
    band :       5.71s
$ mpirun -np 4 ./test_mpi.py
    band :       2.93s

Obviously, the time usage scales inversely with the number of processes. A detailed discussion of the time usage and speedup under different parallelization configurations is given in ref. 4 of Background.
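For a quick sanity check of the scaling, the timings reported above can be converted into speedup and parallel efficiency with a few lines of plain Python (the numbers are simply those printed by the timer):

# Speedup = t(1 process) / t(N processes); efficiency = speedup / N.
timings = {1: 11.03, 2: 5.71, 4: 2.93}
for n, t in timings.items():
    speedup = timings[1] / t
    print(f"{n} processes: speedup {speedup:.2f}, efficiency {speedup / n:.2f}")

which gives parallel efficiencies of about 97% and 94% for 2 and 4 processes, respectively.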

The evaluation of DOS can be parallelized in the same way, by setting the enable_mpi argument to True:

k_mesh = tb.gen_kmesh((20, 20, 1))
timer.tic("dos")
energies, dos = sample.calc_dos(k_mesh, enable_mpi=True)
timer.toc("dos")
vis.plot_dos(energies, dos)
if vis.is_master:
    timer.report_total_time()

The script can be run in the same way as for the band structure calculation.

Response properties from Lindhard function

To evaluate response properties in parallel mode, simply set the enable_mpi argument to True when creating the Lindhard calculator:

lind = tb.Lindhard(cell=cell, energy_max=10.0, energy_step=2048,
                   kmesh_size=(600, 600, 1), mu=0.0, temperature=300.0, g_s=2,
                   back_epsilon=1.0, dimension=2, enable_mpi=True)

Subsequent calls to the functions of the Lindhard class do not need further special treatment. For example, the optical conductivity can be evaluated in the same way as in serial mode:

timer.tic("ac_cond")
omegas, ac_cond = lind.calc_ac_cond(component="xx")
timer.toc("ac_cond")
vis.plot_xy(omegas, ac_cond.real)  # ac_cond is complex; plot its real part
if vis.is_master:
    timer.report_total_time()

Topological invariant from Z2

The evaluation of the phases \(\theta_m^D\) can be parallelized in the same way as the response functions:

# Example sampling grids: ka spans the full Brillouin zone, kb spans half
# of it (fractional coordinates), and kc is fixed.
ka_array = np.linspace(-0.5, 0.5, 200)
kb_array = np.linspace(0.0, 0.5, 200)
kc = 0.0

# num_occ is the number of occupied bands of the model under study.
z2 = tb.Z2(cell, num_occ=10, enable_mpi=True)
timer.tic("z2")
kb_array, phases = z2.calc_phases(ka_array, kb_array, kc)
timer.toc("z2")
vis.plot_phases(kb_array, phases / np.pi)
if vis.is_master:
    timer.report_total_time()

where we only need to set the enable_mpi argument to True when creating the Z2 instance.

Properties from TBPM

TBPM calculations in parallel mode are similar to the evaluation of response functions. The user only needs to set the enable_mpi argument to True. To make the time usage noticeable, we build a larger sample first:

sample = tb.Sample(tb.SuperCell(cell, dim=(240, 240, 1), pbc=(True, True, False)))

Then we create the configuration, solver and analyzer, with the argument enable_mpi=True:

sample.rescale_ham(9.0)
config = tb.Config()
config.generic["nr_random_samples"] = 4
config.generic["nr_time_steps"] = 256
solver = tb.Solver(sample, config, enable_mpi=True)
analyzer = tb.Analyzer(sample, config, enable_mpi=True)

The correlation function can be obtained and analyzed in the same way as in serial mode:

timer.tic("corr_dos")
corr_dos = solver.calc_corr_dos()
timer.toc("corr_dos")
energies, dos = analyzer.calc_dos(corr_dos)
vis.plot_dos(energies, dos)
if vis.is_master:
    timer.report_total_time()

Example scripts for SLURM

If you are using a supercomputer with a queuing system such as SLURM, PBS or LSF, you will also need a batch script for submitting the job. Contact the administrator of the supercomputer for help with preparing the script.

Here we provide two batch scripts for the SLURM queuing system as examples. SLURM has the following options for specifying the parallelization details:

  • nodes: number of nodes for the job

  • ntasks-per-node: number of MPI processes to spawn on each node

  • cpus-per-task: number of OpenMP threads for each MPI process

Suppose that we are going to use 4 initial conditions and 1 node, and that the node has 2 CPUs with 8 cores per CPU. The number of MPI processes should then be 1, 2 or 4, with 16, 8 or 4 OpenMP threads, respectively. We will use 2 processes \(\times\) 8 threads. The batch script is as follows:

#! /bin/bash
#SBATCH --account=alice
#SBATCH --partition=hpib
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --job-name=test_mpi
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Load modules
module load mpi4py tbplas

# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
srun --mpi=pmi2 python ./test_mpi.py

Here we assume the account name to be alice, and we are submitting to the hpib partition. Since we are going to use 1 node, nodes is set to 1. Two MPI processes will be spawned on each node, so ntasks-per-node is set to 2. There are 16 physical cores on the node, so cpus-per-task is set to 8.

If you want pure OpenMP parallelization, here is another example:

#! /bin/bash
#SBATCH --account=alice
#SBATCH --partition=hpib
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=test_omp
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Load modules
module load tbplas

# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
srun python ./test_omp.py

In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores. Don’t forget to remove enable_mpi=True when creating the solver and analyzer, in order to skip unnecessary MPI initialization.