Parallelization
TBPLaS features hybrid MPI+OpenMP parallelization for the evaluation of band structure and DOS from exact diagonalization, response properties from the Lindhard function, the topological invariant \(\mathbb{Z}_2\), and TBPM calculations. MPI and OpenMP can be switched on and off separately on demand, while pure OpenMP mode is enabled by default.
The number of OpenMP threads is controlled by the OMP_NUM_THREADS environment variable. If TBPLaS has been compiled with MKL support, the MKL_NUM_THREADS environment variable will also take effect. If neither environment variable is set, OpenMP will make use of all the CPU cores on the computing node. To switch off OpenMP, set both environment variables to 1.
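For example, the thread counts can be set in the shell before launching a script (the values shown are illustrative):

```shell
# Run with 4 OpenMP threads (and 4 MKL threads, if TBPLaS was built with MKL)
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

# Switch off OpenMP by limiting both libraries to a single thread
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
```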
By contrast, MPI-based parallelization is disabled by default, but can be easily enabled with a single option. The calc_bands and calc_dos functions of the PrimitiveCell and Sample classes, and the initialization functions of the Lindhard, Z2, Solver and Analyzer classes, all accept an argument named enable_mpi, whose default value is False. If it is set to True, MPI-based parallelization is turned on, provided that the MPI4PY package has been installed. Hybrid MPI+OpenMP parallelization is achieved by enabling MPI and OpenMP simultaneously. The number of processes is controlled by the MPI launcher, which receives arguments from the command line, environment variables or a configuration file. Users are recommended to check the manual of the job queuing system on their computer for properly setting the environment variables and invoking the MPI launcher. For computers without a queuing system, e.g., laptops, desktops and standalone workstations, the MPI launcher is typically mpirun or mpiexec, and the number of processes is controlled by the -np command-line option.
The optimal parallelization configuration, i.e., the numbers of MPI processes and OpenMP threads, depends on the hardware, the model size and the type of calculation. Generally speaking, matrix diagonalization for a single \(\mathbf{k}\)-point is poorly parallelized over threads, but the diagonalization for multiple \(\mathbf{k}\)-points can be efficiently parallelized over processes. Therefore, for band structure and DOS calculations, as well as response properties from the Lindhard function and the topological invariant from Z2, it is recommended to run in pure MPI mode by setting the number of MPI processes to the total number of allocated CPU cores and the number of OpenMP threads to 1. However, MPI-based parallelization uses more RAM, since every process has to keep a copy of the wave functions and energies. So, if the available RAM imposes a limit, try fewer processes and more threads. In any case, the product of the numbers of processes and threads should equal the number of allocated CPU cores. For example, if you have allocated 16 cores, you can try 16 processes \(\times\) 1 thread, 8 processes \(\times\) 2 threads, 4 processes \(\times\) 4 threads, etc.

For TBPM calculations, the number of random initial wave functions should be divisible by the number of processes. For example, if you are going to consider 16 initial wave functions, the number of processes should be 1, 2, 4, 8, or 16. The number of threads should then be set according to the number of processes. Again, if the RAM size is a problem, try to decrease the number of processes and increase the number of threads.

Note that if your computer has HyperThreading enabled in the BIOS or UEFI, the number of available cores will be twice the number of physical cores. DO NOT use the virtual cores from HyperThreading, since there will be a significant performance loss. Check the handbook of your CPU for the number of physical cores.
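The constraints above can be collected into a small helper. This is only an illustrative sketch; valid_configs is a hypothetical name, not part of the TBPLaS API:

```python
def valid_configs(total_cores, num_wavefunctions=None):
    """List (processes, threads) pairs whose product equals total_cores.

    For TBPM runs, additionally require that the number of random initial
    wave functions is divisible by the number of processes.
    """
    configs = []
    for nproc in range(1, total_cores + 1):
        if total_cores % nproc != 0:
            continue
        if num_wavefunctions is not None and num_wavefunctions % nproc != 0:
            continue
        configs.append((nproc, total_cores // nproc))
    return configs

# 16 allocated cores -> [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
print(valid_configs(16))
# TBPM with 4 initial wave functions -> [(1, 16), (2, 8), (4, 4)]
print(valid_configs(16, num_wavefunctions=4))
```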
If MPI-based parallelization is enabled, either in pure MPI or hybrid MPI+OpenMP mode, special care should be taken with the output and plotting parts of the job script. These operations should be performed on the master process only; otherwise the output will be messed up or files will get corrupted, since all the processes will try to modify the same file or plot the same data. This situation is avoided by checking the rank of the process before taking action. The Lindhard, Z2, Solver, Analyzer and Visualizer classes all offer an is_master attribute to detect the master process, whose usage will be demonstrated in the following sections.
Last but not least, we have to mention that all the calculations in previous tutorials can be run in either interactive or batch mode. You can input the script line by line in the terminal, or save it to a file and pass the file to the Python interpreter. However, MPI-based parallelization supports only the batch mode, since there is no way to type input into the terminals of multiple processes at the same time. In the following sections, we assume the script file to be test_mpi.py. A common head block of the script is given in Band structure and DOS and will not be explicitly repeated in subsequent sections.
Band structure and DOS
We demonstrate the usage of calc_bands and calc_dos in parallel mode by calculating the band structure and DOS of a \(12\times12\times1\) graphene sample. The procedure shown here is also valid for the primitive cell. To enable MPI-based parallelization, we need to save the script to a file, for instance, test_mpi.py. The head block of this file should be:
#! /usr/bin/env python
import numpy as np
import tbplas as tb
timer = tb.Timer()
vis = tb.Visualizer(enable_mpi=True)
where the first line is a magic line declaring that the script should be interpreted by the Python program. In the following lines we import the necessary packages. To record and report the time usage, we create a timer from the Timer class. We also need a visualizer for plotting the results, whose enable_mpi argument is set to True during initialization. This head block is also essential for the other examples in subsequent sections.
For convenience, we will not build the primitive cell from scratch, but import it from the material repository with the make_graphene_diamond() function:
cell = tb.make_graphene_diamond()
Then we build the sample by:
sample = tb.Sample(tb.SuperCell(cell, dim=(12, 12, 1), pbc=(True, True, False)))
The evaluation of the band structure in parallel mode is similar to the serial mode, which also involves generating the \(\mathbf{k}\)-path and calling calc_bands. The only difference is that we need to set the enable_mpi argument to True when calling calc_bands:
k_points = np.array([
[0.0, 0.0, 0.0],
[2./3, 1./3, 0.0],
[1./2, 0.0, 0.0],
[0.0, 0.0, 0.0],
])
k_label = ["G", "K", "M", "G"]  # labels of the high-symmetry k-points, used by plot_bands below
k_path, k_idx = tb.gen_kpath(k_points, [40, 40, 40])
timer.tic("band")
k_len, bands = sample.calc_bands(k_path, enable_mpi=True)
timer.toc("band")
vis.plot_bands(k_len, bands, k_idx, k_label)
if vis.is_master:
timer.report_total_time()
The tic and toc functions begin and end the recording of time usage; they receive a string as the argument for tagging the record. The visualizer is aware of the parallel environment, so no special treatment is needed when plotting the results. Finally, the time usage is reported with the report_total_time function on the master process only, by checking the is_master attribute of the visualizer.
We run test_mpi.py by:
$ export OMP_NUM_THREADS=1
$ mpirun -np 1 ./test_mpi.py
With the environment variable OMP_NUM_THREADS set to 1, the script runs in pure MPI mode. We invoke 1 MPI process via the -np option of the MPI launcher mpirun. The output should look like:
band : 11.03s
So, the evaluation of bands takes 11.03 seconds on 1 process. We try with more processes:
$ mpirun -np 2 ./test_mpi.py
band : 5.71s
$ mpirun -np 4 ./test_mpi.py
band : 2.93s
Obviously, the time usage scales inversely with the number of processes. The time usage and speedup under different parallelization configurations are discussed in detail in ref. 4 of Background.
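From the measured wall times, the speedup and parallel efficiency follow directly; a simple arithmetic sketch using the numbers above:

```python
# Wall times (seconds) measured above for 1, 2 and 4 MPI processes
timings = {1: 11.03, 2: 5.71, 4: 2.93}
t_serial = timings[1]
for nproc in sorted(timings):
    speedup = t_serial / timings[nproc]
    efficiency = speedup / nproc
    print(f"{nproc} process(es): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```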
Evaluation of DOS can be parallelized in the same way, by setting the enable_mpi argument to True:
k_mesh = tb.gen_kmesh((20, 20, 1))
timer.tic("dos")
energies, dos = sample.calc_dos(k_mesh, enable_mpi=True)
timer.toc("dos")
vis.plot_dos(energies, dos)
if vis.is_master:
timer.report_total_time()
The script can be run in the same way as for the band structure.
Response properties from Lindhard function
To evaluate response properties in parallel mode, simply set the enable_mpi argument to True when creating the Lindhard calculator:
lind = tb.Lindhard(cell=cell, energy_max=10.0, energy_step=2048,
kmesh_size=(600, 600, 1), mu=0.0, temperature=300.0, g_s=2,
back_epsilon=1.0, dimension=2, enable_mpi=True)
Subsequent calls to the functions of the Lindhard class do not need further special treatment. For example, the optical conductivity can be evaluated in the same way as in serial mode:
timer.tic("ac_cond")
omegas, ac_cond = lind.calc_ac_cond(component="xx")
timer.toc("ac_cond")
vis.plot_xy(omegas, ac_cond)
if vis.is_master:
timer.report_total_time()
Topological invariant from Z2
The evaluation of the phases \(\theta_m^D\) can be parallelized in the same way as the response functions:
# The arrays sampling the k-grid must be defined beforehand; the values
# below are illustrative and should be adjusted to your system.
ka_array = np.linspace(-0.5, 0.5, 200)
kb_array = np.linspace(0.0, 0.5, 200)
kc = 0.0
z2 = tb.Z2(cell, num_occ=10, enable_mpi=True)
timer.tic("z2")
kb_array, phases = z2.calc_phases(ka_array, kb_array, kc)
timer.toc("z2")
vis.plot_phases(kb_array, phases / np.pi)
if vis.is_master:
timer.report_total_time()
where we only need to set the enable_mpi argument to True when creating the Z2 instance.
Properties from TBPM
TBPM calculations in parallel mode are similar to the evaluation of response functions: the user only needs to set the enable_mpi argument to True. To make the time usage noticeable, we build a larger sample first:
sample = tb.Sample(tb.SuperCell(cell, dim=(240, 240, 1), pbc=(True, True, False)))
Then we create the configuration, solver and analyzer, with the argument enable_mpi=True:
sample.rescale_ham(9.0)
config = tb.Config()
config.generic["nr_random_samples"] = 4
config.generic["nr_time_steps"] = 256
solver = tb.Solver(sample, config, enable_mpi=True)
analyzer = tb.Analyzer(sample, config, enable_mpi=True)
The correlation function can be obtained and analyzed in the same way as in serial mode:
timer.tic("corr_dos")
corr_dos = solver.calc_corr_dos()
timer.toc("corr_dos")
energies, dos = analyzer.calc_dos(corr_dos)
vis.plot_dos(energies, dos)
if vis.is_master:
timer.report_total_time()
Example scripts for SLURM
If you are using a supercomputer with a queuing system like SLURM, PBS or LSF, then you need another batch script for submitting the job. Contact the administrator of the supercomputer for help on preparing the script.
Here we provide two batch scripts for the SLURM queuing system as examples. SLURM has the following options for specifying parallelization details:
nodes: number of nodes for the job
ntasks-per-node: number of MPI processes to spawn on each node
cpus-per-task: number of OpenMP threads for each MPI process
Suppose that we are going to use 4 initial conditions and 1 node, where the node has 2 CPUs with 8 cores per CPU. The number of MPI processes should be 1, 2 or 4, with the corresponding number of OpenMP threads being 16, 8 or 4, respectively. Here we will use 2 processes \(\times\) 8 threads. The batch script is as follows:
#! /bin/bash
#SBATCH --account=alice
#SBATCH --partition=hpib
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --job-name=test_mpi
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
# Load modules
module load mpi4py tbplas
# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
srun --mpi=pmi2 python ./test_mpi.py
Here we assume the user name to be alice, and we are submitting to the hpib partition. Since we are going to use 1 node, nodes is set to 1. For each node 2 MPI processes will be spawned, so ntasks-per-node is set to 2. There are 16 physical cores on the node, so cpus-per-task is set to 8.
If you want pure OpenMP parallelization, here is another example:
#! /bin/bash
#SBATCH --account=alice
#SBATCH --partition=hpib
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=test_omp
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
# Load modules
module load tbplas
# Set number of threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Change to working directory and run the job
cd $SLURM_SUBMIT_DIR
srun python ./test_omp.py
In this script the number of processes is set to 1, and the number of threads per process is set to the total number of physical cores. Don't forget to remove enable_mpi=True when creating the solver and analyzer, in order to skip unnecessary MPI initialization.