Using more than one node with GPUs
The following steps illustrate the process of building NAMD 3.0 from source. This version is not yet available as a module on Kebnekaise. We will compile NAMD with support for multi-node, multi-GPU simulations. The stmv benchmark (~1M atoms) ran ~1.3x faster on 2 nodes than on 1 node. The present setup could help you get better performance for systems containing more than 1M atoms; for systems with fewer than 1M atoms, the single-node GPU NAMD version (installed as a module) is expected to be faster.
1. Download NAMD version 3.0b6 from the NAMD website. Place the *.tar.gz file in your project directory /proj/nobackup/<my-project> and untar the main components:
tar xzf NAMD_3.0b6_Source.tar.gz
cd NAMD_3.0b6_Source
tar xf charm-v7.0.0.tar
The general steps for the NAMD installation are in the file notes.txt, located in the NAMD_3.0b6_Source directory. The specific steps I followed on Kebnekaise are described below.
2. Load the toolchain that will be used:
ml GCC/9.3.0 CUDA/11.0.2 OpenMPI/4.0.3
ml GCCcore/9.3.0 CMake/3.16.4
3. Move to the charm-v7.0.0 directory and build Charm++ with MPI and SMP support:
cd charm-v7.0.0/
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 smp --with-production
cd ..
4. Install the FFTW and TCL libraries:
tar xzf fftw-linux-x86_64.tar.gz
mv linux-x86_64 fftw
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
tar xzf tcl8.5.9-linux-x86_64.tar.gz
tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
mv tcl8.5.9-linux-x86_64 tcl
mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
5. Build NAMD using the Charm++ that was just built:
./config Linux-x86_64-g++ --with-cuda --charm-arch mpi-linux-x86_64-smp
cd Linux-x86_64-g++
make
6. The NAMD executable will be located in /proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++. You can add this location to your $PATH variable to make namd3 available in future sessions.
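As a sketch, assuming the default bash shell, the line below could be appended to your ~/.bashrc (with the <my-project> placeholder replaced by your actual project directory):

```shell
# Append the NAMD build directory to PATH so that namd3 is found
# (replace <my-project> with your project directory name)
export PATH="/proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++:$PATH"
```

After opening a new session (or sourcing ~/.bashrc), `namd3` can be invoked without the full path.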
7. For benchmarking, I downloaded the stmv case mentioned above and used the following batch script to run on 2 nodes, each containing 2 GPUs:
#!/bin/bash
#SBATCH -A project_ID         # Your project at HPC2N
#SBATCH -J namd               # Job name in the queue
#SBATCH -t 00:20:00           # Allocated time
#SBATCH -N 2                  # Number of nodes
#SBATCH -c 28                 # Request the total number of cores per node
#SBATCH -n 2                  # Run 1 MPI process per node
#SBATCH --gres=gpu:v100:2     # 2 GPUs per node requested
#SBATCH --exclusive           # Necessary when the entire node is allocated

# To check the speedup of the multi-node version over the single-node one, I first ran
# the NAMD version installed as a module. To reproduce that run, uncomment the following
# four lines and change the number of nodes and tasks per node above (-N 1 -n 1)
#ml purge > /dev/null 2>&1
#ml GCC/9.3.0 CUDA/11.0.2 OpenMPI/4.0.3
#ml NAMD/2.14-nompi
#namd2 +p28 +setcpuaffinity +idlepoll +devices $CUDA_VISIBLE_DEVICES stmv.namd > output_prod.dat

# Run the NAMD version that was just built, using 2 nodes with 2 GPUs each
ml purge > /dev/null 2>&1
ml GCC/9.3.0 CUDA/11.0.2 OpenMPI/4.0.3

srun -c 28 -n 2 /proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++/namd3 ++ppn 27 +setcpuaffinity stmv.namd > output_prod_2.dat
The performance can be read from the days/ns benchmark lines in both output files. Some remarks:
- NAMD should request all cores in each node (-c 28), and one task should be started per node (-n 2).
- Only 27 worker threads per node (++ppn 27) are requested; this leaves one core per node for the extra communication/management thread handled by NAMD, which saturates the 28 cores in a node.
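As a sketch of how to compare the two runs, the benchmark lines can be pulled out of the logs with grep. The helper below assumes NAMD's standard "Benchmark time" output lines and the log file names used in the batch script above:

```shell
# Print the benchmark lines (reported in days/ns; lower means faster)
# from one or more NAMD log files
show_benchmark() {
    grep -h "Benchmark time" "$@"
}

# Usage, with the log names from the batch script above:
# show_benchmark output_prod.dat output_prod_2.dat
```

Dividing the days/ns value of the single-node run by that of the two-node run gives the speedup (~1.3x for stmv in this setup).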