Gipdeep Computational Cluster

Permalink to this page: https://hpc.cswp.cs.technion.ac.il/gipdeep

This manual is updated from time to time; in particular, the Troubleshooting section is updated with solutions to common problems encountered by users.
For technical support contact the IT department.

Note: Do not copy-paste commands from this page into the terminal. Type them manually instead. Encoding issues might cause you to receive slurmstepd: error: execve() errors, specifically when copying hyphens (-).

Table of Contents

  • What is Gipdeep?
  • What is Slurm?
  • Hardware
  • Software
  • User packages
  • Using Slurm
  • Selecting specific resources
  • Workload management
  • Job and resource limits
  • Code of Conduct
  • System updates and reboots
  • Mailing list
  • Troubleshooting

What is Gipdeep?

Gipdeep is a Slurm-based computational cluster: a group of servers that provides computational resources to researchers.

What is Slurm?

As the name suggests, Slurm Workload Manager is software that allocates resources to computational jobs based on predefined settings.
Unlike regular (local) program execution, programs (jobs) are launched from the Login (or Controller) server and are distributed to one or more of the physical servers (Nodes). Slurm defines and executes these jobs, manages users, permissions and resource allocation, and also tracks and displays job details.

Hardware

The cluster consists of one Controller and several Nodes.
The Controller is a virtual machine named ‘gipdeep’, to which users log in and from which they launch jobs.
The Nodes are physical machines named gipdeep1, gipdeep2 etc., on which the jobs run.
SSH connections are only performed to the Controller. Users cannot connect directly to the nodes.

Software

Both the Controller and the Nodes run Ubuntu 18.04 Server and the latest version of Slurm.
Packages installed on the nodes:

  • Nvidia Drivers
  • CUDA v11.1 on nodes with Ampere GPUs (3090, A6000), CUDA v10.2.89 for the other nodes
  • cuDNN v8.1.1 on Ampere nodes, cuDNN v7.6.5 for the others
  • OpenMPI v4.0.4
  • OpenGL and xvfb
  • Python 2.7.17 and 3.6.9
  • Openslide

User packages

You will most likely need to install additional packages for your own work.
Creating a virtual environment using Miniconda is supported.
Note that you can create a Miniconda environment with whichever CUDA and cuDNN versions you need, regardless of the versions installed on the servers.
pip3 is installed and can be used to install user packages (using the --user argument). See the pip User Guide for instructions, specifically the User Installs section. Note that Python 2.7 has reached End of Life, so pip (for Python 2.x) is not installed although Python 2.7.17 is.
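
For example, a minimal sketch of setting up a Miniconda environment with its own CUDA toolkit (the environment name, Python version, package and channel names below are only illustrative):

conda create -n myenv python=3.8
conda activate myenv
# install a framework together with the CUDA toolkit build it needs,
# independently of the CUDA version installed on the nodes
conda install pytorch cudatoolkit=11.1 -c pytorch -c nvidia

Alternatively, install a package for your user only with pip3:

pip3 install --user <package>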

top

Using Slurm

Where to start?

First of all, you’ll need to receive login and work permissions on Gipdeep. This is usually done by a request to the GIP lab’s engineer, Yaron Honen.

Note: Gipdeep is not available from outside the Technion. If you’re working outside the campus, you will need to connect to the Technion’s network via VPN as described here: English | Hebrew

After receiving login permissions, SSH to ‘gipdeep.cs.technion.ac.il’:
ssh -X <username>@gipdeep.cs.technion.ac.il
If you’re connected via VPN and receive an Unknown Host error, use:
ssh -X <username>@132.68.39.112
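
Optionally, you can add an entry like the following to the ~/.ssh/config file on your own machine, so that a plain ‘ssh gipdeep’ works (replace <username> with your user name):

Host gipdeep
    HostName gipdeep.cs.technion.ac.il
    User <username>
    ForwardX11 yes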

After logging in, you’ll be able to use these commands:

sinfo – shows the state of each node:

    • idle: No job currently runs on the node.
    • mix: Jobs currently run on this node.
    • alloc: All of the node’s CPUs are currently allocated.
    • down: Node error.
    • drain: The node is waiting for its jobs to finish and won’t accept new jobs (usually before a reboot).
    • boot: The node is rebooting (usually after an update).

squeue – shows currently running and pending jobs. Use ‘squeue -u $USER’ to view only your jobs.

Job states (ST):

      • R: The job is running.
      • PD: The job is pending (waiting in queue to run).
      • CG: The job is completing and will be finished soon.

If the job is pending (PD), the NODELIST(REASON) column will indicate why:

      • Priority: A higher priority job is pending execution.
      • Resources: The job has the highest priority and is waiting for resources.
      • QOS*Limit: You have reached a per-user QoS limit (for example, on the number of allocated GPUs). The job will launch when another one finishes. Refer to the Job and resource limits section.
      • launch failed requeued held: Job launch failed due to an unknown reason. Use scontrol to requeue the job. Contact us if the problem persists.

scontrol – used to control jobs. You can only control your jobs.

    • scontrol show job <id>: show details for job with id <id> (JOBID column in squeue).
    • scontrol requeue <id>: requeue job with id <id>

scancel – cancels a job. You can only cancel your jobs.

    • scancel <id>
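
Putting these commands together, a short example session might look like this (the job id 1234 is just an illustration):

sinfo                   # check the state of the nodes
squeue -u $USER         # list only your jobs
scontrol show job 1234  # inspect one of them
scancel 1234            # cancel it if it's no longer needed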

Your home folder, /home/<username>, is shared across the network and is the same folder on the Controller and all the nodes.
Important: Although the storage system is fault-tolerant, backup your files regularly to an external location.

Note: Although using SSH via a command shell (on Windows, Linux or Mac) is possible, you may want to take advantage of free graphical clients such as MobaXterm, which also provides an SFTP file explorer for easy-to-use drag-and-drop file transfer to and from the server.

How to launch jobs

Slurm has two commands for launching jobs: srun and sbatch.
srun is straightforward: srun <arguments> <program>
You can use various arguments to control the resource allocation of your job, as well as other settings your job requires.
You can find the full list of available arguments here. Some of the most common arguments are:
-c # – allocate # CPUs per task.
--gres=gpu:# – allocate # GPUs.
--gres=gpu:<type>:# – allocate # GPUs of type <type>.
--mpi=<mpi_type> – define the type of MPI to use. The default (when not explicitly specified) is pmi2.
--pty – run the job in pseudo-terminal mode. Useful when running bash.
-p <partition> – run the job on the selected partition instead of the default one.
-w <node> – run the job on a specific node.

Examples:
1. To get a shell with two GPUs, run:
srun --gres=gpu:2 --pty /bin/bash
Run ‘nvidia-smi’ to verify that the job received two GPUs.
2. Run the script ‘script.py’ using python3:
srun python3 script.py

sbatch lets you use a Slurm script to run jobs. You can find more information here.
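
For illustration, a minimal sbatch script might look like the following (the job name, output file and resource values are placeholders; adjust them to your needs). Save it as, e.g., job.sbatch and submit it with ‘sbatch job.sbatch’:

#!/bin/bash
#SBATCH --job-name=my_job        # name shown in squeue
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH -c 4                     # request four CPUs
#SBATCH -o my_job_%j.out         # write output to my_job_<jobid>.out

python3 script.py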

top

Selecting specific resources

It is possible to select specific nodes and GPU types when launching jobs:

Nodes:
Use the argument -w to select a specific node to run on. You can also specify a list of nodes. For example:
srun -w gipdeep4,gipdeep5 …

Get the list of available nodes and their state using the command sinfo -N.

Run the command snode to list the number of allocated, available and total CPUs and GPUs for every node in the cluster.

GPU type:
You can request a specific GPU type using the --gres argument. For example:
srun --gres=gpu:titanx:2 …

List of GPU types and their codenames:

TeslaP100 – Tesla P100-PCIE-12GB
1080 – GeForce GTX 1080 8GB
1080ti – GeForce GTX 1080 Ti 11GB
titanx – GeForce GTX TITAN X 12GB
2080ti – GeForce RTX 2080 Ti 11GB
3090 – GeForce RTX 3090 24GB
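
You can also combine node and GPU type selection, for example (the node and GPU type here are arbitrary; make sure the node you request actually has GPUs of that type, e.g. using snode or sgpu, described below):

srun -w gipdeep4 --gres=gpu:2080ti:1 --pty /bin/bash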

You can check current GPU usage across all nodes using the command sgpu.
Use sgpu_stat <node> to view GPU usage for a specific node, e.g. sgpu_stat gipdeep1.
sgpu shows the processes running on GPUs across all nodes. Note that a GPU with no running process is not necessarily available for scheduling: a job can allocate resources without using them.

You can check current RAM and Swap utilization using the command smem.
Use smem_stat <node> to view RAM and Swap usage for a specific node, e.g. smem_stat gipdeep1.

top

Workload management

Slurm uses a Fair Share queueing system. The system assigns job priority based on cluster load and recent user activity in order to give every user a fair chance to execute jobs.
You will only notice the queuing system when the cluster is full.

Priority calculation

This is a rough description of the priority calculation:
Each resource has a billing “cost”. When a job completes, its total cost is calculated using the allocated resources’ cost and the job’s run time. The user is then “billed” for that cost.
When the user runs another job, it enters the queue. Its priority is determined relative to the other queued jobs – the lower the user’s bill, the higher the priority their job gets.
There are some other parameters used for calculating priority. One of them is queued time – the longer a job waits in the queue, the higher its priority.
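
For illustration only (these numbers are made up and do not reflect the actual costs on Gipdeep): if a GPU were billed at 10 units per hour and a CPU at 1 unit per hour, a job that allocated 2 GPUs and 8 CPUs for 3 hours would add 2*10*3 + 8*1*3 = 84 units to the user’s bill, lowering the priority of their subsequent queued jobs relative to users with smaller bills.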

Some notes:

  • Cost is calculated for allocated resources, not used resources: If you ask for two GPUs, for example, and only use one – you will be billed for two GPUs. That’s one reason you shouldn’t overschedule resources.
  • Currently, only CPUs and GPUs have a ‘cost’.
  • Billing costs decay over time.
  • Billing cost is for priority calculation only. You won’t be charged real money.
  • As mentioned at the beginning of this section – you will only notice the queueing system when the cluster is full and there aren’t enough available resources to allocate to your job.

Partitions, accounts and QoS

A partition is a logical group of physical nodes. A node can be part of more than one partition. A Slurm job launches on a specific partition, indicated using the -p argument. If a partition is not specified, the job launches on the default partition.
Accounts are groups of users. On Gipdeep, every user belongs to one (and only one) account, which is selected by default. The account a user belongs to defines which partitions they can launch their jobs on. You can check your account by logging into Gipdeep and running the command:
sacctmgr show user $USER withassoc format=user,account
Quality of Service (QoS) is another grouping entity that supports the Fair Share queueing system and defines resource limits. Currently, a per-user GPU limit is defined on the default QoS (see the Job and resource limits section). You can ignore QoS at this time.

Generally, research is the default account, all is the default partition and normal is the default QoS, and you don’t need to include those when running jobs. Only specific research groups have separate accounts and partitions.
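
If you do belong to a group with its own account and partition, you can specify them explicitly when launching a job (the names below are placeholders):

srun -p <partition> -A <account> --gres=gpu:1 --pty /bin/bash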

top

Job and resource limits

Users are limited to 4 GPUs and 40 CPUs at any given time, without limit on the number of jobs.

top

Code of Conduct

Priority algorithms and resource limits have their limitations. Sharing computational resources requires every user to play nice and fair. Be mindful of the fact that other researchers are also using the system.
Please follow these guidelines so that everyone can have a positive experience:

  • Don’t abuse the queue: Even if there are no limits on the number of jobs you can queue, don’t overflow it.
  • Don’t abuse loopholes: No system is perfect and no system is watertight. If you find a scenario in which you can bypass job, resource or queue limits – please report it.
  • Close idle jobs: The system cannot tell whether you’re currently in front of your PC or just left a Jupyter Notebook open and went to sleep – those resources could be used by someone else.
  • Don’t overschedule resources: If you need just one GPU – ask for one GPU. Hogging resources needlessly affects everyone – including you, when calculating priority for future jobs.
  • Prefer small jobs over one massive script: If you can split your work into smaller, modular jobs – please do. This makes scheduling easier and protects against job failures.
  • Be nice: Your work is as important to you as everyone else’s work is to them. Use common sense when submitting jobs.
  • Understand workload management: The cluster promises to run every job within a reasonable time – it does not promise to run your job RIGHT NOW. Manage your time accordingly.

top

System updates and reboots

As with any computer system, software updates roll out steadily. These include various package, security and driver updates, and they must be applied regularly in order to maintain system security, stability, usability and performance. Such updates often require a system reboot (yes, even on Linux).
Nodes (the physical servers) enter a DRAINING state upon a reboot request. They will not be allocated new jobs and will reboot only after all running jobs have completed. After the reboot (typically 2-3 minutes), they automatically return to a ready (IDLE) state. Nodes are rebooted one by one in order to minimize work interruption and maintain maximum cluster availability.
Controller (gipdeep) reboots are more complicated – they will not affect running jobs, but will end all SSH sessions immediately. There’s currently no good solution for scheduling these reboots without abruptly closing SSH sessions, so they must be performed during work hours and monitored in case there’s a problem.
If your session was unexpectedly closed – that is probably why. The typical reboot time is ~1 minute. Your running jobs won’t be affected, but unsaved file changes, for example, will be lost. We’re always looking for ways to improve the system and we’ll update this section when a good solution is found. We’ll try to keep Controller reboots to a minimum.

top

Mailing list

There’s a mailing list for Gipdeep users which is used to send service messages and other important information. In order to register for that list, send us your E-Mail address and full name in English, either via the helpdesk site or to our E-Mail directly. The format is:
<E-Mail> <first_name> <last_name>
e.g. a10n@cs.technion.ac.il Alon Gil-Ad
You can use whatever E-Mail you choose – it doesn’t have to be a Technion E-Mail.
Only cluster admins can send messages to the mailing list, and only important messages will be sent.
It isn’t obligatory to register for the mailing list, but messages will not be communicated by any other means, so you’re encouraged to do so.

top

Troubleshooting

Problem: Error message “ssh: Could not resolve hostname gipdeep: Name or service not known”.
Solution: Occurs mainly when connecting from outside the campus. Make sure the VPN is connected and try to SSH to 132.68.39.112 instead.

Problem: Error message “Remote side unexpectedly closed network connection” when trying to upload or download files to or from the server.
Solution: The SSH session limit is 10. Close other connections and try again.

Problem: Job pending.
Solution: Resource over-scheduling (for the pending job) or cluster load. See the Workload management and Job and resource limits sections.

Problem: Error message “Could not load dynamic library ‘libcudart.so.*’”
Solution: If you receive this error on gipdeep (the login server) – launch the script on a node using Slurm. gipdeep is a virtual server; it has no GPUs and doesn’t have CUDA installed.
If you receive the error on a node:
Add these lines at the end of the .bashrc file in your home folder (~/.bashrc):
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
Save the file, exit the session (or job) and reconnect. Start bash on a node:
srun --pty /bin/bash
and run this command:
nvcc -V
You should get CUDA’s version details.
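
For example, on a node with CUDA 10.2 installed, the output should end with a line similar to the following (the exact numbers depend on the node’s CUDA version):

Cuda compilation tools, release 10.2, V10.2.89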