Lambda Computational Cluster

 

Permalink to this page: https://hpc.cswp.cs.technion.ac.il/lambda

Please read this tutorial through, even if you’ve worked on the Rishon cluster before.
This manual is updated from time to time, and the Troubleshooting section is expanded with solutions to common problems encountered by users.
For technical support contact a10n[at]cs.technion.ac.il.

Usage request form for student projects: [docx]


 

What is Lambda?

Lambda is a Slurm-based computational cluster: a group of servers that provides computational resources to the faculty’s undergraduate students.

What is Slurm?

As the name suggests, Slurm Workload Manager is software that allocates resources to computational jobs based on predefined settings.
Unlike regular (local) program execution, programs (jobs) are launched from the Login (or Controller) server and are distributed to one or more of the physical servers (Nodes). Slurm is the software that helps define and execute these jobs, and it also manages users, permissions and resource allocation, and tracks and displays job details.

Hardware

The cluster consists of one Controller and several Nodes.
The Controller is a virtual machine named ‘lambda’, to which users log in and from which they launch jobs.
The Nodes are physical machines named lambda1 through lambda5, on which the jobs run. Although the Nodes are not identical, each has two Xeon CPUs, several Nvidia GPUs and a considerable amount of RAM.
SSH connections are made only to the Controller; users cannot connect directly to the Nodes.

Software

Both the Controller and the Nodes run Ubuntu 18.04 Server and the latest version of Slurm.
Packages installed on the nodes:

  • Nvidia Drivers v450.66
  • CUDA v10.2.89
  • cuDNN v7.6.5
  • OpenMPI v4.0.4
  • OpenGL and xvfb
  • Python 2.7.17 and 3.6.9

User packages

You will most likely need to install additional packages for your own work.
Creating a virtual environment using Miniconda is supported.
pip3 is installed and can be used to install user packages (using the --user argument). See the User Guide for instructions, specifically the User Installs section. Note that Python 2.7 has reached End of Life, so pip (for Python 2.x) is not installed although Python 2.7.17 is.
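For example, a typical setup might look like the following sketch (assuming Miniconda is already installed in your home folder; the environment and package names are just placeholders):
conda create -n myenv python=3.6      # create an environment named myenv
conda activate myenv                  # activate it before installing or running anything
conda install numpy                   # install packages into the active environment
pip3 install --user matplotlib        # alternatively, install into your user site-packages with pip3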

 


Using Slurm

Where to start?

First of all, you’ll need to receive login and work permissions on Lambda. If you’re part of a course, the TA in charge will take care of that. If you’ve enrolled in a project that requires the use of the cluster, ask your project supervisor or lab engineer to contact the IT team and request usage permissions for you and your partner. The request must include your @campus email addresses, as those are used for authentication.

Note: Lambda is not available from outside the Technion. If you’re working outside the campus, you will need to connect to the Technion’s network via VPN as described here: English | Hebrew

After receiving login permissions, SSH to ‘lambda.cs.technion.ac.il’:
ssh -X <username>@lambda.cs.technion.ac.il
If you’re connected via VPN and you receive an Unknown Host error, use:
ssh -X <username>@132.68.39.159

Your username is your @campus.technion.ac.il username and the password is the same one you use to log in to that account.

After logging in, you’ll be able to use these commands:

sinfo – shows the state of each node:

    • idle: No job currently runs on the node.
    • mix: Jobs are currently running on the node (some of its resources are allocated, some are free).
    • down: Node error.
    • drain: The node is waiting for its running jobs to finish and won’t accept new jobs (usually before a reboot).
    • boot: The node is rebooting (usually after an update).

squeue – shows the currently running and pending jobs. Use ‘squeue -u $USER’ to view only your jobs.

Job states (ST):

      • R: The job is running.
      • PD: The job is pending (waiting in queue to run).
      • CG: The job is completing and will be finished soon.

If the job is pending (PD), the NODELIST(REASON) column will indicate why:

      • Priority: One or more higher-priority jobs are ahead of yours in the queue.
      • Resources: The job has the highest priority and is waiting for resources.
      • QOS*Limit: You have reached one of your QoS limits (usually the maximum number of running jobs). The job will launch when another one finishes. Refer to the Job and resource limits section.
      • launch failed requeued held: The job launch failed for an unknown reason. Use scontrol to requeue the job. Contact us if the problem persists.

scontrol – used to control jobs. You can only control your jobs.

    • scontrol show job <id>: show details for job with id <id> (JOBID column in squeue).
    • scontrol requeue <id>: requeue the job with id <id>.

scancel – cancels a job. You can only cancel your jobs.

    • scancel <id>

Your home folder is /home/<username>; it is shared across the network and is the same folder on all the Nodes. If you’re coming from the Rishon cluster, your home folder is the same one.
Important: Although the storage system is fault-tolerant, back up your files regularly to an external location. Make sure you have a backup when you finish a course or a project – home folders of inactive users will be deleted without prior notice.
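For example, one way to pull a copy of a project folder to your own computer is rsync over SSH (run from your local machine; the folder names below are placeholders):
rsync -avz <username>@lambda.cs.technion.ac.il:/home/<username>/my_project ./lambda-backup/
scp -r or a graphical SFTP client (see the note below) works just as well.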

Note: Although using SSH via a command shell (on Windows, Linux or Mac) is possible, you may want to take advantage of a free graphical client such as MobaXterm, which also provides an SFTP file explorer for easy drag-and-drop file transfer to and from the server.

How to launch jobs

Slurm has two commands which you can use to launch jobs: srun and sbatch.
srun is straightforward: srun <arguments> <program>
You can use various arguments to control the resource allocation of your job, as well as other settings your job requires.
You can find the full list of available arguments here. Some of the most common arguments are:
-c # – allocate # CPUs per task.
--gres=gpu:# – allocate # GPUs.
--mpi=<mpi_type> – define the type of MPI to use. The default (when not explicitly specified) is pmi2.
--pty – run the job in pseudo-terminal mode. Useful when running bash.

Examples:
1. To get a shell with two GPUs, run:
srun --gres=gpu:2 --pty /bin/bash
Run ‘nvidia-smi’ to verify that the job received two GPUs.
2. Run the script ‘script.py’ using python3:
srun python3 script.py
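3. You can also combine arguments; for instance (the script name is a placeholder), to allocate four CPUs and one GPU for a single script:
srun -c 4 --gres=gpu:1 python3 train.py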

sbatch lets you use a Slurm script to run jobs. You can find more information here.
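A minimal batch script sketch (the file name, job name and resource values below are illustrative; adjust them to your needs and to your account’s limits):

#!/bin/bash
#SBATCH --job-name=my_job        # name shown in squeue
#SBATCH -c 2                     # 2 CPUs per task
#SBATCH --gres=gpu:1             # 1 GPU
#SBATCH --time=0-12:00:00        # time limit (days-hours:minutes:seconds)
#SBATCH --output=job_%j.log      # stdout and stderr file; %j is replaced by the job id
python3 script.py

Save it as, say, my_job.slurm and submit it with ‘sbatch my_job.slurm’. The job then appears in squeue like any other job.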

 


Workload management

Slurm uses a Fair Share queueing system. The system assigns job priority based on cluster load and recent user activity in order to give every user a fair chance to execute jobs.
You will only notice the queuing system when the cluster is full (usually in the week preceding courses’ exercise submission deadlines).

Priority calculation

This is a rough description of the priority calculation:
Each resource has a billing “cost”. When a job completes, its total cost is calculated using the allocated resources’ cost and the job’s run time. The user is then “billed” for that cost.
When the user runs another job, it enters the queue. Its priority is determined relative to the other queued jobs – the lower the user’s bill, the higher the priority their job gets.
There are some other parameters used for calculating priority. One of them is queued time – the longer a job waits in the queue, the higher its priority.
Another layer of Fair Share is the Account system: Each course has an account (usually the course’s number). Projects are all in the same account. Billing is also calculated across accounts. Thus, if course A has a lower total billing cost (combined billing costs of all jobs executed from that account) than course B, jobs queued from course A will get priority over course B’s jobs.
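As a hypothetical illustration (the actual per-resource costs are internal and may differ): if a GPU were billed at 10 units per hour and a CPU at 1 unit per hour, a job that allocates one GPU and four CPUs and runs for two hours would add (1×10 + 4×1) × 2 = 28 units to the user’s (and the account’s) bill, lowering the priority of that user’s subsequent queued jobs accordingly.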

Some notes:

  • Cost is calculated for allocated resources, not used resources: If you ask for two GPUs, for example, and only use one – you will be billed for two GPUs. That’s one reason you shouldn’t overschedule resources.
  • Currently, only CPUs and GPUs have a ‘cost’.
  • There may be situations where user X from course A has a higher billing than user Y from course B, and yet the former’s jobs receive a higher priority than the latter’s. This is the desired behavior: it prevents the cluster from being choked by a specific course.
  • Billing costs decay over time.
  • Billing cost is for priority calculation only. You won’t be charged real money.
  • As mentioned at the beginning of this section – you will only notice the queueing system when the cluster is full and there aren’t enough available resources to allocate to your job.

Partitions, accounts and QoS

A partition is a group of nodes. Slurm allows defining different partitions for micromanaging different types of jobs. We don’t need this functionality and don’t use it. Only one partition is defined (“all”) and is the default for everyone.

Quality of Service (QoS) is another control group entity that helps the Fair Share queueing system. We use account-specific QoS. Don’t try to change your job’s QoS – it won’t work (by design).

Accounts are groups of users. Each course has its own account. All projects belong to a single account. A user can belong to more than one account (if they enroll in more than one course or project).
Every user also has a Default Account, which is used when a job is launched without specifying an account.
If you’re enrolled in more than one course (or a course and a project) and have received permission to use Lambda on both, your user belongs to more than one account. Generally, the last course you’ve enrolled in will be your default account. Because every account has its own resource limits and billing, you can choose a different account for each job you launch, in order to make full use of your resource permissions:
Run ‘sacctmgr show user $USER withassoc’ to view which accounts your user belongs to.
Use the ‘--account=<account>’ or ‘-a <account>’ argument to choose which account the job runs on (e.g. srun -a projects --pty /bin/bash).

 


Job and resource limits

In order to avoid choking the cluster and to allow everyone to run their jobs, job and resource limits are imposed. These are account-wide and may differ between accounts.
The limits are defined by the QoS. Only one QoS is assigned to each account and is set as its default.
To view the job and resource limits for an account:
Find the relevant QoS using ‘sacctmgr show user $USER withassoc’.
Show QoS details using ‘sacctmgr show qos <QoS>’ (replace <QoS> with the relevant QoS).
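Put together (the QoS name ‘projects-qos’ below is just a placeholder; use whichever QoS the first command lists for your account):
sacctmgr show user $USER withassoc     # lists your accounts and their QoS
sacctmgr show qos projects-qos         # shows the limits that QoS imposes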

Effective job limits

These will be modified over time according to resource availability and needs:

Courses:

  • Maximum number of running jobs per user: 1
  • Maximum combined running and queued jobs per user: 5
  • Maximum running time per job: 1 day
  • Maximum allowed GPUs to be allocated per job: 1
  • Total maximum allowed GPUs to be allocated per user concurrently: 1

Projects:

  • Maximum number of running jobs per user: 3
  • Maximum combined running and queued jobs per user: 10
  • Maximum running time per job: 7 days
  • Total maximum allowed GPUs to be allocated per user concurrently: 3

 


Code of Conduct

Priority algorithms and resource limits have their limitations. Sharing computational resources requires every user to play nice and fair. Be mindful that other students are using the system: in a typical semester, around 300 students may have permissions on Lambda at any given time.
Please follow these guidelines so that everyone can have a positive experience:

  • Don’t abuse the queue: Even if there are no limits on the number of jobs you can queue, don’t overflow it.
  • Don’t abuse loopholes: No system is perfect and no system is watertight. If you find a scenario in which you can bypass job, resource or queue limits – please report it.
  • Close idle jobs: The system cannot tell whether you’re currently in front of your PC or just left a Jupyter Notebook open and went to sleep – the resources an idle job holds could be used by someone else.
  • Don’t overschedule resources: If you need just one GPU – ask for one GPU. Hogging resources needlessly affects everyone – including you, when calculating priority for future jobs.
  • Prefer small jobs over one massive script: If you can split your work into smaller jobs – please do. This allows better job scheduling and protects against failed jobs.
  • Be nice: Your work is as important to you as everyone else’s work is to them. Use common sense when sending jobs.
  • Don’t wait until the last minute: The cluster tends to be flooded with jobs on the week before a submission deadline. Take that into account when managing your time.
  • Understand workload management: The cluster promises to run every job at a reasonable time – it does not promise to run your job RIGHT NOW. Again: Manage your time accordingly.

 


System updates and reboots

As with any computer system, software updates roll out steadily. These include various package, security and driver updates and must be applied regularly in order to maintain system security, stability, usability and performance. These updates often require a system reboot (yes, even on Linux).
Nodes (the physical servers, lambda1-5) enter a DRAINING state upon a reboot request. They will not be allocated new jobs and will reboot after all running jobs have completed. After the reboot (typically 2-3 minutes), they automatically return to a ready (IDLE) state. Nodes are rebooted one by one in order to minimize work interruption and maintain maximum cluster availability.
Controller (lambda) reboots are more complicated – they will not affect running jobs, but will end all SSH sessions immediately. There’s currently no good way to schedule these reboots without abruptly closing SSH sessions, so they must be performed during work hours and monitored in case there’s a problem.
If your session was unexpectedly closed – that is probably why. The typical reboot time is ~1 minute. Your running jobs won’t be affected, but unsaved file changes, for example, will be lost. We’re always looking for ways to improve the system and we’ll update this section when a good solution is found. We’ll try to keep Controller reboots to a minimum.

 


Troubleshooting

Problem: Error message “ssh: Could not resolve hostname lambda: Name or service not known”.
Solution: Occurs mainly when connecting from outside the campus. Make sure the VPN is connected and try to SSH to 132.68.39.159 instead.

Problem: Error message “Remote side unexpectedly closed network connection” when trying to upload or download files to or from the server.
Solution: The SSH session limit is 10. Close other connections and try again.

Problem: Job pending.
Solution: Resource over-scheduling (for the pending job) or cluster load. See the Workload management and Job and resource limits sections.

Problem: Error message “Could not load dynamic library ‘libcudart.so.*'”
Solution: If you receive this error on lambda (the login server) – launch the script on a node using Slurm. lambda is a virtual server, has no GPUs and doesn’t have CUDA installed.
If you receive the error on a node:
Add these lines at the end of the .bashrc file in your home folder (~/.bashrc):
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
Save the file, exit the session (or job) and reconnect. Start bash on a node:
srun --pty /bin/bash
and run this command:
nvcc -V
You should get CUDA’s version details.