Newton Computational Cluster

Permalink to this page: https://hpc.cswp.cs.technion.ac.il/newton

This manual will be updated from time to time; in particular, the Troubleshooting section will be updated with solutions to common problems encountered by users.
For technical support contact the IT department.


What is Newton?

Newton is a Slurm-based computational cluster: a group of servers that provides computational resources to researchers.

What is Slurm?

As the name suggests, Slurm Workload Manager is software that allocates resources for computational jobs based on predefined settings.
Contrary to regular (local) program execution, programs (jobs) are launched from the Login (or Controller) server and are distributed to one or more of the physical servers (Nodes). Slurm is the software that helps define and execute these jobs, as well as manage users, permissions and resource allocation. It also helps track and display job details.

Hardware

The cluster is comprised of one Controller and several Nodes.
The Controller is a virtual machine named ‘newton’, to which users log in and from which they launch jobs.
The Nodes are physical machines named newton1, nlp-2080-1, isl-titan etc., on which the jobs run.
SSH connections are only performed to the Controller. Users cannot connect directly to the nodes.

Software

Both the Controller and the Nodes run Ubuntu 18.04 Server and the latest version of Slurm.
Packages installed on the nodes:

  • Nvidia Drivers v450.66
  • CUDA v10.2.89
  • cuDNN v7.6.5
  • OpenMPI v4.0.4
  • OpenGL and xvfb
  • Python 2.7.17 and 3.6.9

User packages

You will most likely need to install additional packages for your own work.
Creating a virtual environment using Miniconda is supported.
pip3 is installed and can be used to install user packages (using the --user argument). See the User Guide for instructions, specifically the User Installs section. Note that Python 2.7 has reached End of Life, so pip (for Python 2.x) is not installed although Python 2.7.17 is.
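For example (a sketch – the package and environment names below are placeholders, and the conda commands assume Miniconda is already installed in your home folder):
pip3 install --user numpy
conda create -n myenv python=3.6
conda activate myenv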


Using Slurm

Where to start?

First of all, you’ll need to receive login and work permissions on Newton. This is usually done by a request from the research supervisor to the IT department.

Note: Newton is not available from outside the Technion. If you’re working outside the campus, you will need to connect to the Technion’s network via VPN as described here: English | Hebrew

After receiving login permissions, SSH to ‘newton.cs.technion.ac.il’:
ssh -X <username>@newton.cs.technion.ac.il
If you’re connected via VPN and you receive an Unknown Host error, use:
ssh -X <username>@132.68.39.200

Enter STAFF\<username> if your user is a Technion user and not a CS one (ask us if you’re not sure).

After logging in, you’ll be able to use these commands:

sinfo – shows the state of each node:

    • idle: No job currently runs on the node.
    • mix: Jobs currently run on this node, but some of its resources are still free.
    • down: Node error.
    • drain: The node waits for its running jobs to finish and won’t accept new ones (usually before a reboot).
    • boot: The node is rebooting (usually after an update).

squeue – shows the currently running and pending jobs. Use ‘squeue -u $USER’ to view only your jobs.

Job states (ST):

      • R: The job is running.
      • PD: The job is pending (waiting in queue to run).
      • CG: The job is completing and will be finished soon.

If the job is pending (PD), the NODELIST(REASON) column will indicate why:

      • Priority: A higher priority job is pending execution.
      • Resources: The job has the highest priority and is waiting for resources.
      • QOS*Limit: You have reached the maximum number of running jobs. The job will launch when another one finishes. Refer to the Job and resource limits section.
      • launch failed requeued held: Job launch failed for an unknown reason. Use scontrol to requeue the job. Contact us if the problem persists.

scontrol – used to control jobs. You can only control your jobs.

    • scontrol show job <id>: show details for job with id <id> (JOBID column in squeue).
    • scontrol requeue <id>: requeue job with id <id>

scancel – cancels a job. You can only cancel your jobs.

    • scancel <id>: cancel the job with id <id>

Your home folder, /home/<username>, is shared across the network and is the same folder on the Controller and on all the nodes.
Important: Although the storage system is fault-tolerant, back up your files regularly to an external location.

Note: Although using SSH from a command shell (on Windows, Linux or Mac) is possible, you may want to take advantage of free graphical clients such as MobaXterm, which also provides an SFTP file explorer for easy drag-and-drop file transfer to and from the server.
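Files can also be transferred (and backed up) from the command line with scp. For example, to copy a results folder from Newton to your own machine, run on your machine (a sketch – the folder and destination path are placeholders):
scp -r <username>@newton.cs.technion.ac.il:~/results /path/to/local/backup/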

How to launch jobs

Slurm has two commands which you can use to launch jobs: srun and sbatch.
srun is straightforward: srun <arguments> <program>
You can use various arguments to control the resource allocation of your job, as well as other settings your job requires.
You can find the full list of available arguments here. Some of the most common arguments are:
-c # – allocate # CPUs per task.
--gres=gpu:# – allocate # GPUs.
--mpi=<mpi_type> – define the type of MPI to use. The default (when not explicitly specified) is pmi2.
--pty – run the job in pseudo-terminal mode. Useful when running bash.
-p <partition> – run the job on the selected partition instead of the default one.
-w <node> – run the job on a specific node.

Check out the Partitions, accounts and QoS section for more on selecting partitions and nodes.

Examples:
1. To get a shell with two GPUs, run:
srun --gres=gpu:2 --pty /bin/bash
Run ‘nvidia-smi’ to verify that the job received two GPUs.
2. Run the script ‘script.py’ using python3:
srun python3 script.py
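3. Combine arguments to run a training script with four CPUs and one GPU (train.py here is just a placeholder for your own script):
srun -c 4 --gres=gpu:1 python3 train.py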

sbatch lets you use a Slurm script to run jobs. You can find more information here.
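For example, a minimal batch script (a sketch – the job name, output file and script names are placeholders; the resource requests mirror the srun arguments above):

#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in squeue
#SBATCH -c 2                     # allocate 2 CPUs
#SBATCH --gres=gpu:1             # allocate 1 GPU
#SBATCH -o my_job-%j.out         # write the job's output to this file (%j is the job id)
python3 script.py

Save it as my_job.sbatch and submit it with:
sbatch my_job.sbatch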

Using containers

What Is Singularity?

Singularity is a free, open-source, cross-platform containerization tool: it creates isolated environments using operating-system-level virtualization, also known as containerization.

Using singularity:

  1. ssh -X <username>@newton.cs.technion.ac.il
    If you’re connected via VPN and you receive an Unknown Host error, use:
    ssh -X <username>@132.68.39.200
  2. Ask for resources:

srun  -c2 --gres=gpu:1 --pty bash

3. Download a Docker image from Docker Hub and open a shell in it:

singularity pull --disable-cache <local_container_name>.sif docker://<container_image>

singularity shell -fw <local_container_name>.sif

-f => get root (fakeroot) access inside the container

-w => writable file system

--nv => use the GPU inside the container

Example:

singularity pull --disable-cache ubuntu.sif docker://ubuntu:18.04

Run and test the container:

singularity shell -fw ubuntu.sif

Singularity> cat /etc/os-release
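To use a GPU inside the container, add the --nv flag (a sketch – it assumes the resources you asked for in step 2 include a GPU):

singularity exec --nv ubuntu.sif nvidia-smi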

You can also build your own image; see Singularity’s build documentation for the build process.


Workload management

Slurm uses a Fair Share queueing system. The system assigns job priority based on cluster load and recent user activity in order to give every user a fair chance to execute jobs.
You will only notice the queuing system when the cluster is full.

Priority calculation

This is a rough description of the priority calculation:
Each resource has a billing “cost”. When a job completes, its total cost is calculated using the allocated resources’ cost and the job’s run time. The user is then “billed” for that cost.
When the user runs another job, it enters the queue. Its priority is determined relative to the other queued jobs – the lower the user’s bill, the higher the priority their job gets.
There are some other parameters used for calculating priority. One of them is queued time – the longer a job waits in the queue, the higher its priority.
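A purely illustrative example (the actual per-resource billing weights are part of the cluster’s internal configuration and are not listed here): if a GPU were billed at 10 units per hour and a CPU at 1 unit per hour, a job that allocated 2 GPUs and 4 CPUs for 3 hours would add 2×10×3 + 4×1×3 = 72 units to the user’s bill.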
Another layer of Fair Share is the Account system: Each course has an account (usually the course’s number). Projects are all in the same account. Billing is also calculated across accounts. Thus, if course A has a lower total billing cost (combined billing costs of all jobs executed from that account) than course B, jobs queued from course A will get priority over course B’s jobs.

Some notes:

  • Cost is calculated for allocated resources, not used resources: If you ask for two GPUs, for example, and only use one – you will be billed for two GPUs. That’s one reason you shouldn’t overschedule resources.
  • Currently, only CPUs and GPUs have a ‘cost’.
  • There may be situations where user X from course A has a higher bill than user Y from course B, and yet the former’s jobs receive a higher priority than the latter’s. This is the desired behavior, meant to prevent the cluster from being choked by a specific course.
  • Billing costs decay over time.
  • Billing cost is for priority calculation only. You won’t be charged real money.
  • As mentioned at the beginning of this section – you will only notice the queueing system when the cluster is full and there aren’t enough available resources to allocate to your job.

Partitions, accounts and QoS

A partition is a logical group of physical nodes. A node can be part of more than one partition. A slurm job will launch on a specific partition, indicated using the -p argument. If a partition is not specified, it will launch on the default partition.
Accounts are groups of users. On Newton, every user belongs to one (and only one) account, which is selected by default. The account a user belongs to defines which partitions they can launch their jobs on. You can check your account by logging into Newton and running the command:
sacctmgr show user $USER withassoc format=user,account
Quality of Service (QoS) is another grouping entity that supports the Fair Share queueing system and is used to define resource limits. Newton currently has only one QoS for all users and there are no resource limits, so you can ignore QoS at this time.

Newton’s nodes belong to staff members and research groups in the faculty. As such, researchers will always receive priority when running jobs on their own servers. Usage of other resources, including by users who don’t belong to any research group, is subject to availability.
The rationale behind creating Newton as a community cluster is to maximize resource utilization, while guaranteeing ‘private’ resource availability and providing computational resources for researchers who don’t have them.

We use the term Golden Ticket to describe priority over resources for a specific group of users on specific nodes, and we use Slurm’s partition system to define golden tickets.

Preemption is the process of deferring a running job in favor of a job with higher priority over the resources. The preemption method Newton uses is Requeue – meaning preempted jobs are returned to the queue instead of being canceled or paused.

Partitions:
public: This partition includes all the nodes in the cluster. Every user can launch jobs on it. It is selected by default. Jobs on this partition can be preempted.
nlp: private partition, only usable by account nlp. No preemption. Nodes: nlp-2080-1, nlp-2080-2
isl: private partition, only usable by account isl. Highest priority. Node: isl-titan

Accounts:
cs – public account. Can only launch jobs on the public partition.
nlp – private account. Can launch jobs on the nlp and public partitions
isl – private account. Can launch jobs on the isl and public partitions

The bottom line(s):
If you’re on the cs account, don’t indicate the working partition (public will be selected by default). Your job will enter the queue and be assigned resources when they become available. If your job has returned to the queue, it means that a higher priority job required the resources your job has been allocated. Your job has been requeued and will continue running when resources become available.

If you’re on a private account, indicate the partition you want to run on using the -p argument. Running on your private partition will give you priority over the cs account, but the job will only launch on one of your private nodes. Indicating the public partition (or not using the -p argument at all) will queue your job for any of the nodes, but the job could be preempted if a higher priority job enters the queue.
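For example (a sketch – it assumes your user belongs to the nlp account):
srun -p nlp --gres=gpu:1 --pty /bin/bash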

Selecting specific nodes:
It is possible to specify nodes for a job. This is done using the -w argument and a comma-separated list of nodes. The node list must be a subset of the partition’s nodes (i.e. you can’t choose isl-titan when working on the nlp partition, for example).
Manually choosing nodes is highly discouraged and should be avoided if possible – Slurm does a decent job deciding which node to assign each job to.
Use case for choosing a specific node:
Each job must be assigned at least one CPU (a CPU, in this case, is a processor thread). When using GPUs, the CPU-GPU affinity must be considered. When all of a node’s GPUs are in use, sometimes not all of its CPUs are. If your job requires only CPUs and no GPUs, it’s worth asking for such a node – this could reduce the possibility that your job will be preempted.
Use the command snode to view the current resource usage of each node in the cluster.
Choose a node using -w <node> (e.g. srun -w nlp-2080-1 ….)

CPU-GPU affinity in a nutshell:
GPUs communicate with CPUs using the PCIe bus. On a single socket system, the CPU controls all the PCIe lanes. On a dual-socket system (Newton’s nodes are all dual-socket), each CPU controls (usually) half of the total PCIe lanes. Usually, CPU0 controls GPU0-3 while CPU1 controls GPU4-7 (on an 8 GPU system).
There may be situations where a node has free (unallocated) GPUs but not enough (or no) free CPU threads to fulfill a job’s requirements, making it appear as if the job is waiting for resources that seem to be available.

Consider this case: a node with 2 CPUs, each with 20 threads, and 8 GPUs. The node runs 2 jobs, each requiring 12 threads and 2 GPUs. The first job is allocated threads 0-11 on CPU0 along with GPUs 0 and 1. The second job also requires 12 threads and 2 GPUs. GPUs 2 and 3 are available, but CPU0 does not have enough free threads (only threads 12-19 are free). The job is therefore allocated threads 0-11 on CPU1 and GPUs 4 and 5 (remember, CPU1 controls GPUs 4-7). At this point, a third job is launched which requires the same amount of resources: 12 threads and 2 GPUs. Overall, the server has 16 free threads (8 on each CPU) and 4 free GPUs (2 per CPU), but no single CPU has enough free threads to fulfill the job’s resource requirements, forcing the job to stay in the queue until one of the two original jobs finishes.
Situations like these are one reason why several smaller jobs are preferable to one big job – they allow for better resource allocation.


Job and resource limits

There are currently no job or resource limits on Newton.


Code of Conduct

Priority algorithms and resource limits have their limitations. Sharing computational resources requires every user to play nice and fair. Be mindful of the fact that other students are using the system – in a typical semester, around 300 students may have permissions on Newton at a time.
Please follow these guidelines so that everyone can have a positive experience:

  • Don’t abuse the queue: Even if there are no limits on the number of jobs you can queue, don’t overflow it.
  • Don’t abuse loopholes: No system is perfect and no system is watertight. If you find a scenario in which you can bypass job, resource or queue limits – please report it.
  • Close idle jobs: The system cannot tell if you’re currently in front of your PC or just left Jupyter Notebook open and went to sleep – these resources could be used by someone else.
  • Don’t overschedule resources: If you need just one GPU – ask for one GPU. Hogging resources needlessly affects everyone – including you, when calculating priority for future jobs.
  • Prefer small jobs over one massive script: If you can modularize your work – please do. This allows for better job scheduling and protects against job failures.
  • Be nice: Your work is as important for you as everyone else’s work for them. Use common sense when sending jobs.
  • Don’t wait until the last minute: The cluster tends to be flooded with jobs on the week before a submission deadline. Take that into account when managing your time.
  • Understand workload management: The cluster promises to run every job at a reasonable time – it does not promise to run your job RIGHT NOW. Again: Manage your time accordingly.


System updates and reboots

As with any computer system, software updates roll out steadily. These include various package, security and driver updates and must be applied on a regular basis in order to maintain system security, stability, usability and performance. These updates often require a system reboot (yes, even on Linux).
Nodes (the physical servers) will enter a DRAINING state upon a reboot request: they will not be allocated new jobs and will reboot after all of their running jobs have completed. After the reboot (typically 2-3 minutes), they automatically return to a ready (IDLE) state. Nodes are rebooted one by one in order to minimize work interruption and maintain maximum cluster availability.
Controller (newton) reboots are more complicated – they will not affect running jobs, but they will end all SSH sessions immediately. There’s currently no good way to schedule these reboots without abruptly closing SSH sessions, so they must be performed during work hours and monitored in case there’s a problem.
If your session was unexpectedly closed – that is probably why. The typical reboot time is ~1 minute. Your running jobs won’t be affected, but unsaved file changes, for example, will be lost. We’re always looking for ways to improve the system and we’ll update this section when a good solution is found. We’ll try to keep Controller reboots to a minimum.


Troubleshooting

Problem: Error message “ssh: Could not resolve hostname newton: Name or service not known”.
Solution: Occurs mainly when connecting from outside the campus. Make sure the VPN is connected and try to SSH to 132.68.39.200 instead.

Problem: Error message “Remote side unexpectedly closed network connection” when trying to upload or download files to or from the server.
Solution: The SSH session limit is 10. Close other connections and try again.

Problem: Job pending.
Solution: Resource over-scheduling (for the pending job) or cluster load. See the Workload management and Job and resource limits sections.

Problem: Error message “Could not load dynamic library ‘libcudart.so.*’”
Solution: If you receive this error on newton (the login server) – launch the script on a node using Slurm. newton is a virtual server, has no GPUs and doesn’t have CUDA installed.
If you receive the error on a node:
Add these lines at the end of the .bashrc file in your home folder (~/.bashrc):
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
Save the file, exit the session (or job) and reconnect. Start bash on a node:
srun --pty /bin/bash
and run this command:
nvcc -V
You should get CUDA’s version details.