SLURM


What is Slurm?

As the name suggests, Slurm Workload Manager is software that allocates resources for computational jobs based on predefined settings.
Unlike regular (local) program execution, programs (jobs) are launched from the Login (or Controller) server and distributed to one or more of the physical servers (Nodes). Slurm helps define and execute these jobs, and manages users, permissions and resource allocation. It also tracks and displays job details.

Useful Commands

After logging in, you’ll be able to use these commands:

sinfo -N – shows the state of each node:

    • idle: No job currently runs on the node.
    • mix: Some of the node’s CPUs are allocated to jobs and the rest are idle.
    • down: Node error.
    • drain (drng): The node waits for its running jobs to finish and won’t accept new jobs (usually before a reboot).
    • boot: The node is rebooting (usually after an update).
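
To find free nodes quickly, the sinfo output can be filtered by state. The following is a minimal sketch, not an official tool: the helper name list_idle_nodes is hypothetical, and it assumes sinfo is called with -h (no header) and -o "%N %t" (node name and compact state).

```shell
# Hypothetical helper: print the names of nodes whose state is exactly "idle".
# Reads "<node> <state>" lines on stdin. States with a suffix (for example
# "idle*" for a node that is not responding) are deliberately skipped.
list_idle_nodes() {
    awk '$2 == "idle" { print $1 }' | sort -u
}

# Usage on the cluster (a node may be listed once per partition,
# which is why the helper removes duplicates):
#   sinfo -N -h -o "%N %t" | list_idle_nodes
```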

squeue – shows the currently running and pending jobs. Use ‘squeue -u $USER’ to view only your jobs.

Job states (ST):

      • R: The job is running.
      • PD: The job is pending (waiting in queue to run).
      • CG: The job is completing and will be finished soon.

If the job is pending (PD), the NODELIST(REASON) column will indicate why:

      • Priority: A higher priority job is pending execution.
      • Resources: The job has the highest priority and is waiting for resources.
      • QOS*Limit: You have the maximum number of jobs running. The job will launch when another one finishes. Refer to the Job and resource limits section.
      • launch failed requeued held: Job launch failed due to an unknown reason. Use scontrol to requeue the job. Contact us if the problem persists.
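
When several of your jobs are pending, it can help to see how many are waiting for each reason. This is a rough sketch under stated assumptions: summarize_pending_reasons is a hypothetical helper name, and the input is expected to come from squeue with -t PD (pending only), -h (no header) and -o "%i %r" (job id and reason).

```shell
# Hypothetical helper: count pending jobs per reason.
# Reads "<jobid> <reason...>" lines on stdin; the reason may contain spaces.
summarize_pending_reasons() {
    awk 'NF {
             $1 = ""                 # drop the job id field
             sub(/^ +/, "")          # trim the separator left behind
             count[$0]++
         }
         END { for (r in count) printf "%s: %d\n", r, count[r] }' | LC_ALL=C sort
}

# Usage on the cluster:
#   squeue -u "$USER" -t PD -h -o "%i %r" | summarize_pending_reasons
```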

scontrol – used to control jobs. You can only control your jobs.

    • scontrol show job <id>: show details for job with id <id> (JOBID column in squeue).
    • scontrol requeue <id>: requeue job with id <id>
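
The output of ‘scontrol show job’ is a long list of Key=Value pairs, so a small filter makes it easier to read a single field. The sketch below is illustrative only: job_field is a hypothetical helper name, and it assumes the value you want contains no spaces (true for fields such as JobState and Reason, but not for every field).

```shell
# Hypothetical helper: print the value of one Key=Value pair
# from "scontrol show job" output read on stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, key "=") == 1) {      # field starts with "key="
                print substr($i, length(key) + 2)
                exit
            }
    }'
}

# Usage on the cluster, with <id> taken from the JOBID column in squeue:
#   scontrol show job <id> | job_field JobState
#   scontrol show job <id> | job_field Reason
```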

scancel – cancels a job. You can only cancel your jobs.

    • scancel <id>: cancel the job with id <id> (JOBID column in squeue).
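
scancel takes the job id directly as its argument. As a small safety net, the sketch below wraps it in a check that rejects anything that does not look like a job id. The wrapper name cancel_job is hypothetical, and the accepted pattern (digits, plus an underscore for array tasks such as 1234_5) is an assumption for illustration.

```shell
# Hypothetical wrapper: refuse to call scancel unless the argument
# consists only of digits and underscores.
cancel_job() {
    case "$1" in
        '' | *[!0-9_]*)
            echo "usage: cancel_job <job id>" >&2
            return 2
            ;;
    esac
    scancel "$1"
}

# Usage on the cluster:
#   cancel_job 12345
# To cancel all of your pending jobs at once, scancel also accepts
# user and state filters:
#   scancel -u "$USER" -t PENDING
```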

Your home folder is /home/<username>. It is shared across the network, so it is the same folder on all the nodes. If you’re coming from the Rishon cluster, your home folder is the same.
Important: Although the storage system is fault-tolerant, back up your files regularly to an external location. Make sure you have a backup when you finish a course or a project – home folders of inactive users will be deleted without prior notice.

Note: You can connect over SSH from a command shell (on Windows, Linux or Mac), but you may prefer a free graphical client such as MobaXterm, which also provides an SFTP file explorer for easy drag-and-drop file transfer to and from the server.