.. _ray-slurm-deploy:

Deploying on Slurm
==================

Clusters managed by Slurm may require that Ray be initialized as part of the submitted job. This can be done by using ``srun`` within the submitted script. For example:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --cpus-per-task=5
    #SBATCH --mem-per-cpu=1GB
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=1

    worker_num=2 # Must be one less than the total number of nodes

    # module load Langs/Python/3.6.4 # This will vary depending on your environment
    # source venv/bin/activate

    nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
    nodes_array=( $nodes )
    node1=${nodes_array[0]}

    ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Getting the head node IP address
    suffix=':6379'
    ip_head=$ip_prefix$suffix
    redis_password=$(uuidgen)

    export ip_head # Exporting for later access by trainer.py

    srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
    sleep 5
    # Make sure the head successfully starts before any worker does; otherwise
    # the workers will not be able to connect to Redis. In case of a longer
    # startup delay, increase the sleep time above to ensure the proper order.

    for (( i=1; i<=$worker_num; i++ ))
    do
      node2=${nodes_array[$i]}
      srun --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
      # The --block flag keeps the ray process alive on each compute node.
      sleep 5
    done

    python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs (3 nodes * 5 CPUs per task)
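
Assuming the batch script above is saved as, for example, ``ray_cluster.sh`` (the file name is only illustrative), it is submitted and monitored with the standard Slurm tools:

.. code-block:: bash

    # Submit the job; Slurm prints the assigned job ID on submission.
    sbatch ray_cluster.sh

    # Check the state of the job (pending, running, or completed).
    squeue -u $USER

The ``trainer.py`` script launched on the last line of the batch script is shown below.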

.. code-block:: python

    # trainer.py
    from collections import Counter
    import os
    import sys
    import time
    import ray

    redis_password = sys.argv[1]
    num_cpus = int(sys.argv[2])

    # Connect to the Ray cluster started by the batch script above.
    ray.init(address=os.environ["ip_head"], redis_password=redis_password)

    print("Nodes in the Ray cluster:")
    print(ray.nodes())

    @ray.remote
    def f():
        # Return the IP address of the node this task ran on.
        time.sleep(1)
        return ray.services.get_node_ip_address()

    # Each iteration of the loop below should take about one second, since the
    # num_cpus tasks run in parallel (assuming that Ray was able to access all
    # of the allocated nodes).
    for i in range(60):
        start = time.time()
        ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
        print(Counter(ip_addresses))
        end = time.time()
        print(end - start)
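
Before launching the actual workload, it can also be useful to verify that all of the allocated nodes have joined the cluster (for example, if the ``sleep`` intervals in the batch script turn out to be too short, a worker may still be starting). Below is a minimal sketch of such a check, assuming a hypothetical helper script ``check_cluster.py`` launched from the batch script in the same way as ``trainer.py``:

.. code-block:: python

    # check_cluster.py -- hypothetical helper, run as:
    #   python -u check_cluster.py $redis_password
    import os
    import sys
    import time
    import ray

    redis_password = sys.argv[1]

    # Connect to the running cluster started by the batch script.
    ray.init(address=os.environ["ip_head"], redis_password=redis_password)

    # Wait until all three allocated nodes (1 head + 2 workers) appear in
    # ray.nodes(); give up after roughly one minute.
    expected_nodes = 3
    for _ in range(60):
        if len(ray.nodes()) >= expected_nodes:
            break
        time.sleep(1)

    print("Connected nodes:", len(ray.nodes()))
    print("Cluster resources:", ray.cluster_resources())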