[Autoscaler] Remove staroid node provider (#22236)

The Staroid node provider has been abandoned and unmaintained for quite some time. Because there are no active maintainers, the original contributors cannot be reached, and there is no clear interest in keeping it, we are removing the provider and no longer officially endorsing or supporting it.

Co-authored-by: Alex Wu <alex@anyscale.com>
Authored by Alex Wu on 2022-02-09 09:18:18 -08:00; committed by GitHub
parent 323511b716
commit c9a419ac76
9 changed files with 0 additions and 1161 deletions


@@ -116,37 +116,6 @@ Ray with cloud providers
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
.. tabbed:: Staroid Kubernetes Engine (contributed)

    The Ray Cluster Launcher can be used to start Ray clusters on an existing Staroid Kubernetes Engine (SKE) cluster.
    First, install the staroid client package (``pip install staroid``) and then get an `access token <https://staroid.com/settings/accesstokens>`_.
    Once you have an access token, you should be ready to launch your cluster.
    The provided `ray/python/ray/autoscaler/staroid/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/staroid/example-full.yaml>`__ cluster config file will create a cluster with

    - a Jupyter notebook running on the head node
      (Staroid management console -> Kubernetes -> ``<your_ske_name>`` -> ``<ray_cluster_name>`` -> click "notebook")
    - a shared NFS volume mounted under the ``/nfs`` directory on all Ray nodes.

    Test that it works by running the following commands from your local machine:

    .. code-block:: bash

        # Configure the access token through an environment variable.
        $ export STAROID_ACCESS_TOKEN=<your access token>

        # Create or update the cluster. When the command finishes,
        # you can attach a screen to the head node.
        $ ray up ray/python/ray/autoscaler/staroid/example-full.yaml

        # Get a remote screen on the head node.
        $ ray attach ray/python/ray/autoscaler/staroid/example-full.yaml
        $ # Try running a Ray program with 'ray.init(address="auto")'.

        # Tear down the cluster.
        $ ray down ray/python/ray/autoscaler/staroid/example-full.yaml

.. tabbed:: Aliyun

    First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
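    For orientation, the ``provider`` section of such a config might look like the sketch below. The key names shown (``access_key``, ``access_key_secret``, ``region``) are assumptions for illustration; check the bundled aliyun example config for the exact schema.

    .. code-block:: yaml

        provider:
            type: aliyun
            # Region in which to launch instances (illustrative value).
            region: cn-hangzhou
            # AccessKey pair of the RAM user with ECS/VPC permissions.
            # Field names are assumed -- verify against the aliyun example config.
            access_key: YOUR_ACCESS_KEY
            access_key_secret: YOUR_ACCESS_KEY_SECRET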


@@ -89,12 +89,6 @@ def _import_kuberay(provider_config):
return KuberayNodeProvider
def _import_staroid(provider_config):
from ray.autoscaler._private.staroid.node_provider import StaroidNodeProvider
return StaroidNodeProvider
def _import_aliyun(provider_config):
from ray.autoscaler._private.aliyun.node_provider import AliyunNodeProvider
@@ -139,12 +133,6 @@ def _load_azure_defaults_config():
return os.path.join(os.path.dirname(ray_azure.__file__), "defaults.yaml")
def _load_staroid_defaults_config():
import ray.autoscaler.staroid as ray_staroid
return os.path.join(os.path.dirname(ray_staroid.__file__), "defaults.yaml")
def _load_aliyun_defaults_config():
import ray.autoscaler.aliyun as ray_aliyun
@@ -164,7 +152,6 @@ _NODE_PROVIDERS = {
"aws": _import_aws,
"gcp": _import_gcp,
"azure": _import_azure,
"staroid": _import_staroid,
"kubernetes": _import_kubernetes,
"kuberay": _import_kuberay,
"aliyun": _import_aliyun,
@@ -179,7 +166,6 @@ _PROVIDER_PRETTY_NAMES = {
"aws": "AWS",
"gcp": "GCP",
"azure": "Azure",
"staroid": "Staroid",
"kubernetes": "Kubernetes",
"kuberay": "Kuberay",
"aliyun": "Aliyun",
@@ -192,7 +178,6 @@ _DEFAULT_CONFIGS = {
"aws": _load_aws_defaults_config,
"gcp": _load_gcp_defaults_config,
"azure": _load_azure_defaults_config,
"staroid": _load_staroid_defaults_config,
"aliyun": _load_aliyun_defaults_config,
"kubernetes": _load_kubernetes_defaults_config,
}
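Removing these built-in entries does not rule out out-of-tree providers. Assuming the "external" provider type in this registry is still supported (it loads a NodeProvider class from the module path given in the provider config), a maintained fork of the Staroid provider could still be wired in from a cluster config along the lines of the sketch below; the module path is hypothetical.

    # Hedged sketch: cluster config pointing the autoscaler at an externally
    # maintained NodeProvider implementation instead of a built-in one.
    provider:
        type: external
        # Fully qualified path to a NodeProvider subclass (hypothetical package).
        module: staroid_provider.node_provider.StaroidNodeProvider
        # Any further keys here are provider-specific and defined by that implementation.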


@@ -1,318 +0,0 @@
# A unique identifier for the head node and workers of this cluster.
# A namespace will be automatically created for each cluster_name in SKE.
cluster_name: default
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
type: staroid
# Access token for Staroid from https://staroid.com/settings/accesstokens.
# Alternatively, you can set STAROID_ACCESS_TOKEN environment variable.
# https://github.com/staroids/staroid-python#configuration
# for more information.
access_token:
# Staroid account to use. e.g. GITHUB/staroids
# Alternatively, you can set STAROID_ACCOUNT environment variable.
# Leave empty to select default account for given access token.
# https://github.com/staroids/staroid-python#configuration
# for more information.
account:
# Name of a Staroid Kubernetes Engine (SKE) instance.
# Alternatively, you can set STAROID_SKE environment variable.
# An SKE is a virtualized Kubernetes cluster.
# A new SKE will be created if it does not exist.
ske: "Ray cluster"
# Cloud and region in which to create the SKE if it does not exist.
# If the SKE already exists, this value is ignored.
# Supported cloud regions are listed at
# https://docs.staroid.com/ske/cloudregion.html.
ske_region: "aws us-west2"
# To create a namespace in SKE, you need to specify a Github project.
# The Github project needs to have a staroid.yaml
# (https://docs.staroid.com/references/staroid_yaml.html).
# staroid.yaml defines various resources for the project, such as
# - Container images that are built and can be accessed from the namespace
# - Kubernetes resources to create (like Persistent volume claim)
# on namespace creation
# You can fork when you need to customize.
# 1. Fork github.com/open-datastudio/ray
# 2. Change the .staroid/ directory to customize
# 3. Connect forked repository (https://staroid.com/projects/settings)
# 4. Release your customized branch
# 4-1. Select project from 'My projects' menu
# 4-2. Select your branch in 'Release' tab
# 4-3. After build success, switch to 'Production'
# 4-4. Switch Launch permission to 'Public' if required
# 5. Change 'project' field to point your
# repository and branch in this file
project: "GITHUB/open-datastudio/ray:master-staroid"
# 'spec.containers.image' field for ray-node and ray-worker will be
# overridden by the image built from the 'project' field above.
# Set this value to 'false' to not override the image.
image_from_project: true
# Python version to use. One of '3.6.9', '3.7.7', '3.8.3'.
# 'project' field above provides docker image for each python version.
# Fork 'project' if you'd like to support other python versions.
python_version: 3.7.7
# Exposing external IP addresses for ray pods isn't currently supported.
use_internal_ips: true
head_node_type: ray.head.default
available_node_types:
ray.head.default:
resources: {"CPU": 1}
min_workers: 0
max_workers: 0
# Kubernetes pod config for the head node pod.
node_config:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-head-
# Must match the head node service selector above if a head node
# service is required.
labels:
component: ray-head
# https://docs.staroid.com/ske/pod.html#pod
pod.staroid.com/spot: "false" # use on-demand instance for head.
# Uncomment to place the ray head on a dedicated Kubernetes node
# (GPU instance is only available for 'dedicated' isolation)
#pod.staroid.com/isolation: dedicated
#pod.staroid.com/instance-type: gpu-1
spec:
automountServiceAccountToken: true
# Restarting the head node automatically is not currently supported.
# If the head node goes down, `ray up` must be run again.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
# nfs volume provides a shared volume across all ray-nodes.
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
# - screen (used for `ray attach`)
# - kubectl (used by the autoscaler to manage worker pods)
# Image will be overridden when 'image_from_project' is true.
image: rayproject/ray
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 6379 # Redis port.
- containerPort: 6380 # Redis port.
- containerPort: 6381 # Redis port.
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /nfs
name: nfs-volume
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 2Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: requests.cpu
- name: RAY_ADDRESS
value: "auto"
ray.worker.default:
min_workers: 0
resources: {"CPU": 1}
# Kubernetes pod config for worker node pods.
node_config:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-worker-
# Must match the worker node service selector above if a worker node
# service is required.
labels:
component: ray-worker
# https://docs.staroid.com/ske/pod.html#pod
pod.staroid.com/spot: "true" # use spot instance for workers.
# Uncomment to place ray workers on dedicated Kubernetes nodes
# (GPU instance is only available for 'dedicated' isolation)
#pod.staroid.com/isolation: dedicated
#pod.staroid.com/instance-type: gpu-1
spec:
serviceAccountName: default
# Worker nodes will be managed automatically by the head node, so
# do not change the restart policy.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
image: rayproject/autoscaler
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /nfs
name: nfs-volume
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
# This memory limit will be detected by ray and split into
# 30% for plasma, and 70% for workers.
memory: 2Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: requests.cpu
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []
# List of shell commands to run to set up nodes.
setup_commands: []
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
# Install the staroid and kubernetes packages; the Staroid node provider used by the autoscaler depends on them.
- pip install -q staroid kubernetes
# install jupyterlab
- pip install -q jupyterlab
- ln -s /nfs /home/ray/nfs
- bash -c 'jupyter-lab --ip="*" --NotebookApp.token="" --NotebookApp.password="" --NotebookApp.allow_origin="*" --NotebookApp.notebook_dir="/home/ray"' &
# show 'notebook' link in staroid management console to access jupyter notebook.
- 'echo -e "kind: Service\napiVersion: v1\nmetadata:\n name: notebook\n annotations:\n service.staroid.com/link: show\nspec:\n ports:\n - name: http\n port: 8888\n selector:\n component: ray-head" | kubectl apply -f -'
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
# Note the dashboard host is set to 0.0.0.0 so that Kubernetes can port forward.
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --num-cpus=$MY_CPU_REQUEST --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --num-cpus=$MY_CPU_REQUEST --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}


@@ -1,330 +0,0 @@
# A unique identifier for the head node and workers of this cluster.
# A namespace will be automatically created for each cluster_name in SKE.
cluster_name: default # name with 'a-z' and '-'
# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 0
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 5
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
type: staroid
# Access token for Staroid from https://staroid.com/settings/accesstokens.
# Alternatively, you can set STAROID_ACCESS_TOKEN environment variable.
# https://github.com/staroids/staroid-python#configuration
# for more information.
access_token:
# Staroid account to use. e.g. GITHUB/staroids
# Alternatively, you can set STAROID_ACCOUNT environment variable.
# Leave empty to select default account for given access token.
# https://github.com/staroids/staroid-python#configuration
# for more information.
account:
# Name of a Staroid Kubernetes Engine (SKE) instance.
# Alternatively, you can set STAROID_SKE environment variable.
# An SKE is a virtualized Kubernetes cluster.
# A new SKE will be created if it does not exist.
ske: "Ray cluster"
# Cloud and region in which to create the SKE if it does not exist.
# If the SKE already exists, this value is ignored.
# Supported cloud regions are listed at
# https://docs.staroid.com/ske/cloudregion.html.
ske_region: "aws us-west2"
# To create a namespace in SKE, you need to specify a Github project.
# The Github project needs to have a staroid.yaml
# (https://docs.staroid.com/references/staroid_yaml.html).
# staroid.yaml defines various resources for the project, such as
# - Container images that are built and can be accessed from the namespace
# - Kubernetes resources to create (like Persistent volume claim)
# on namespace creation
# You can fork when you need to customize.
# 1. Fork github.com/open-datastudio/ray-cluster
# 2. Change contents
# 3. Connect forked repository (https://staroid.com/projects/settings)
# 4. Release your customized branch
# 4-1. Select project from 'My projects' menu
# 4-2. Select your branch in 'Release' tab
# 4-3. After build success, switch to 'Production'
# 4-4. Switch Launch permission to 'Public' if required
# 5. Change 'project' field to point your
# repository and branch in this file
project: "GITHUB/open-datastudio/ray-cluster:master"
# 'spec.containers.image' field for ray-node and ray-worker will be
# overridden by the image built from the 'project' field above.
# Set this value to 'false' to not override the image.
image_from_project: true
# Python version to use. One of '3.6.9', '3.7.7', '3.8.3'.
# 'project' field above provides docker image for each python version.
# Fork 'project' if you'd like to support other python versions.
python_version: 3.7.7
# Exposing external IP addresses for ray pods isn't currently supported.
use_internal_ips: true
# Kubernetes pod config for the head node pod.
head_node:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-head-
# Must match the head node service selector above if a head node
# service is required.
labels:
component: ray-head
# https://docs.staroid.com/ske/pod.html
pod.staroid.com/spot: "false" # use on-demand instance for head.
# Place the ray head on a dedicated Kubernetes node.
# In dedicated mode, resource requests and limits in the pod spec will be
# automatically overridden based on 'pod.staroid.com/instance-type' below.
pod.staroid.com/isolation: dedicated # 'sandboxed' or 'dedicated'
# Instance type to use in 'dedicated' mode, such as 'standard-4', 'gpu-1'.
# See available instance type from https://docs.staroid.com/ske/pod.html.
pod.staroid.com/instance-type: standard-4
spec:
automountServiceAccountToken: true
# Restarting the head node automatically is not currently supported.
# If the head node goes down, `ray up` must be run again.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: tmp-volume
emptyDir: {}
# nfs volume provides a shared volume across all ray-nodes.
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
# - screen (used for `ray attach`)
# - kubectl (used by the autoscaler to manage worker pods)
# Image will be overridden when 'image_from_project' is true.
image: rayproject/ray
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 6379 # Redis port.
- containerPort: 6380 # Redis port.
- containerPort: 6381 # Redis port.
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /tmp
name: tmp-volume
- mountPath: /nfs
name: nfs-volume
resources:
requests:
cpu: 4000m
memory: 8Gi
limits:
cpu: 4000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 8Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: limits.cpu
- name: RAY_ADDRESS
value: "auto"
# Kubernetes pod config for worker node pods.
worker_nodes:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-worker-
# Must match the worker node service selector above if a worker node
# service is required.
labels:
component: ray-worker
# https://docs.staroid.com/ske/pod.html
pod.staroid.com/spot: "true"
# Place ray workers on dedicated Kubernetes nodes.
# In dedicated mode, resource requests and limits in the pod spec will be
# automatically overridden based on 'pod.staroid.com/instance-type' below.
pod.staroid.com/isolation: dedicated # 'sandboxed' or 'dedicated'
# Instance type to use in 'dedicated' mode, such as 'standard-4', 'gpu-1'.
# See available instance type from https://docs.staroid.com/ske/pod.html.
pod.staroid.com/instance-type: standard-4
spec:
serviceAccountName: default
# Worker nodes will be managed automatically by the head node, so
# do not change the restart policy.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: tmp-volume
emptyDir: {}
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
image: rayproject/autoscaler
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /tmp
name: tmp-volume
- mountPath: /nfs
name: nfs-volume
resources:
requests:
cpu: 4000m
memory: 8Gi
limits:
cpu: 4000m
# This memory limit will be detected by ray and split into
# 30% for plasma, and 70% for workers.
memory: 8Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: limits.cpu
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up nodes.
setup_commands: []
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
# Install the staroid and kubernetes packages; the Staroid node provider used by the autoscaler depends on them.
- pip install -q staroid kubernetes
# install jupyterlab
- pip install -q jupyterlab
- ln -s /nfs /home/ray/nfs
- bash -c 'jupyter-lab --ip="*" --NotebookApp.token="" --NotebookApp.password="" --NotebookApp.allow_origin="*" --NotebookApp.notebook_dir="/home/ray"' &
# show 'notebook' link in staroid management console to access jupyter notebook.
- 'echo -e "kind: Service\napiVersion: v1\nmetadata:\n name: notebook\n annotations:\n service.staroid.com/link: show\nspec:\n ports:\n - name: http\n port: 8888\n selector:\n component: ray-head" | kubectl apply -f -'
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
# Note the dashboard host is set to 0.0.0.0 so that Kubernetes can port forward.
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --num-cpus=$MY_CPU_REQUEST --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --num-cpus=$MY_CPU_REQUEST --address=$RAY_HEAD_IP:6379 --object-manager-port=8076


@@ -1,281 +0,0 @@
# A unique identifier for the head node and workers of this cluster.
# A namespace will be automatically created for each cluster_name in SKE.
cluster_name: default # name with 'a-z' and '-'
# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 0
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 5
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
type: staroid
# Access token for Staroid from https://staroid.com/settings/accesstokens.
# Alternatively, you can set STAROID_ACCESS_TOKEN environment variable.
# https://github.com/staroids/staroid-python#configuration
# for more information.
access_token:
# Staroid account to use. e.g. GITHUB/staroids
# Alternatively, you can set STAROID_ACCOUNT environment variable.
# Leave empty to select default account for given access token.
# https://github.com/staroids/staroid-python#configuration
# for more information.
account:
# Name of a Staroid Kubernetes Engine (SKE) instance.
# Alternatively, you can set STAROID_SKE environment variable.
# An SKE is a virtualized Kubernetes cluster.
# A new SKE will be created if it does not exist.
ske: "Ray cluster"
# Cloud and region in which to create the SKE if it does not exist.
# If the SKE already exists, this value is ignored.
# Supported cloud regions are listed at
# https://docs.staroid.com/ske/cloudregion.html.
ske_region: "aws us-west2"
# To create a namespace in SKE, you need to specify a Github project.
# The Github project needs to have a staroid.yaml
# (https://docs.staroid.com/references/staroid_yaml.html).
# staroid.yaml defines various resources for the project, such as
# - Container images that are built and can be accessed from the namespace
# - Kubernetes resources to create (like Persistent volume claim)
# on namespace creation
# You can fork when you need to customize.
# 1. Fork github.com/open-datastudio/ray-cluster
# 2. Change contents
# 3. Connect forked repository (https://staroid.com/projects/settings)
# 4. Release your customized branch
# 4-1. Select project from 'My projects' menu
# 4-2. Select your branch in 'Release' tab
# 4-3. After build success, switch to 'Production'
# 4-4. Switch Launch permission to 'Public' if required
# 5. Change 'project' field to point your
# repository and branch in this file
project: "GITHUB/open-datastudio/ray-cluster:master"
# 'spec.containers.image' field for ray-node and ray-worker will be
# overridden by the image built from the 'project' field above.
# Set this value to 'false' to not override the image.
image_from_project: true
# Python version to use. One of '3.6.9', '3.7.7', '3.8.3'.
# 'project' field above provides docker image for each python version.
# Fork 'project' if you'd like to support other python versions.
python_version: 3.7.7
# Exposing external IP addresses for ray pods isn't currently supported.
use_internal_ips: true
# Kubernetes pod config for the head node pod.
head_node:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-head-
# Must match the head node service selector above if a head node
# service is required.
labels:
component: ray-head
# Locate this Pod to spot instance or not.
# https://docs.staroid.com/ske/pod.html
pod.staroid.com/spot: "false" # use on-demand instance for head.
# Whether to place the ray head on a dedicated Kubernetes node.
# 'sandboxed' (default) or 'dedicated'.
pod.staroid.com/isolation: dedicated
# Instance type to use in 'dedicated' mode, such as 'standard-4', 'gpu-1'.
# See available instance type from https://docs.staroid.com/ske/pod.html.
pod.staroid.com/instance-type: gpu-1
spec:
automountServiceAccountToken: true
# Restarting the head node automatically is not currently supported.
# If the head node goes down, `ray up` must be run again.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: tmp-volume
emptyDir: {}
# nfs volume provides a shared volume across all ray-nodes.
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
# - screen (used for `ray attach`)
# - kubectl (used by the autoscaler to manage worker pods)
# Image will be overridden when 'image_from_project' is true.
image: rayproject/autoscaler
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 6379 # Redis port.
- containerPort: 6380 # Redis port.
- containerPort: 6381 # Redis port.
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /tmp
name: tmp-volume
- mountPath: /nfs
name: nfs-volume
resources:
# in case of 'pod.staroid.com/isolation' is 'dedicated',
# cpu and memory requests/limits in resources field will be
# automatically configured based on
# 'pod.staroid.com/instance-type'
requests:
cpu: 4000m
memory: 8Gi
limits:
cpu: 4000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 8Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: limits.cpu
- name: RAY_ADDRESS
value: "auto"
# Kubernetes pod config for worker node pods.
worker_nodes:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-worker-
# Must match the worker node service selector above if a worker node
# service is required.
labels:
component: ray-worker
# Locate this Pod to spot instance or not.
# https://docs.staroid.com/ske/pod.html
pod.staroid.com/spot: "true" # use on-demand instance for head.
# Locate ray head to dedicated Kubernetes node or not.
# 'sandboxed' (default) or 'dedicated'.
pod.staroid.com/isolation: dedicated
# Instance type to use in 'dedicated' mode, such as 'standard-4', 'gpu-1'.
# See available instance type from https://docs.staroid.com/ske/pod.html.
pod.staroid.com/instance-type: gpu-1
spec:
serviceAccountName: default
# Worker nodes will be managed automatically by the head node, so
# do not change the restart policy.
restartPolicy: Never
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: tmp-volume
emptyDir: {}
- name: nfs-volume
persistentVolumeClaim:
claimName: nfs
containers:
- name: ray-node
imagePullPolicy: Always
# You are free (and encouraged) to use your own container image,
# but it should have the following installed:
# - rsync (used for `ray rsync` commands and file mounts)
image: rayproject/autoscaler
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["touch ~/.bashrc; trap : TERM INT; sleep infinity & wait;"]
ports:
- containerPort: 22345 # Ray internal communication.
- containerPort: 22346 # Ray internal communication.
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /tmp
name: tmp-volume
- mountPath: /nfs
name: nfs-volume
resources:
# in case of 'pod.staroid.com/isolation' is 'dedicated',
# cpu and memory requests/limits in resources field will be
# automatically configured based on
# 'pod.staroid.com/instance-type'
requests:
cpu: 4000m
memory: 8Gi
limits:
cpu: 4000m
# This memory limit will be detected by ray and split into
# 30% for plasma, and 70% for workers.
memory: 8Gi
env:
# This is used in the head_start_ray_commands below so that
# Ray can spawn the correct number of processes. Omitting this
# may lead to degraded performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: limits.cpu


@@ -1,72 +0,0 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal # name with 'a-z' and '-'
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers default to 0.
max_workers: 5
# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
type: staroid
# Access token for Staroid from https://staroid.com/settings/accesstokens.
# Alternatively, you can set STAROID_ACCESS_TOKEN environment variable.
# https://github.com/staroids/staroid-python#configuration
# for more information.
access_token:
# Staroid account to use. e.g. GITHUB/staroids
# Alternatively, you can set STAROID_ACCOUNT environment variable.
# Leave empty to select default account for given access token.
# https://github.com/staroids/staroid-python#configuration
# for more information.
account:
# Name of a Staroid Kubernetes Engine (SKE) instance.
# Alternatively, you can set STAROID_SKE environment variable.
# An SKE is a virtualized Kubernetes cluster.
# A new SKE will be created if it does not exist.
ske: "Ray cluster"
# Cloud and region in which to create the SKE if it does not exist.
# If the SKE already exists, this value is ignored.
# Supported cloud regions are listed at
# https://docs.staroid.com/ske/cloudregion.html.
ske_region: "aws us-west2"
# To create a namespace in SKE, you need to specify a Github project.
# The Github project needs to have a staroid.yaml
# (https://docs.staroid.com/references/staroid_yaml.html).
# staroid.yaml defines various resources for the project, such as
# - Container images that are built and can be accessed from the namespace
# - Kubernetes resources to create (like Persistent volume claim)
# on namespace creation
# You can fork when you need to customize.
# 1. Fork github.com/open-datastudio/ray-cluster
# 2. Change contents
# 3. Connect forked repository (https://staroid.com/projects/settings)
# 4. Release your customized branch
# 4-1. Select project from 'My projects' menu
# 4-2. Select your branch in 'Release' tab
# 4-3. After build success, switch to 'Production'
# 4-4. Switch Launch permission to 'Public' if required
# 5. Change 'project' field to point your
# repository and branch in this file
project: "GITHUB/open-datastudio/ray-cluster:master"
# 'spec.containers.image' field for ray-node and ray-worker will be
# overridden by the image built from the 'project' field above.
# Set this value to 'false' to not override the image.
image_from_project: true
# Python version to use. One of '3.6.9', '3.7.7', '3.8.3'.
# 'project' field above provides docker image for each python version.
# Fork 'project' if you'd like to support other python versions.
python_version: 3.7.7
# Exposing external IP addresses for ray pods isn't currently supported.
use_internal_ips: true


@@ -1,113 +0,0 @@
# an example of configuring a mixed-node-type cluster.
cluster_name: multi-node-type # name with 'a-z' and '-'
max_workers: 40
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# Cloud-provider specific configuration.
provider:
type: staroid
access_token:
account:
ske: "Ray cluster"
ske_region: "aws us-west2"
project: "GITHUB/open-datastudio/ray-cluster:master"
image_from_project: true
python_version: 3.7.7
use_internal_ips: true
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
cpu_2_ondemand:
node_config:
metadata:
labels:
pod.staroid.com/spot: "false"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-2
resources: {"CPU": 2}
max_workers: 10
cpu_4_ondemand:
node_config:
metadata:
labels:
pod.staroid.com/spot: "false"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-4
resources: {"CPU": 4}
max_workers: 10
cpu_8_ondemand:
node_config:
metadata:
labels:
pod.staroid.com/spot: "false"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-8
resources: {"CPU": 8}
max_workers: 10
gpu_1_ondemand:
node_config:
metadata:
labels:
pod.staroid.com/spot: "false"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: gpu-1
resources: {"CPU": 8, "GPU": 1, "accelerator_type:V100": 1}
max_workers: 10
cpu_2_spot:
node_config:
metadata:
labels:
pod.staroid.com/spot: "true"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-2
resources: {"CPU": 2}
max_workers: 10
cpu_4_spot:
node_config:
metadata:
labels:
pod.staroid.com/spot: "true"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-4
resources: {"CPU": 4}
max_workers: 10
cpu_8_spot:
node_config:
metadata:
labels:
pod.staroid.com/spot: "true"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: standard-8
resources: {"CPU": 8}
max_workers: 10
# worker_setup_commands:
# - pip install tensorflow-gpu # Example command.
gpu_1_spot:
node_config:
metadata:
labels:
pod.staroid.com/spot: "true"
pod.staroid.com/isolation: dedicated
pod.staroid.com/instance-type: gpu-1
resources: {"CPU": 8, "GPU": 1, "accelerator_type:V100": 1}
max_workers: 10
# Specify the node type of the head node (as configured above).
head_node_type: cpu_4_ondemand
# The default settings for the head node. This will be merged with the per-node
# type configs given above.
#head_node:
# The default settings for worker nodes. This will be merged with the per-node
# type configs given above.
#worker_nodes:
idle_timeout_minutes: 5


@@ -185,7 +185,6 @@ ray_files += [
"ray/autoscaler/local/defaults.yaml",
"ray/autoscaler/kubernetes/defaults.yaml",
"ray/autoscaler/_private/_kubernetes/kubectl-rsync.sh",
"ray/autoscaler/staroid/defaults.yaml",
"ray/autoscaler/ray-schema.json",
]