Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.
This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).
In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs.
- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before.
- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.
For public SDK APIs, change the import path from
```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```
to
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```
`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
As titled. We have a corner case on user laptop where user might left RAY_ADDRESS as http address but restarted local ray cluster. In this case we will try to do job submission with an http prefixed address.
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
This PR refactors several components to support switching to GCS address bootstrapping later:
- Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address.
- Ray client servers support `--address` flag instead of `--redis-address`.
- A few other miscellaneous cleanup.
Also, add a test for starting non-head node with `ray start`.
This PR contains most of the fixes @iycheng made in #21232, to make tests pass with GCS bootstrapping by supporting both Redis and GCS address as the bootstrap address. The main change is to use address_info["address"] to obtain the bootstrap address to pass to ray.init(), instead of using address_info["redis_address"]. In a subsequent PR, address_info["address"] will return the Redis or GCS address depending on whether using GCS to bootstrap.
This reverts commit 968f08607b.
It is breaking e2e tests where worker nodes cannot start. e.g.
```
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start
address_ip, password=redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client
_, redis_ip_address, redis_port = validate_bootstrap_address(redis_address)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address
raise ValueError("Malformed address. Expected '<host>:<port>'.")
ValueError: Malformed address. Expected '<host>:<port>'.
```
This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable.
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping.
Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change.
Also made some cleanups.
Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration.
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>