Commit graph

6562 commits

Author SHA1 Message Date
Edward Oakes
f22a34bd4f
Restore "[Serve] Implement Default DAGDriver (#23301)" (#23373) 2022-03-21 10:35:00 -07:00
Kai Fricke
b64452bc63
[tune] Add multinode sync test (#23229)
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
2022-03-21 17:02:17 +00:00
Guyang Song
69af9764b2
[runtime env] URI reference refactor (#22828)
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528

Future work
- This PR does not remove `RuntimeEnvUris` from the `RuntimeEnv` protobuf, because GCS also uses those URIs for GC via runtime_env_manager. This should be cleaned up as well.
- The Ray client server shouldn't interact with the agent directly. Alternatively, the Ray client server should also decrease the reference count.
- Currently, `WorkerPool::HandleJobStarted` is called multiple times for one job, so this function must be idempotent. Can we change this logic so that the function is called only once?
2022-03-21 11:21:15 -05:00
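The URI reference counting that #22828 moves into the agent can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not Ray's agent code: the class `UriRefTable` and its methods `create_or_get` and `delete_if_possible` are hypothetical names, chosen only to mirror what the RPC names `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible` suggest.

```python
from collections import Counter


class UriRefTable:
    """Hypothetical reference table for runtime env URIs (illustration only)."""

    def __init__(self):
        self._refs = Counter()   # URI -> number of live references
        self._created = set()    # URIs whose resources currently exist

    def create_or_get(self, uri):
        """Add a reference; return True only if the URI had to be created."""
        self._refs[uri] += 1
        if uri in self._created:
            return False
        self._created.add(uri)
        return True

    def delete_if_possible(self, uri):
        """Drop a reference; return True only if the URI was actually deleted."""
        if self._refs[uri] == 0:
            return False  # unknown or already fully released
        self._refs[uri] -= 1
        if self._refs[uri] > 0:
            return False  # someone else still uses it
        del self._refs[uri]
        self._created.discard(uri)
        return True


table = UriRefTable()
uri = "s3://bucket/env.zip"              # made-up URI for the example
first = table.create_or_get(uri)         # first user creates the env
second = table.create_or_get(uri)        # second user reuses it
early = table.delete_if_possible(uri)    # one reference remains
final = table.delete_if_possible(uri)    # last reference released
```

Under these assumptions, a caller that never releases its reference keeps the URI alive indefinitely, which is the concern raised about the Ray client server in the future-work list.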
Stephanie Wang
e507aa5758
Revert "[Serve] Implement Default DAGDriver (#23301)" (#23358)
This reverts commit 91a1c3411f.
2022-03-21 10:54:52 -05:00
Larry
81dcf9ff35
[Placement Group] Make PlacementGroupID generate from JobID (#23175) 2022-03-21 17:09:16 +08:00
Avnish Narayan
e008a48ef2
[release tests] Pin gym everywhere (#23349) 2022-03-19 02:52:54 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
Simon Mo
91a1c3411f
[Serve] Implement Default DAGDriver (#23301) 2022-03-18 18:07:39 -07:00
Siyuan (Ryans) Zhuang
65cc877ad8
[workflow] Ensure that DAGs are dereferenced like ObjectRefs in Ray tasks (#23320) 2022-03-18 17:02:15 -07:00
Jiao
9b38b6de47
[Serve] [Pipeline] Default all DeploymentNode route_prefix to None, and "/" for the root driver (#23289) 2022-03-18 16:56:49 -07:00
shrekris-anyscale
c668039020
[serve] Restore "Get new handle to controller if killed" (#23283) (#23338)
#23336 reverted #23283. #23283 passed CI before merging, but it began to fail once merged because `test_cli.py` used commands that were outdated on the master branch (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its test commands.
2022-03-18 18:40:08 -05:00
Jiao
49e0ab2f58
[Serve] [Pipeline] Use ServeSchema for deployment prevent config got overridden (#23324) 2022-03-18 15:25:32 -07:00
mwtian
909cdea3cd
[Python Worker] add feature flag to support forking from workers (#23260)
Make sure Python dependencies can be imported on demand, without the background importer thread. Use cases are:

- If the pubsub notification for a new export is lost, importing can still be done.
- Allow not running the background importer thread, without affecting Ray's functionalities.

Add a feature flag to support forking from Python workers, by:

- Enable fork support in gRPC.
- Disable importer thread and only leave the main thread in the Python worker. The importer thread will not run after forking anyway.
2022-03-18 14:47:18 -07:00
Junwen Yao
8fff665455
[Train] Add torch data prefetch benchmark example (#22974)
Add a benchmark example for the auto pipeline functionality for host to device data transfer.
2022-03-18 13:27:26 -07:00
Eric Liang
c4b52d34ca
Initial PR for internal storage API (#22889) 2022-03-18 12:32:40 -07:00
shrekris-anyscale
87e77bebb4
Revert "[serve] Get new handle to controller if killed (#23283)" (#23336)
This reverts commit 9f6d96a2fd.
2022-03-18 13:47:57 -05:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10)))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Adds support for setting a timeout, in seconds, for runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
Archit Kulkarni
16fd099b8b
[runtime env] Change pip_check default from True to False (#23306)
@SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more, I think a more reasonable default for `pip_check` is `False`. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up. Another, less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.

This also matches the default behavior of pip, which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather, the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR").
2022-03-18 12:51:41 -05:00
shrekris-anyscale
9f6d96a2fd
[serve] Get new handle to controller if killed (#23283)
`serve shutdown` is not idempotent with the new Serve CLI. When serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, which fail.

This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.
2022-03-18 11:47:18 -05:00
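The handle-refresh behavior described for #23283 can be illustrated with a small stand-alone sketch. Nothing here is Serve's actual code: `FakeController`, `ControllerDiedError`, and `Client` are hypothetical stand-ins, and the "retry once with a fresh handle" policy is an assumption about one reasonable way to implement the fix.

```python
class ControllerDiedError(Exception):
    """Hypothetical stand-in for the error raised when calling a dead actor."""


class FakeController:
    """Toy controller; `alive` simulates whether the actor has been killed."""

    def __init__(self):
        self.alive = True

    def ping(self):
        if not self.alive:
            raise ControllerDiedError("controller was killed")
        return "ok"


class Client:
    """Caches a controller handle and refreshes it if the controller is dead."""

    def __init__(self, connect):
        self._connect = connect          # callable returning a fresh handle
        self._controller = connect()

    def call(self):
        try:
            return self._controller.ping()
        except ControllerDiedError:
            # The cached handle is stale: fetch a new one and retry once.
            self._controller = self._connect()
            return self._controller.ping()


client = Client(FakeController)
client._controller.alive = False   # simulate a shutdown killing the controller
result = client.call()             # succeeds because the handle is refreshed
```

Without the `except` branch, every later call would keep hitting the dead handle, which matches the failure mode the commit message describes for repeated `serve shutdown` calls.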
shrekris-anyscale
aaf47b2493
[serve] Implement serve.run() and Application (#23157)
These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 11:12:09 -05:00
Kai Fricke
3836333aac
[ml/air] Checkpoints serialization/deserialization support (#23275)
This PR adds support for checkpoint ser/de. In particular this is special casing the local data representation, which will be converted into a bytes checkpoint on serialization. This way checkpoint objects sent to remote tasks are guaranteed to always point to a valid data location within the remote task.
We are not detecting pickling to/from disk (e.g. to pickle files) for now.
2022-03-18 13:10:37 +00:00
Amog Kamsetty
bb4ff42eec
[ml] TorchTrainer bug fixes + GPU test (#23293)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-03-17 23:49:42 -07:00
Amog Kamsetty
0f9233fc01
[ml] Switch from tune.run to Tuner.fit (#23282)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-03-17 23:48:38 -07:00
Jiao
e02577adb7
[Pipeline] Add and use RayServeLazyHandle for DAG deployment args (#23256) 2022-03-17 22:58:31 -07:00
matthewdeng
2298bcc3f9
[ml] raise error when serializing Predictor (#23267) 2022-03-17 21:11:34 -07:00
Andrew Li
1a293a1187
Providing additional useful messages for JSONDecodeError (#23116)
Following #22535, this adds more useful information to the error reported when a JSONDecodeError is encountered.
2022-03-17 20:58:43 -07:00
Guyang Song
1ad019aac3
[C++ API][Doc] Add doc and error log to notice C++ API is not supported on Windows (#23272)
The C++ API is not supported on Windows at all for now.

2022-03-18 10:52:57 +08:00
Jiajun Yao
62a5404369
Collect more usage stats data (#23167) 2022-03-17 19:33:27 -07:00
Jiao
ea51017e52
[Ray DAG][Serve Pipeline] better error messages on .bind and .remote with tests (#23290) 2022-03-17 18:58:09 -07:00
shrekris-anyscale
1b30bfa972
[serve] Implement set_options (#23265) 2022-03-17 17:09:55 -07:00
Edward Oakes
04ab27dcbf
[serve] Fix ServeHandle JSON Serde (#23285) 2022-03-17 16:35:19 -07:00
Chris K. W
6416c65505
Revert "Revert "[Client] chunked get requests (#22455)"" (#23261)
* revert revertchunkedgets

* exit early if all chunks received, tighter exception handler for stream in proxy
2022-03-17 16:24:30 -07:00
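The "exit early if all chunks received" note in #23261 can be pictured with the sketch below. It is an assumption-laden illustration rather than the Ray Client implementation: the function `reassemble`, the `(chunk_id, data)` tuple layout, and the `total_chunks` parameter are all hypothetical.

```python
def reassemble(chunks, total_chunks):
    """Join (chunk_id, data) pairs, stopping once every chunk has arrived.

    Chunk ids are assumed to run from 0 to total_chunks - 1.
    """
    received = {}
    for chunk_id, data in chunks:
        received[chunk_id] = data
        if len(received) == total_chunks:
            break  # exit early: nothing left to wait for on the stream
    if len(received) != total_chunks:
        raise ValueError("stream ended before all chunks arrived")
    return b"".join(received[i] for i in range(total_chunks))


def stream():
    """Toy stream that would fail if read past the final chunk."""
    yield 0, b"he"
    yield 1, b"llo"
    raise RuntimeError("should not be reached")


payload = reassemble(stream(), total_chunks=2)
```

Because the loop breaks as soon as the last chunk is stored, the generator is never resumed and the trailing error is not triggered.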
Siyuan (Ryans) Zhuang
f74ad24901
Cleanup nits in code (#23112)
* cleanup code

* fix comments
2022-03-17 15:55:35 -07:00
Amog Kamsetty
d31d6bc9bb
[Docker] Add Train requirements to ray-ml docker image (#22645) 2022-03-17 15:07:32 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
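The lookup strategy described for #22749 (partition by sort key, then binary search) can be sketched in plain Python. This is a single-process toy under stated assumptions, not the `RandomAccessDataset` implementation: `SortedBlocks` and its `get` method are hypothetical, there are no worker actors, and `multiget` here only borrows the name from the commit message.

```python
import bisect


class SortedBlocks:
    """Hypothetical in-process stand-in for key-partitioned, sorted blocks."""

    def __init__(self, records, key, num_blocks):
        ordered = sorted(records, key=lambda r: r[key])
        size = max(1, -(-len(ordered) // num_blocks))  # ceiling division
        self._blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Per-block key arrays, plus each block's largest key for routing.
        self._keys = [[r[key] for r in block] for block in self._blocks]
        self._upper = [keys[-1] for keys in self._keys]

    def get(self, value):
        """Return the record whose key equals `value`, or None."""
        # First binary search: find the block that could hold the key.
        b = bisect.bisect_left(self._upper, value)
        if b == len(self._blocks):
            return None
        # Second binary search: find the row inside that block.
        i = bisect.bisect_left(self._keys[b], value)
        if i < len(self._keys[b]) and self._keys[b][i] == value:
            return self._blocks[b][i]
        return None

    def multiget(self, values):
        """Look up several keys in one call."""
        return [self.get(v) for v in values]


rows = [{"id": i, "value": i * i} for i in range(10)]
blocks = SortedBlocks(rows, key="id", num_blocks=3)
hit = blocks.get(5)
miss = blocks.get(42)
```

In this toy, each lookup costs two binary searches: one to route to a block and one inside it. In the design the commit describes, the blocks would instead be served by worker actors.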
Simon Mo
6cc0fee947
[Serve] Improve function deployment API (#23252) 2022-03-17 14:37:43 -07:00
mwtian
1d2d60a2fc
[GCS-Ray] remove Redis password from CLI messages (#23242)
The Redis password should not be needed in the connection info printed by `ray start --head`.
A follow-up cleanup could remove the flags and arguments related to the Redis password, but that is riskier (it affects external Redis) and needs more care.
2022-03-17 13:36:29 -07:00
Simon Mo
f400b4333a
[Serve] Remove legacy pipeline codebase (#23172) 2022-03-17 13:27:16 -07:00
Antoni Baum
1211c452d4
[ML/Train] TensorflowTrainer implementation (#23250)
Implements `TensorflowTrainer`. Depends on https://github.com/ray-project/ray/pull/23211 (review only files with `tensorflow` in the name).

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-17 11:34:47 -07:00
Siyuan (Ryans) Zhuang
0f61e2f90e
[Lint] Cleanup incorrectly formatted strings (Part 5: util) (#23264) 2022-03-17 10:27:05 -07:00
Antoni Baum
f71e7681b3
[ML] XGBoost&LightGBMTrainer implementation (#23245)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-17 10:00:03 -07:00
Dmitri Gekhtman
c707ad8d73
Fix GCP node termination (#23101)
Skips 404s on node termination for GCP node provider.
Also resets internal "self.nodes_to_terminate" state at the start of an autoscaler iteration -- that's necessary for correct cleanup in the event of failed node termination.
2022-03-17 09:51:16 -07:00
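The two changes described for #23101 can be illustrated with the toy below. It is a sketch under assumptions, not the GCP node provider or autoscaler code: `NotFoundError`, `ToyProvider`, and `ToyAutoscaler` are hypothetical, and only the attribute name `nodes_to_terminate` comes from the commit message.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for an HTTP 404 from the cloud API."""


class ToyProvider:
    def __init__(self, live_nodes):
        self.live_nodes = set(live_nodes)

    def _delete(self, node_id):
        # Simulated cloud call: deleting a missing node yields a 404.
        if node_id not in self.live_nodes:
            raise NotFoundError(node_id)
        self.live_nodes.remove(node_id)

    def terminate_node(self, node_id):
        """Return True if a node was deleted, False if it was already gone."""
        try:
            self._delete(node_id)
            return True
        except NotFoundError:
            return False  # skip the 404 instead of failing the whole pass


class ToyAutoscaler:
    def __init__(self, provider):
        self.provider = provider
        # Pretend a previous iteration failed midway and left state behind.
        self.nodes_to_terminate = ["stale-node"]

    def update(self, candidates):
        # Reset per-iteration state so leftovers are not carried over.
        self.nodes_to_terminate = list(candidates)
        return [n for n in self.nodes_to_terminate
                if self.provider.terminate_node(n)]


provider = ToyProvider({"a", "b"})
autoscaler = ToyAutoscaler(provider)
terminated = autoscaler.update(["a", "gone"])
```

Here the missing node `"gone"` is skipped quietly, and the stale entry from the earlier pass never reaches the provider.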
Amog Kamsetty
cf512254bb
[ml/train] Don't create new BackendExecutor actor in Trainable (#23235)
If using the DataParallelTrainer, since we are running the BackendExecutor in a Trainable actor already, we don't need to create a new actor.

However, if using Ray Train directly, we still want to run BackendExecutor in an actor for performance with Ray Client.

This PR does some refactoring to support both cases.
2022-03-17 08:31:43 -07:00
xwjiang2010
c12d437fb5
[tune] de-spam some logging. (#23247)
Demoting some logger calls to debug
2022-03-17 15:03:38 +00:00
Siyuan (Ryans) Zhuang
cb80518a80
[Lint] Cleanup incorrectly formatted strings (Part 4: tests, _private) (#23263) 2022-03-17 00:49:16 -07:00
Amog Kamsetty
ef0b85c344
[ml/train] TorchTrainer implementation (#23219) 2022-03-17 00:07:27 -07:00
Gagandeep Singh
c32649b85c
map and map_unordered cancel previous tasks before submitting new ones (#23187)
N.B. - https://github.com/ray-project/ray/issues/23107#issuecomment-1068107507
2022-03-16 23:45:44 -07:00
Siyuan (Ryans) Zhuang
cc1728120f
[Tune] Move resource updater out of trial executor (#23178)
* simplify trial executor

* update test

* fix: proper resource update before initialization

* add test to BUILD

* add doc for resource updater
2022-03-16 22:50:47 -07:00
xwjiang2010
814b49356c
[tuner] Tuner impl. (#22848) 2022-03-16 20:55:30 -07:00