hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Edward Oakes	f22a34bd4f	Restore "[Serve] Implement Default DAGDriver (#23301)" (#23373)	2022-03-21 10:35:00 -07:00
Kai Fricke	b64452bc63	[tune] Add multinode sync test (#23229) This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.	2022-03-21 17:02:17 +00:00
Guyang Song	69af9764b2	[runtime env] URI reference refactor (#22828) - Move the URI reference logic from raylet to agent. - Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`. - More details: https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528 Future work: - We don't remove the `RuntimeEnvUris` from the `RuntimeEnv` protobuf in the current PR because GCS also uses those URIs to do GC by runtime_env_manager. We should also clear this. - Ray client server shouldn't interact with the agent directly, or Ray client server should also decrease the reference count. - Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job, so we should make sure this function is idempotent. Can we change this logic so that this function is called only once?	2022-03-21 11:21:15 -05:00
Stephanie Wang	e507aa5758	Revert "[Serve] Implement Default DAGDriver (#23301)" (#23358) This reverts commit `91a1c3411f`.	2022-03-21 10:54:52 -05:00
Larry	81dcf9ff35	[Placement Group] Make PlacementGroupID generate from JobID (#23175)	2022-03-21 17:09:16 +08:00
Avnish Narayan	e008a48ef2	[release tests] Pin gym everywhere (#23349)	2022-03-19 02:52:54 -07:00
Philipp Moritz	886cc4d674	Fix broken links in documentation and put linkcheck linter in place on CI (#23340)	2022-03-18 21:02:52 -07:00
Simon Mo	91a1c3411f	[Serve] Implement Default DAGDriver (#23301)	2022-03-18 18:07:39 -07:00
Siyuan (Ryans) Zhuang	65cc877ad8	[workflow] Ensure that DAGs are dereferenced like ObjectRefs in Ray tasks (#23320)	2022-03-18 17:02:15 -07:00
Jiao	9b38b6de47	[Serve] [Pipeline] Default all DeploymentNode route_prefix to None, and "/" for the root driver (#23289)	2022-03-18 16:56:49 -07:00
shrekris-anyscale	c668039020	[serve] Restore "Get new handle to controller if killed" (#23283) (#23338) #23336 reverted #23283. #23283 did pass CI before merging. However, once merged, it began to fail because `test_cli.py` used commands that were outdated on the master branch (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its test commands.	2022-03-18 18:40:08 -05:00
Jiao	49e0ab2f58	[Serve] [Pipeline] Use ServeSchema for deployments to prevent config from being overridden (#23324)	2022-03-18 15:25:32 -07:00
mwtian	909cdea3cd	[Python Worker] add feature flag to support forking from workers (#23260) Make sure Python dependencies can be imported on demand, without the background importer thread. Use cases are: (1) if the pubsub notification for a new export is lost, importing can still be done; (2) the background importer thread can be left off without affecting Ray's functionality. Add a feature flag to support forking from Python workers by (1) enabling fork support in gRPC and (2) disabling the importer thread, leaving only the main thread in the Python worker. The importer thread will not run after forking anyway.	2022-03-18 14:47:18 -07:00
Junwen Yao	8fff665455	[Train] Add torch data prefetch benchmark example (#22974) Add a benchmark example for the auto pipeline functionality for host to device data transfer.	2022-03-18 13:27:26 -07:00
Eric Liang	c4b52d34ca	Initial PR for internal storage API (#22889)	2022-03-18 12:32:40 -07:00
shrekris-anyscale	87e77bebb4	Revert "[serve] Get new handle to controller if killed (#23283)" (#23336) This reverts commit `9f6d96a2fd`.	2022-03-18 13:47:57 -05:00
Jialing He	4a83bc3dc2	[runtime env] Support setting a timeout for runtime env setup (#23082) Interface example: `@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10)))` or `@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})`, each decorating `def f(): pass`. Supports setting a timeout, in seconds, for runtime environment creation. Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>	2022-03-18 12:52:59 -05:00
Archit Kulkarni	76bb5396c7	[Doc] [jobs] Add links to Job Submission and improve doc (#23209) - Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested. - Adds docstrings for the Jobs SDK, which automatically show up in the API reference. - Improves the Job Submission main page. - Adds a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-18 12:52:13 -05:00
Archit Kulkarni	16fd099b8b	[runtime env] Change `pip_check` default from `True` to `False` (#23306) @SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more I think a more reasonable default is for `pip check` to be `False` by default. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up. Another less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR. This also matches the default behavior of pip which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR")	2022-03-18 12:51:41 -05:00
shrekris-anyscale	9f6d96a2fd	[serve] Get new handle to controller if killed (#23283) `serve shutdown` is not idempotent with the new Serve CLI. When serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, which fail. This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.	2022-03-18 11:47:18 -05:00
shrekris-anyscale	aaf47b2493	[serve] Implement `serve.run()` and `Application` (#23157) These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-18 11:12:09 -05:00
Kai Fricke	3836333aac	[ml/air] Checkpoints serialization/deserialization support (#23275) This PR adds support for checkpoint ser/de. In particular this is special casing the local data representation, which will be converted into a bytes checkpoint on serialization. This way checkpoint objects sent to remote tasks are guaranteed to always point to a valid data location within the remote task. We are not detecting pickling to/from disk (e.g. to pickle files) for now.	2022-03-18 13:10:37 +00:00
Amog Kamsetty	bb4ff42eec	[ml] `TorchTrainer` bug fixes + GPU test (#23293) Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-03-17 23:49:42 -07:00
Amog Kamsetty	0f9233fc01	[ml] Switch from `tune.run` to `Tuner.fit` (#23282) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-03-17 23:48:38 -07:00
Jiao	e02577adb7	[Pipeline] Add and use RayServeLazyHandle for DAG deployment args (#23256)	2022-03-17 22:58:31 -07:00
matthewdeng	2298bcc3f9	[ml] raise error when serializing Predictor (#23267)	2022-03-17 21:11:34 -07:00
Andrew Li	1a293a1187	Providing additional useful messages for JSONDecodeError (#23116) Following #22535, this adds further useful information to the message shown when a JSONDecodeError is encountered.	2022-03-17 20:58:43 -07:00
Guyang Song	1ad019aac3	[C++ API][Doc] Add doc and error log to notice C++ API is not supported on Windows (#23272) We don't support Windows entirely now. ## Checks - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :(	2022-03-18 10:52:57 +08:00
Jiajun Yao	62a5404369	Collect more usage stats data (#23167)	2022-03-17 19:33:27 -07:00
Jiao	ea51017e52	[Ray DAG][Serve Pipeline] better error messages on .bind and .remote with tests (#23290)	2022-03-17 18:58:09 -07:00
shrekris-anyscale	1b30bfa972	[serve] Implement set_options (#23265)	2022-03-17 17:09:55 -07:00
Edward Oakes	04ab27dcbf	[serve] Fix ServeHandle JSON Serde (#23285)	2022-03-17 16:35:19 -07:00
Chris K. W	6416c65505	Revert "Revert "[Client] chunked get requests (#22455)"" (#23261) * revert revertchunkedgets * exit early if all chunks received, tighter exception handler for stream in proxy	2022-03-17 16:24:30 -07:00
Siyuan (Ryans) Zhuang	f74ad24901	Cleanup nits in code (#23112) * cleanup code * fix comments	2022-03-17 15:55:35 -07:00
Amog Kamsetty	d31d6bc9bb	[Docker] Add Train requirements to ray-ml docker image (#22645)	2022-03-17 15:07:32 -07:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
Simon Mo	6cc0fee947	[Serve] Improve function deployment API (#23252)	2022-03-17 14:37:43 -07:00
mwtian	1d2d60a2fc	[GCS-Ray] remove Redis password from CLI messages (#23242) Redis password should not be needed in the connection info printed by `ray start --head`. We can make another cleanup for removing flags and arguments related to Redis password. But it is a bit more risky (affects external Redis) and needs more care.	2022-03-17 13:36:29 -07:00
Simon Mo	f400b4333a	[Serve] Remove legacy pipeline codebase (#23172)	2022-03-17 13:27:16 -07:00
Antoni Baum	1211c452d4	[ML/Train] `TensorflowTrainer` implementation (#23250) Implements `TensorflowTrainer`. Depends on https://github.com/ray-project/ray/pull/23211 (review only files with `tensorflow` in the name). Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2022-03-17 11:34:47 -07:00
Siyuan (Ryans) Zhuang	0f61e2f90e	[Lint] Cleanup incorrectly formatted strings (Part 5: util) (#23264)	2022-03-17 10:27:05 -07:00
Antoni Baum	f71e7681b3	[ML] `XGBoost`&`LightGBMTrainer` implementation (#23245) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2022-03-17 10:00:03 -07:00
Dmitri Gekhtman	c707ad8d73	Fix GCP node termination (#23101) Skips 404s on node termination for GCP node provider. Also resets internal "self.nodes_to_terminate" state at the start of an autoscaler iteration -- that's necessary for correct cleanup in the event of failed node termination.	2022-03-17 09:51:16 -07:00
Amog Kamsetty	cf512254bb	[ml/train] Don't create new `BackendExecutor` actor in `Trainable` (#23235) If using the DataParallelTrainer, since we are running the BackendExecutor in a Trainable actor already, we don't need to create a new actor. However if using Ray Train directly, we still want to run BackendExecutor in an actor for performance with Ray Client. This PR does some refactoring to support both cases.	2022-03-17 08:31:43 -07:00
xwjiang2010	c12d437fb5	[tune] de-spam some logging. (#23247) Demoting some logger calls to debug	2022-03-17 15:03:38 +00:00
Siyuan (Ryans) Zhuang	cb80518a80	[Lint] Cleanup incorrectly formatted strings (Part 4: tests, _private) (#23263)	2022-03-17 00:49:16 -07:00
Amog Kamsetty	ef0b85c344	[ml/train] `TorchTrainer` implementation (#23219)	2022-03-17 00:07:27 -07:00
Gagandeep Singh	c32649b85c	`map` and `map_unordered` cancel previous tasks before submitting new ones (#23187) N.B. - https://github.com/ray-project/ray/issues/23107#issuecomment-1068107507	2022-03-16 23:45:44 -07:00
Siyuan (Ryans) Zhuang	cc1728120f	[Tune] Move resource updater out of trial executor (#23178) * simplify trial executor * update test * fix: proper resource update before initialization * add test to BUILD * add doc for resource updater	2022-03-16 22:50:47 -07:00
xwjiang2010	814b49356c	[tuner] Tuner impl. (#22848)	2022-03-16 20:55:30 -07:00

6562 commits