Commit graph

6315 commits

Author SHA1 Message Date
Gagandeep Singh
c32649b85c
map and map_unordered cancel previous tasks before submitting new ones (#23187)
N.B. - https://github.com/ray-project/ray/issues/23107#issuecomment-1068107507
2022-03-16 23:45:44 -07:00
Siyuan (Ryans) Zhuang
cc1728120f
[Tune] Move resource updater out of trial executor (#23178)
* simplify trial executor

* update test

* fix: proper resource update before initialization

* add test to BUILD

* add doc for resource updater
2022-03-16 22:50:47 -07:00
xwjiang2010
814b49356c
[tuner] Tuner impl. (#22848) 2022-03-16 20:55:30 -07:00
Balaji Veeramani
83986a4d83
[Train] Add support for automatic mixed precision (#22227)
Closes #20643

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-19.us-west-2.compute.internal>
2022-03-16 20:53:02 -07:00
Amog Kamsetty
f33a495b3a
[ml/train] DataParallelTrainer implementation (#23211)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-03-16 19:49:44 -07:00
mwtian
391901f86b
[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233)
Continue to clean up Redis and other related Redis references, for
- gcs_utils.py
- log_monitor.py
- `publish_error_to_driver()`
2022-03-16 19:34:57 -07:00
SangBin Cho
b350fe9ee8
[Nightly test] Fix additional k8s issues + add new tests (#23231)
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are two hard blockers on the infra side: (1) large disks are not supported, and (2) m5.xlarge is not supported. Both are considered high priority and should be fixed soon.
2022-03-16 16:37:29 -07:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap is the case where the local Python module is packaged as a wheel (a `.whl` file). This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
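A minimal sketch of how the `py_modules` change (#22368) above might look from the user's side. The wheel path is hypothetical, and the `ray.init` call is left as a comment so the snippet runs without a cluster.

```python
# Hypothetical wheel path; replace it with a real local .whl file.
runtime_env = {
    "py_modules": ["dist/my_package-0.1.0-py3-none-any.whl"],
}

# With Ray installed, this dict would be passed at startup, for example:
#   import ray
#   ray.init(runtime_env=runtime_env)
```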
shrekris-anyscale
84b3de6825
[serve] Add atomic delete (#23195) 2022-03-16 14:13:10 -07:00
Jiao
2bcbe41d54
[Serve] Polish new deployment to DAG binding API with Ray DAG tests (#23208) 2022-03-16 12:59:19 -07:00
Siyuan (Ryans) Zhuang
6d83a3f283
[Lint] Cleanup incorrectly formatted strings (Part 3: components) (#23130) 2022-03-16 12:36:57 -07:00
Edward Oakes
d1a528d6af
[serve] Use deploy_group in serve run and set HTTP options (#23215) 2022-03-16 12:37:21 -05:00
shrekris-anyscale
56ddea85a1
[Serve] Fix typo language (#23213) 2022-03-16 10:14:44 -07:00
shrekris-anyscale
34ebb3409e
[serve] Make Dashboard start Serve in the "serve" namespace (#23198)
The Ray Dashboard starts Serve in the `"_ray_internal_dashboard"` namespace. However, Serve by default starts in the `"serve"` namespace. This causes surprising behavior when working with the Serve CLI and REST API.

This change makes the Ray Dashboard start Serve in the `"serve"` namespace, allowing the REST API to work intuitively with the Python API.
2022-03-16 12:03:44 -05:00
Kai Fricke
b80f79a072
[ci/multinode] Improve multi-node tests (#23196)
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
2022-03-16 09:59:50 +00:00
Siyuan (Ryans) Zhuang
d67c34256b
[Workflow] Optimize out tail recursion in python (#22794)
* add test

* warning when inplace subworkflows may use different resources
2022-03-16 01:51:18 -07:00
Gagandeep Singh
60a3340387
[workflow] Suggestions of correct inputs to create_storage in error message under windows (#23190)
* Provide suggestions of correct inputs to create_storage in error msg

* Applied linting format

* Added test for verifying error message
2022-03-16 01:42:12 -07:00
Siyuan (Ryans) Zhuang
7c43c66b6b
[workflow] Implement workflow continuation unification (#23217)
* implement workflow continuation unification

* fix comments

* fix: strict scope for workflow execution
2022-03-16 00:04:01 -07:00
mwtian
72ef9f91aa
[Remove Redis Pubsub 1/n] Remove enable_gcs_pubsub() (#23189)
GCS pubsub has been the default for a while. There is little chance that we would need to revert to Redis pubsub in the future. This is the first step in removing Redis pubsub: removing the `enable_gcs_pubsub()` feature guard.
2022-03-15 23:56:15 -07:00
Amog Kamsetty
2548083dcb
[ml] Trainer implementation (#22969)
Implementation for base Trainer

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-03-15 20:35:54 -07:00
Qing Wang
149d06442b
[Core][Java][Remove JVM FullGC 3/N] Disable every 10min FullGC. (#21443)
This PR disables the FullGC that the Java worker ran every 10 minutes, unless the GC is triggered by a global GC event. Specifically, it adds a `triggered_by_global_gc` flag that indicates whether a GC event was triggered by a global GC event. If it was, FullGC still runs.

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-03-16 11:18:12 +08:00
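The decision described in #21443 above can be sketched as follows. This is an illustration in Python, not the Java worker's code; only the `triggered_by_global_gc` flag comes from the commit message, and the class and method names are hypothetical.

```python
class GcPolicySketch:
    """Hypothetical stand-in for the worker-side handling of GC requests."""

    def __init__(self):
        self.full_gc_count = 0

    def on_gc_request(self, triggered_by_global_gc: bool) -> bool:
        """Run a full GC only when the request comes from a global GC event."""
        if not triggered_by_global_gc:
            # Periodic (every-10-minutes) requests are skipped.
            return False
        self.full_gc_count += 1
        return True
```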
Guyang Song
30ae287dac
enable test_runtime_env_working_dir_3.py and fix cache size to be negative (#23183) 2022-03-16 11:00:48 +08:00
qicosmos
d8de5a445a
[C++ Worker]Python call cpp actor (#23061)
The [last PR](https://github.com/ray-project/ray/pull/22820) added support for calling C++ normal tasks from Python; this PR adds support for calling C++ actor tasks from Python.
2022-03-15 19:54:10 -07:00
Edward Oakes
42ebc0a4f6
[serve] Add some test cases for pipeline DAG builder (#23210) 2022-03-15 21:05:12 -05:00
Siyuan (Ryans) Zhuang
499c242f0f
[workflow] More tests for unifying workflow and remote function ObjectRef behavior (#23174)
* add more tests
2022-03-15 16:42:27 -07:00
Antoni Baum
630985e3bb
[ML] XGBoost&LightGBMTrainer interfaces (#23192)
Adds interfaces for `XGBoostTrainer` and `LightGBMTrainer`.
2022-03-15 16:16:30 -07:00
Simon Mo
823dbd06a8
[Serve] Add DeploymentNode implementation on top of existing DAG codebase (#23177) 2022-03-15 16:06:57 -07:00
shrekris-anyscale
57871816d4
[serve] Fix TestGetDeploymentImportPath on Windows (#23201) 2022-03-15 15:48:48 -07:00
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00
siddgoel
0722cbb37e
Add support for snappy text decompression #22298 (#22486)
Adds a streaming based reading option for Snappy-compressed files. Arrow doesn't support streaming Snappy decompression since the canonical C++ Snappy library doesn't natively support streaming decompression. This PR works around this by doing streaming reads of snappy-compressed files using the streaming decompression API provided in the [python-snappy](https://github.com/andrix/python-snappy) package.

This commit supplies a custom datasource that uses Arrow + [python-snappy](https://github.com/andrix/python-snappy) to read and decompress Snappy-compressed files.

Co-authored-by: siddharth.goel <siddharth.goel@bytedance.com>
Co-authored-by: Chen Shen <scv119@gmail.com>
2022-03-15 13:52:22 -07:00
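The streaming-read idea in #22486 above can be sketched with a decompressor object that is fed one chunk at a time. Note the swap: this sketch uses `zlib` from the standard library as a stand-in so it runs anywhere, whereas the PR itself relies on python-snappy. The function name and chunk size are assumptions.

```python
import io
import zlib


def stream_decompress(src, chunk_size=64 * 1024):
    """Yield decompressed bytes from a binary file object, chunk by chunk."""
    # zlib stands in for Snappy here; the pattern is the same:
    # keep one stateful decompressor and feed it bounded reads.
    decompressor = zlib.decompressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Usage: decompress an in-memory payload without loading it all at once.
payload = b"ray " * 10000
compressed = io.BytesIO(zlib.compress(payload))
restored = b"".join(stream_decompress(compressed, chunk_size=128))
```

Only one compressed chunk is held in memory at a time, which is the point of a streaming read.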
Amog Kamsetty
1572130a4e
[ml/train] Trainer interfaces [4/4]: TorchTrainer interface (#22989)
Interface for TorchTrainer

Depends on #22988
2022-03-15 12:47:44 -07:00
Antoni Baum
a8fbb4accc
[ML] XGBoost&LightGBMPredictor implementation (#23143)
Implementation for XGBoostPredictor & LightGBMPredictor.

The interface has been modified slightly.
2022-03-15 12:44:50 -07:00
Clark Zinzow
1d5f18fe0a
Fix equalized split handling of num_splits == num_blocks case. (#23191) 2022-03-15 12:23:50 -07:00
Yi Cheng
72713e815b
[gcs] Remove use_gcs_for_bootstrap in other python modules. 2022-03-15 12:23:10 -07:00
Siyuan (Ryans) Zhuang
761f927720
[Lint] Cleanup incorrectly formatted strings (Part 2: Tune) (#23129) 2022-03-15 12:17:47 -07:00
Archit Kulkarni
fc182006ec
[Doc] Add missing runtime context namespace doc (#23120)
The public field RuntimeContext.namespace didn't have a docstring so it wasn't showing up at all in the docs. This PR adds a basic docstring.
2022-03-15 11:46:09 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows users to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
2022-03-15 11:07:34 -07:00
xwjiang2010
99d5288bbd
[tune] Better error msg for grpc resource exhausted error. (#22806) 2022-03-15 16:01:40 +00:00
shrekris-anyscale
bf1bd293f4
[serve] Make deployments in Application use only import paths (#23027)
`Application` stores a group of deployments and can write them to a YAML config. However, this requires the deployments to use import paths as their `func_or_class`. This change makes all deployments in an `Application` store only import paths as the `func_or_class`.

This change also adds a utility function to get a deployment's import path. This utility function is used in the DeploymentNode for Pipelines.
2022-03-15 10:48:35 -05:00
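To illustrate what "import path" means in #23027 above, here is a minimal sketch that derives a dotted path from a function or class and resolves it back. The helper names are hypothetical and are not Serve's utility; the sketch also assumes a module-level object, since nested or locally defined objects cannot be resolved this way.

```python
import importlib
import json


def get_import_path(func_or_class) -> str:
    """Build a dotted 'module.QualName' path for a module-level object."""
    return f"{func_or_class.__module__}.{func_or_class.__qualname__}"


def import_attr(path: str):
    """Resolve a dotted path back to the object it names."""
    module_name, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Usage: a path is a plain string, so it can be written to a YAML config.
path = get_import_path(json.dumps)
resolved = import_attr(path)
```

Storing the string instead of the object is what makes the deployment serializable to a config file.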
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Guyang Song
f65971756d
[dashboard agent] Catch agent port conflict (#23024) 2022-03-15 16:09:15 +08:00
Kai Yang
35c7275bfc
[Object Spilling] Handle IO worker failures correctly (#20752)
Currently, when a spill/restore worker fails while it is idle in the worker pool, the worker pool does not clean up the worker's metadata. Subsequent spill/restore requests reuse this dead worker, so their RPC requests cannot succeed. This results in broken object spilling functionality.

This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.
2022-03-15 12:14:14 +08:00
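The fix in #20752 above can be sketched as a pool that drops a worker from both collections on disconnect. This is a Python illustration of the idea, not the raylet's C++ code; only the names `registered_io_workers` and `idle_io_workers` come from the commit message, and everything else is hypothetical.

```python
class IOWorkerPoolSketch:
    """Hypothetical pool that tracks spill/restore (IO) workers."""

    def __init__(self):
        self.registered_io_workers = set()
        self.idle_io_workers = set()

    def register(self, worker_id):
        """A newly registered worker starts out idle."""
        self.registered_io_workers.add(worker_id)
        self.idle_io_workers.add(worker_id)

    def pop_idle(self):
        """Hand out an idle worker, or None if there is none."""
        return self.idle_io_workers.pop() if self.idle_io_workers else None

    def on_disconnect(self, worker_id):
        """Remove the dead worker everywhere so it is never reused."""
        self.registered_io_workers.discard(worker_id)
        self.idle_io_workers.discard(worker_id)
```

Without the `on_disconnect` cleanup, `pop_idle` could return a worker that no longer exists, which is the failure the commit describes.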
Jules S. Damji
0246f3532e
[DOC] Added a full example how to access deployments (#22401) 2022-03-14 21:15:52 -05:00
Antoni Baum
447a98eed1
[ML] TensorflowPredictor implementation (#23146)
Implementation for TensorflowPredictor.
2022-03-14 17:02:21 -07:00
Archit Kulkarni
5ecd88e2e0
[runtime env] Keep existing PYTHONPATH when using runtime env (#23144) 2022-03-14 18:59:50 -05:00
Stephanie Wang
7235541393
[Datasets] Use multithreading to submit DatasetPipeline stages (#22912)
Previously DatasetPipeline stages were executed by one actor each, which compromised fault tolerance through lineage reconstruction. This centralizes all task submission at the pipeline coordinator to improve fault tolerance. To preserve pipeline parallelism, the stages are executed by a threadpool. To clean up the threadpool, the pipeline coordinator adds any running threads to a global set that is checked by the threads during `ray.wait`.

Note that this will only provide fault tolerance for split pipes if all pipeline consumers stay alive. It will not work if one of the consumers dies and restarts because next_dataset_if_ready is not idempotent.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-03-14 16:57:02 -07:00
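The thread cleanup described in #22912 above can be sketched as follows. This is an assumption-heavy illustration: a timed `Event.wait` stands in for `ray.wait`, and all names are hypothetical. The coordinator adds each stage thread to a global set, and each thread checks that set between waits and exits once it has been removed.

```python
import threading

_active_threads = set()   # stage threads that are allowed to keep running
_lock = threading.Lock()


def _run_stage(stage_id, done_event, results):
    me = threading.current_thread()
    while True:
        with _lock:
            if me not in _active_threads:
                results[stage_id] = "cancelled"
                return
        # Stand-in for a `ray.wait` call with a timeout.
        if done_event.wait(timeout=0.01):
            results[stage_id] = "finished"
            with _lock:
                _active_threads.discard(me)
            return


def submit_stage(stage_id, done_event, results):
    """Coordinator side: register the thread before starting it."""
    thread = threading.Thread(
        target=_run_stage, args=(stage_id, done_event, results)
    )
    with _lock:
        _active_threads.add(thread)
    thread.start()
    return thread


def shutdown_stages():
    """Clearing the set tells every remaining stage thread to exit."""
    with _lock:
        _active_threads.clear()
```

Registering the thread before `start()` avoids a race in which shutdown runs before the thread has added itself.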
Edward Oakes
f646d3fc31
[serve] Add unimplemented interfaces for Deployment DAG APIs (#23125)
Adds the following interfaces (without implementation, for discussion / approval):
- `serve.Application`
- `serve.DeploymentNode`
- `serve.DeploymentMethodNode`, `serve.DAGHandle`, and `serve.drivers.PipelineDriver`
- `serve.run` & `serve.build`

In addition to these Python APIs, we will also support the following CLI commands:
- `serve run [--blocking=true] my_file:my_node_or_app # Uses Ray client, blocking by default.`
- `serve build my_file:my_node output_path.yaml`
- `serve deploy [--blocking=false] # Uses REST API, non-blocking by default.`
- `serve status [--watch=false] # Uses REST API, non-blocking by default.`
2022-03-14 18:53:08 -05:00
Amog Kamsetty
154edce2a4
[ml] Don't require preprocessor in TorchPredictor (#23163) 2022-03-14 16:33:22 -07:00
Antoni Baum
6a1e336b24
[tune] Add CV support for XGB/LGBM Tune callbacks (#22882)
Lets users specify a custom results post-processing function that is applied to metrics before they are reported to Tune in the XGBoost/LightGBM integration callbacks, which makes it possible to support xgb.cv/lgbm.cv. Updates the example to show it in action and in CI.
2022-03-14 21:00:39 +00:00
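A post-processing function of the kind described in #22882 above might look like the sketch below. It assumes the callback hands over a dict that maps each metric name to a list of per-fold values; that input shape and the function name are assumptions, not taken from the commit.

```python
from statistics import mean


def average_cv_folds(results):
    """Collapse per-fold metric lists into one mean value per metric."""
    return {name: mean(values) for name, values in results.items()}


# Usage with made-up fold values for a single metric.
reported = average_cv_folds({"test-logloss": [0.5, 0.25, 0.75]})
```

Reducing the folds to one number per metric gives Tune a single scalar to compare across trials.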
Edward Oakes
5d501e3b28
[serve] Polish help info on the CLI (#23026)
Closes https://github.com/ray-project/ray/issues/23015
2022-03-14 12:38:17 -05:00