Commit graph

6536 commits

Author SHA1 Message Date
Andrew Li
1a293a1187
Providing additional useful messages for JSONDecodeError (#23116)
Following #22535, this adds more useful information to the error raised when a JSONDecodeError is encountered.
2022-03-17 20:58:43 -07:00
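
A minimal sketch of the idea behind the JSONDecodeError change above (#23116), not the code from that PR: catch the decode error and re-raise it with context about where the input came from. The function name `load_json_with_context` and the `source` label are hypothetical, chosen only for illustration.

```python
import json


def load_json_with_context(text: str, source: str) -> object:
    """Parse JSON, re-raising decode errors with extra context.

    `source` is a hypothetical label (file name, endpoint, ...) for the input.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # JSONDecodeError exposes the message, the document, and the position.
        snippet = err.doc[max(0, err.pos - 20): err.pos + 20]
        raise ValueError(
            f"Failed to decode JSON from {source}: {err.msg} "
            f"(line {err.lineno}, column {err.colno}); near {snippet!r}"
        ) from err
```

Chaining with `from err` keeps the original traceback available to callers who need it.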
Guyang Song
1ad019aac3
[C++ API][Doc] Add doc and error log to notice C++ API is not supported on Windows (#23272)
The C++ API is not yet supported on Windows.

2022-03-18 10:52:57 +08:00
Jiajun Yao
62a5404369
Collect more usage stats data (#23167) 2022-03-17 19:33:27 -07:00
Jiao
ea51017e52
[Ray DAG][Serve Pipeline] better error messages on .bind and .remote with tests (#23290) 2022-03-17 18:58:09 -07:00
shrekris-anyscale
1b30bfa972
[serve] Implement set_options (#23265) 2022-03-17 17:09:55 -07:00
Edward Oakes
04ab27dcbf
[serve] Fix ServeHandle JSON Serde (#23285) 2022-03-17 16:35:19 -07:00
Chris K. W
6416c65505
Revert "Revert "[Client] chunked get requests (#22455)"" (#23261)
* revert revertchunkedgets

* exit early if all chunks received, tighter exception handler for stream in proxy
2022-03-17 16:24:30 -07:00
Siyuan (Ryans) Zhuang
f74ad24901
Cleanup nits in code (#23112)
* cleanup code

* fix comments
2022-03-17 15:55:35 -07:00
Amog Kamsetty
d31d6bc9bb
[Docker] Add Train requirements to ray-ml docker image (#22645) 2022-03-17 15:07:32 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`.

Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
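
The partition-by-sort-key and binary-search approach described in the random access commit above (#22749) can be illustrated with a small, single-process sketch. This is not Ray's implementation: the class `SortedPartitionIndex` and its methods are hypothetical, and plain Python lists stand in for the sorted data blocks held by worker actors.

```python
from bisect import bisect_left


class SortedPartitionIndex:
    """Hypothetical in-memory stand-in for a key-partitioned, sorted dataset."""

    def __init__(self, records, key, num_partitions):
        ordered = sorted(records, key=lambda r: r[key])
        # Ceiling division so that at most `num_partitions` blocks are created.
        size = max(1, -(-len(ordered) // num_partitions))
        self._blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Cache the sorted keys of each block and each block's largest key.
        self._block_keys = [[r[key] for r in block] for block in self._blocks]
        self._block_max = [keys[-1] for keys in self._block_keys]

    def get(self, value):
        # First binary search: find the block whose key range can hold `value`.
        b = bisect_left(self._block_max, value)
        if b == len(self._blocks):
            return None
        # Second binary search: find the record inside that block.
        keys = self._block_keys[b]
        i = bisect_left(keys, value)
        if i < len(keys) and keys[i] == value:
            return self._blocks[b][i]
        return None

    def multiget(self, values):
        return [self.get(v) for v in values]
```

Each lookup costs two binary searches, one over block boundaries and one inside the chosen block, which is why spreading blocks across more workers can raise total throughput.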
Simon Mo
6cc0fee947
[Serve] Improve function deployment API (#23252) 2022-03-17 14:37:43 -07:00
mwtian
1d2d60a2fc
[GCS-Ray] remove Redis password from CLI messages (#23242)
The Redis password should not be needed in the connection info printed by `ray start --head`.
A follow-up cleanup could remove the flags and arguments related to the Redis password, but that is riskier (it affects external Redis) and needs more care.
2022-03-17 13:36:29 -07:00
Simon Mo
f400b4333a
[Serve] Remove legacy pipeline codebase (#23172) 2022-03-17 13:27:16 -07:00
Antoni Baum
1211c452d4
[ML/Train] TensorflowTrainer implementation (#23250)
Implements `TensorflowTrainer`. Depends on https://github.com/ray-project/ray/pull/23211 (review only files with `tensorflow` in the name).

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-17 11:34:47 -07:00
Siyuan (Ryans) Zhuang
0f61e2f90e
[Lint] Cleanup incorrectly formatted strings (Part 5: util) (#23264) 2022-03-17 10:27:05 -07:00
Antoni Baum
f71e7681b3
[ML] XGBoost&LightGBMTrainer implementation (#23245)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-17 10:00:03 -07:00
Dmitri Gekhtman
c707ad8d73
Fix GCP node termination (#23101)
Skips 404s on node termination for GCP node provider.
Also resets internal "self.nodes_to_terminate" state at the start of an autoscaler iteration -- that's necessary for correct cleanup in the event of failed node termination.
2022-03-17 09:51:16 -07:00
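
A rough sketch of the two behaviors described in the GCP node termination commit above (#23101), under stated assumptions: `NotFoundError`, `terminate_nodes`, and `AutoscalerLoop` are hypothetical names, and `delete_instance` is a stand-in for whatever cloud API call removes a node. It is not the autoscaler's actual code.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for an HTTP 404 returned by a cloud API."""


def terminate_nodes(node_ids, delete_instance):
    """Terminate nodes, treating 'not found' as already terminated."""
    terminated = []
    for node_id in node_ids:
        try:
            delete_instance(node_id)
        except NotFoundError:
            # The instance is already gone, so there is nothing left to do.
            pass
        terminated.append(node_id)
    return terminated


class AutoscalerLoop:
    """Hypothetical loop that tracks nodes scheduled for termination."""

    def __init__(self, delete_instance):
        self._delete_instance = delete_instance
        self.nodes_to_terminate = []

    def update(self, idle_nodes):
        # Reset at the start of every iteration so that a failure in an
        # earlier iteration cannot leave stale entries behind.
        self.nodes_to_terminate = []
        self.nodes_to_terminate.extend(idle_nodes)
        return terminate_nodes(self.nodes_to_terminate, self._delete_instance)
```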
Amog Kamsetty
cf512254bb
[ml/train] Don't create new BackendExecutor actor in Trainable (#23235)
When using the DataParallelTrainer, the BackendExecutor already runs inside a Trainable actor, so we don't need to create a new actor.

However, when using Ray Train directly, we still want to run the BackendExecutor in an actor for performance with Ray Client.

This PR does some refactoring to support both cases.
2022-03-17 08:31:43 -07:00
xwjiang2010
c12d437fb5
[tune] de-spam some logging. (#23247)
Demoting some logger calls to debug
2022-03-17 15:03:38 +00:00
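
For readers unfamiliar with the effect of the logging change above (#23247), here is a small illustration using only the standard `logging` module. The logger name `tune_example` and the function `report_progress` are hypothetical and do not come from the Tune codebase.

```python
import logging

logger = logging.getLogger("tune_example")  # hypothetical logger name


def report_progress(step: int) -> None:
    # Emitted at DEBUG instead of INFO, so the message is hidden unless
    # the logger's level is lowered to DEBUG.
    logger.debug("Processed step %d", step)
```

With the logger at `INFO`, calls to `report_progress` produce no output; setting the level to `DEBUG` makes them visible again.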
Siyuan (Ryans) Zhuang
cb80518a80
[Lint] Cleanup incorrectly formatted strings (Part 4: tests, _private) (#23263) 2022-03-17 00:49:16 -07:00
Amog Kamsetty
ef0b85c344
[ml/train] TorchTrainer implementation (#23219) 2022-03-17 00:07:27 -07:00
Gagandeep Singh
c32649b85c
map and map_unordered cancel previous tasks before submitting new ones (#23187)
N.B. - https://github.com/ray-project/ray/issues/23107#issuecomment-1068107507
2022-03-16 23:45:44 -07:00
Siyuan (Ryans) Zhuang
cc1728120f
[Tune] Move resource updater out of trial executor (#23178)
* simplify trial executor

* update test

* fix: proper resource update before initialization

* add test to BUILD

* add doc for resource updater
2022-03-16 22:50:47 -07:00
xwjiang2010
814b49356c
[tuner] Tuner impl. (#22848) 2022-03-16 20:55:30 -07:00
Balaji Veeramani
83986a4d83
[Train] Add support for automatic mixed precision (#22227)
Closes #20643

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-19.us-west-2.compute.internal>
2022-03-16 20:53:02 -07:00
Amog Kamsetty
f33a495b3a
[ml/train] DataParallelTrainer implementation (#23211)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-03-16 19:49:44 -07:00
mwtian
391901f86b
[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233)
Continue to clean up Redis and other related Redis references, for
- gcs_utils.py
- log_monitor.py
- `publish_error_to_driver()`
2022-03-16 19:34:57 -07:00
SangBin Cho
b350fe9ee8
[Nightly test] Fix additional k8s issues + add new tests (#23231)
Fix a bug introduced by the previous fixes.
Add more tests.
Stop using m5.xlarge (not currently supported).
There are two hard blockers from the infra: 1. Large disks are not supported. 2. m5.xlarge is not supported. Both are considered high priority and should be fixed soon.
2022-03-16 16:37:29 -07:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap is the case where the local Python module is a wheel (a `.whl` file). This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
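
As an illustration of the `py_modules` change above (#22368), a runtime environment that points at a local wheel might look like the following. The wheel path is hypothetical, and the snippet only builds the configuration dict; it does not start Ray.

```python
# Hypothetical wheel path, used only for illustration.
runtime_env = {"py_modules": ["dist/my_module-0.1.0-py3-none-any.whl"]}

# With Ray installed, the dict would typically be passed at startup, e.g.:
#   import ray
#   ray.init(runtime_env=runtime_env)
```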
shrekris-anyscale
84b3de6825
[serve] Add atomic delete (#23195) 2022-03-16 14:13:10 -07:00
Jiao
2bcbe41d54
[Serve] Polish new deployment to DAG binding API with Ray DAG tests (#23208) 2022-03-16 12:59:19 -07:00
Siyuan (Ryans) Zhuang
6d83a3f283
[Lint] Cleanup incorrectly formatted strings (Part 3: components) (#23130) 2022-03-16 12:36:57 -07:00
Edward Oakes
d1a528d6af
[serve] Use deploy_group in serve run and set HTTP options (#23215) 2022-03-16 12:37:21 -05:00
shrekris-anyscale
56ddea85a1
[Serve] Fix typo language (#23213) 2022-03-16 10:14:44 -07:00
shrekris-anyscale
34ebb3409e
[serve] Make Dashboard start Serve in the "serve" namespace (#23198)
The Ray Dashboard starts Serve in the `"_ray_internal_dashboard"` namespace. However, Serve by default starts in the `"serve"` namespace. This causes surprising behavior when working with the Serve CLI and REST API.

This change makes the Ray Dashboard start Serve in the `"serve"` namespace, allowing the REST API to work intuitively with the Python API.
2022-03-16 12:03:44 -05:00
Kai Fricke
b80f79a072
[ci/multinode] Improve multi-node tests (#23196)
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
2022-03-16 09:59:50 +00:00
Siyuan (Ryans) Zhuang
d67c34256b
[Workflow] Optimize out tail recursion in python (#22794)
* add test

* warning when inplace subworkflows may use different resources
2022-03-16 01:51:18 -07:00
Gagandeep Singh
60a3340387
[workflow] Suggestions of correct inputs to create_storage in error message under windows (#23190)
* Provide suggestions of correct inputs to create_storage in error msg

* Applied linting format

* Added test for verifying error message
2022-03-16 01:42:12 -07:00
Siyuan (Ryans) Zhuang
7c43c66b6b
[workflow] Implement workflow continuation unification (#23217)
* implement workflow continuation unification

* fix comments

* fix: strict scope for workflow execution
2022-03-16 00:04:01 -07:00
mwtian
72ef9f91aa
[Remove Redis Pubsub 1/n] Remove enable_gcs_pubsub() (#23189)
GCS pubsub has been the default for a while. There is little chance that we would need to revert to Redis pubsub in the future. This is the first step in removing Redis pubsub: removing the `enable_gcs_pubsub()` feature guard.
2022-03-15 23:56:15 -07:00
Amog Kamsetty
2548083dcb
[ml] Trainer implementation (#22969)
Implementation for base Trainer

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-03-15 20:35:54 -07:00
Qing Wang
149d06442b
[Core][Java][Remove JVM FullGC 3/N] Disable every 10min FullGC. (#21443)
This PR disables the FullGC that ran every 10 minutes in the Java worker when it is not triggered by a global GC event. Specifically, we added a `triggered_by_global_gc` flag to indicate whether the GC event was triggered by a global GC event. If it was, we still need to do a FullGC.

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-03-16 11:18:12 +08:00
Guyang Song
30ae287dac
enable test_runtime_env_working_dir_3.py and fix cache size to be negative (#23183) 2022-03-16 11:00:48 +08:00
qicosmos
d8de5a445a
[C++ Worker]Python call cpp actor (#23061)
[The previous PR](https://github.com/ray-project/ray/pull/22820) added support for calling C++ normal tasks from Python; this PR adds support for calling C++ actor tasks from Python.
2022-03-15 19:54:10 -07:00
Edward Oakes
42ebc0a4f6
[serve] Add some test cases for pipeline DAG builder (#23210) 2022-03-15 21:05:12 -05:00
Siyuan (Ryans) Zhuang
499c242f0f
[workflow] More tests for unifying workflow and remote function ObjectRef behavior (#23174)
* add more tests
2022-03-15 16:42:27 -07:00
Antoni Baum
630985e3bb
[ML] XGBoost&LightGBMTrainer interfaces (#23192)
Adds interfaces for `XGBoostTrainer` and `LightGBMTrainer`.
2022-03-15 16:16:30 -07:00
Simon Mo
823dbd06a8
[Serve] Add DeploymentNode implementation on top of existing DAG codebase (#23177) 2022-03-15 16:06:57 -07:00
shrekris-anyscale
57871816d4
[serve] Fix TestGetDeploymentImportPath on Windows (#23201) 2022-03-15 15:48:48 -07:00
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00