Commit graph

11709 commits

Author SHA1 Message Date
shrekris-anyscale
57871816d4
[serve] Fix TestGetDeploymentImportPath on Windows (#23201) 2022-03-15 15:48:48 -07:00
Tomas Babej
7a1d10a3d0
[Job submission] Set headers when establishing websocket (#23111) 2022-03-15 16:20:44 -05:00
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00
siddgoel
0722cbb37e
Add support for snappy text decompression #22298 (#22486)
Adds a streaming-based reading option for Snappy-compressed files. Arrow doesn't support streaming Snappy decompression because the canonical C++ Snappy library doesn't natively support it. This PR works around that by doing streaming reads of Snappy-compressed files with the streaming decompression API provided in the [python-snappy](https://github.com/andrix/python-snappy) package.

This commit supplies a custom datasource that uses Arrow + [python-snappy](https://github.com/andrix/python-snappy) to read and decompress Snappy-compressed files.

Co-authored-by: siddharth.goel <siddharth.goel@bytedance.com>
Co-authored-by: Chen Shen <scv119@gmail.com>
2022-03-15 13:52:22 -07:00
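
Note on 0722cbb37e: the streaming idea described in that commit can be illustrated roughly as follows. This is a minimal sketch, not the datasource added by the PR. It uses the standard library's `zlib.decompressobj` as a stand-in codec so that it runs without extra dependencies; `iter_decompressed` and `chunk_size` are hypothetical names. With python-snappy installed, that package's streaming decompressor would presumably take the place of the zlib object.

```python
import io
import zlib


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed bytes from a compressed file-like object.

    zlib is a stand-in codec here (an assumption made for runnability).
    The point is that the input is read `chunk_size` bytes at a time
    instead of being loaded into memory as a whole.
    """
    decompressor = zlib.decompressobj()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever the decompressor still has buffered.
    tail = decompressor.flush()
    if tail:
        yield tail


payload = b"hello ray datasets\n" * 1000
compressed = io.BytesIO(zlib.compress(payload))
restored = b"".join(iter_decompressed(compressed, chunk_size=128))
```

The generator form lets a reader hand each decompressed piece to a line or record parser as soon as it arrives.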
Eric Liang
ca1100397e
Update paper links to include exoshuffle and remove whitepaper (moved to docs) (#23099) 2022-03-15 13:12:01 -07:00
Amog Kamsetty
1572130a4e
[ml/train] Trainer interfaces [4/4]: TorchTrainer interface (#22989)
Interface for TorchTrainer

Depends on #22988
2022-03-15 12:47:44 -07:00
Antoni Baum
a8fbb4accc
[ML] XGBoost&LightGBMPredictor implementation (#23143)
Implementation for XGBoostPredictor & LightGBMPredictor.

The interface has been modified slightly.
2022-03-15 12:44:50 -07:00
Clark Zinzow
1d5f18fe0a
Fix equalized split handling of num_splits == num_blocks case. (#23191) 2022-03-15 12:23:50 -07:00
Yi Cheng
72713e815b
[gcs] Remove use_gcs_for_bootstrap in other python modules. 2022-03-15 12:23:10 -07:00
Siyuan (Ryans) Zhuang
761f927720
[Lint] Cleanup incorrectly formatted strings (Part 2: Tune) (#23129) 2022-03-15 12:17:47 -07:00
Archit Kulkarni
fc182006ec
[Doc] Add missing runtime context namespace doc (#23120)
The public field RuntimeContext.namespace didn't have a docstring so it wasn't showing up at all in the docs. This PR adds a basic docstring.
2022-03-15 11:46:09 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows users to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
2022-03-15 11:07:34 -07:00
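
Note on c694ed4594: limiting sources of randomness, as that commit describes, usually comes down to seeding every random number generator in play and asking the framework for deterministic kernels. The sketch below is an illustration under that assumption, not the code of `enable_reproducibility`; `seed_everything` is a hypothetical helper, and the NumPy and PyTorch parts are skipped when those packages are not installed.

```python
import random


def seed_everything(seed):
    """Seed the random number generators that are available (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        # Prefer deterministic kernels; ops that have none raise an error.
        torch.use_deterministic_algorithms(True)
        # Disable cuDNN autotuning, which may select different kernels per run.
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(0)
first = [random.random() for _ in range(3)]
seed_everything(0)
second = [random.random() for _ in range(3)]
```

Re-seeding with the same value yields the same sequence, which is the property a benchmark comparison relies on.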
Siyuan (Ryans) Zhuang
0c74ecad12
[Lint] Cleanup incorrectly formatted strings (Part 1: RLLib). (#23128) 2022-03-15 17:34:21 +01:00
xwjiang2010
99d5288bbd
[tune] Better error msg for grpc resource exhausted error. (#22806) 2022-03-15 16:01:40 +00:00
shrekris-anyscale
bf1bd293f4
[serve] Make deployments in Application use only import paths (#23027)
`Application` stores a group of deployments and can write them to a YAML config. However, this requires the deployments to use import paths as their `func_or_class`. This change makes all deployments in an `Application` store only import paths as the `func_or_class`.

This change also adds a utility function to get a deployment's import path. This utility function is used in the DeploymentNode for Pipelines.
2022-03-15 10:48:35 -05:00
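
Note on bf1bd293f4: an import path is a string from which the original function or class can be re-imported, which is what makes it safe to write to a YAML config. A minimal sketch of the round trip follows. `get_import_path` and `import_from_path` are hypothetical helpers, not the utility added by the PR, and the dotted `module.attr` format is an assumption.

```python
import importlib
import json


def get_import_path(obj):
    """Return 'module.qualname' for a module-level function or class."""
    return f"{obj.__module__}.{obj.__qualname__}"


def import_from_path(path):
    """Resolve a 'module.attr' import path back to the object it names."""
    module_name, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


path = get_import_path(json.dumps)
resolved = import_from_path(path)
```

This only works for objects defined at module level; nested functions and lambdas have no importable path.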
Fabien Couthouis
e575ed3350
[RLlib] Fix AttributeError with None obs shape + tf in _unpack_obs() utility (#22428) 2022-03-15 16:34:31 +01:00
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Qing Wang
f51cb09e02
[Core][Java][Remove JVM FullGC 2/N] Make JVM be aware of in-memory store pressure. (#21441) 2022-03-15 19:25:27 +08:00
Max Pumperla
ad30123339
[docs] fix includes for md files (#23180)
Included content in md files, such as our central getting started page, didn't render. This PR fixes that.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-15 11:09:18 +00:00
Pamphile Roy
81b17669a4
[core][docs] Document port/IP binding and slurm concerns (#22663)
Using Ray on SLURM systems is documented, but the docs are missing some networking pitfalls. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later).

I did not put any real recommendation on this last point since `--address` did not work. I hit a "cannot resolve" issue after setting an internal IP, even though the IP is reachable.
2022-03-15 01:43:46 -07:00
Guyang Song
f65971756d
[dashboard agent] Catch agent port conflict (#23024) 2022-03-15 16:09:15 +08:00
Chen Shen
5a2ebc281c
[Scheduler] separate scheduler code to its own build target (#23124)
* wip

* comments

* fix build

* fix-test

* fix format
2022-03-14 23:23:58 -07:00
Kai Yang
35c7275bfc
[Object Spilling] Handle IO worker failures correctly (#20752)
Currently, when a spill/restore worker fails while it is idle in the worker pool, the worker pool does not clean up the worker's metadata. Subsequent spill/restore requests reuse this dead worker, and their RPC requests cannot succeed. This breaks object spilling.

This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.
2022-03-15 12:14:14 +08:00
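
Note on 35c7275bfc: the fix described in that commit amounts to removing a disconnected worker from every collection that could hand it out again. The toy model below shows the idea in Python. The real worker pool is not written this way, and `IOWorkerPool` and its methods are hypothetical. Only the names `registered_io_workers` and `idle_io_workers` come from the commit message.

```python
class IOWorkerPool:
    """Toy model of a pool that tracks registered and idle I/O workers."""

    def __init__(self):
        self.registered_io_workers = set()
        self.idle_io_workers = set()

    def register(self, worker_id):
        """A new worker starts out registered and idle."""
        self.registered_io_workers.add(worker_id)
        self.idle_io_workers.add(worker_id)

    def pop_idle(self):
        """Hand out an idle worker, or None if there is none."""
        return self.idle_io_workers.pop() if self.idle_io_workers else None

    def on_disconnect(self, worker_id):
        """Drop a dead worker from both collections so it is never reused."""
        self.registered_io_workers.discard(worker_id)
        self.idle_io_workers.discard(worker_id)


pool = IOWorkerPool()
pool.register("spill-worker-1")
pool.on_disconnect("spill-worker-1")  # the worker died while idle
next_worker = pool.pop_idle()
```

Without the `on_disconnect` cleanup, `pop_idle` would return the dead worker and the next spill request would be sent to it.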
Kai Yang
041f98d5dd
Fix or remove unnecessary action_env settings in .bazelrc (#21307)
`PATH` is easily changed in a terminal session, and different `$PATH` values lead to Bazel cache misses. For example, `pip install python -e` and `bazel build //:all` don't share a cache because Python modifies `PATH`.

`LC_ALL`, `LANG`, and Python-related environment variables are only used by C++ worker tests, which invoke the `ray start` command when run with `bazel test`. The Java worker is not affected because we don't use `bazel test` to run Java tests. So these env variables should stay in `test_env`, not `action_env`.

This PR can greatly improve the cache hit rate of Bazel build and test.
2022-03-15 12:13:13 +08:00
Jules S. Damji
0246f3532e
[DOC] Added a full example how to access deployments (#22401) 2022-03-14 21:15:52 -05:00
mwtian
6eb805b357
[CI] remove GCS-Ray CI tests (#23149)
* remove redis ci tests

* remove mac
2022-03-14 18:18:59 -07:00
Antoni Baum
447a98eed1
[ML] TensorflowPredictor implementation (#23146)
Implementation for TensorflowPredictor.
2022-03-14 17:02:21 -07:00
Archit Kulkarni
5ecd88e2e0
[runtime env] Keep existing PYTHONPATH when using runtime env (#23144) 2022-03-14 18:59:50 -05:00
Stephanie Wang
7235541393
[Datasets] Use multithreading to submit DatasetPipeline stages (#22912)
Previously DatasetPipeline stages were executed by one actor each, which compromised fault tolerance through lineage reconstruction. This centralizes all task submission at the pipeline coordinator to improve fault tolerance. To preserve pipeline parallelism, the stages are executed by a threadpool. To clean up the threadpool, the pipeline coordinator adds any running threads to a global set that is checked by the threads during `ray.wait`.

Note that this will only provide fault tolerance for split pipes if all pipeline consumers stay alive. It will not work if one of the consumers dies and restarts because next_dataset_if_ready is not idempotent.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-03-14 16:57:02 -07:00
Edward Oakes
f646d3fc31
[serve] Add unimplemented interfaces for Deployment DAG APIs (#23125)
Adds the following interfaces (without implementation, for discussion / approval):
- `serve.Application`
- `serve.DeploymentNode`
- `serve.DeploymentMethodNode`, `serve.DAGHandle`, and `serve.drivers.PipelineDriver`
- `serve.run` & `serve.build`

In addition to these Python APIs, we will also support the following CLI commands:
- `serve run [--blocking=true] my_file:my_node_or_app # Uses Ray client, blocking by default.`
- `serve build my_file:my_node output_path.yaml`
- `serve deploy [--blocking=false] # Uses REST API, non-blocking by default.`
- `serve status [--watch=false] # Uses REST API, non-blocking by default.`
2022-03-14 18:53:08 -05:00
Amog Kamsetty
154edce2a4
[ml] Don't require preprocessor in TorchPredictor (#23163) 2022-03-14 16:33:22 -07:00
Antoni Baum
6a1e336b24
[tune] Add CV support for XGB/LGBM Tune callbacks (#22882)
Adds the ability for users to specify a custom results post-processing function that is applied to metrics before they are reported to Tune in the XGBoost/LightGBM integration callbacks, which makes it possible to support xgb.cv/lgbm.cv. Updates the example to show it in action and in CI.
2022-03-14 21:00:39 +00:00
Archit Kulkarni
e8496374e2
[Jobs] Test job submit with no specified ray address (#23119) 2022-03-14 13:44:06 -05:00
Edward Oakes
5d501e3b28
[serve] Polish help info on the CLI (#23026)
Closes https://github.com/ray-project/ray/issues/23015
2022-03-14 12:38:17 -05:00
Amog Kamsetty
7dcba48034
[ml] TorchPredictor implementation (#23123)
Implementation for TorchPredictor.
2022-03-14 10:28:22 -07:00
Kai Fricke
15aeb33e50
[ci/release] Support PR wheels (#23084)
This PR adds support for finding wheels for PRs to run OSS release tests on, i.e. it makes `--ray-wheels user:branch` work.
2022-03-14 17:24:13 +00:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
d93fa95dd5
[ci/release] Only report results for scheduled builds (#23135)
Currently, all Buildkite runs report results by default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.
2022-03-14 15:10:16 +00:00
Kai Fricke
fce49694fc
[ci/release] Disable infra retries for now (#23132)
Infra errors are now handled with concurrency groups, so we can disable old mitigation methods like automatic infra retry for now.
We keep the script because it performs other logic (e.g. checking out a local test branch), and infra retry can still be enabled via an env variable if needed.
2022-03-14 11:51:11 +00:00
Kai Fricke
830238cce2
[ci/release] Migrate ML user tests (#22953)
Most recent tests:

https://buildkite.com/ray-project/release-tests-branch/builds/156
https://buildkite.com/ray-project/release-tests-branch/builds/158
2022-03-14 11:50:16 +00:00
SangBin Cho
2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR improves broken k8s tests.

- Use exponential backoff on the unstable HTTP path (getting the job status sometimes hits a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this is happening).
- Fix the resource leak check in the benchmark tests. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal the available resources. I fixed the issue by not checking node IP resources.
- The K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. This increases the cost a bit, but not by much.
2022-03-14 03:49:15 -07:00
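
Note on 2c2d96eeb1: exponential backoff on a flaky call, as mentioned in that commit, can be sketched as below. This is a generic illustration rather than the code in the PR. `retry_with_backoff`, its parameters, and `flaky_get_job_status` are hypothetical, and the `sleep` parameter exists only so the example runs without real delays.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying on `retry_on` with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


calls = {"n": 0}


def flaky_get_job_status():
    """Hypothetical status call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "SUCCEEDED"


status = retry_with_backoff(flaky_get_job_status, sleep=lambda seconds: None)
```

Capping the delay with `max_delay` keeps a long outage from stretching a single wait indefinitely.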
Jiaxin Shan
8823ca48b4
[Workflow] Improve workflow docs (#23114)
* [Workflow] Improve workflow docs

* Update doc/source/workflows/concepts.rst

Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
2022-03-13 18:55:45 -07:00
Jiajun Yao
e4620669a1
[Release Test] Add perf metrics for core scalability tests (#23110)
* Add perf metrics for core scalability tests

* lint
2022-03-14 10:20:39 +09:00
Amog Kamsetty
86b79b68be
[ml/train] Training Interfaces [2/4]: Update interface for Trainer (#22986) 2022-03-13 18:09:50 -07:00
Scott Graham
f673acb0ad
Scgraham/azure docs (#22296)
Fixes a potential error when a function is not found in the Azure SDK while deploying a Ray cluster on Azure.
Adds to the docs an additional Python package needed to deploy a Ray cluster on Azure.

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-03-13 18:08:08 -07:00
Antoni Baum
5d3fc5a677
[ML] Add XGBoostPredictor & LightGBMPredictor interfaces (#23073)
Adds `XGBoostPredictor` and `LightGBMPredictor` interfaces.
2022-03-13 15:22:52 -07:00
Antoni Baum
f4ffba8a78
[ML] Add TensorflowPredictor interface (#23070)
Adds interface for TensorflowPredictor.
2022-03-13 15:20:03 -07:00
Kai Fricke
430ea3e636
[ci/release] Migrate golden notebook tests (#22949)
Migrating golden notebook tests to new release test package.
Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155
2022-03-13 21:39:41 +00:00
Kai Fricke
956ad95d67
[ci/release] Fix release test config (#23122)
Currently the test is failing due to an invalid config (merged before validation was properly enforced).
2022-03-13 19:48:34 +00:00