Commit graph

6285 commits

Amog Kamsetty
1572130a4e
[ml/train] Trainer interfaces [4/4]: TorchTrainer interface (#22989)
Interface for TorchTrainer

Depends on #22988
2022-03-15 12:47:44 -07:00
Antoni Baum
a8fbb4accc
[ML] XGBoost&LightGBMPredictor implementation (#23143)
Implementation for XGBoostPredictor & LightGBMPredictor.

The interface has been modified slightly.
2022-03-15 12:44:50 -07:00
Clark Zinzow
1d5f18fe0a
Fix equalized split handling of num_splits == num_blocks case. (#23191) 2022-03-15 12:23:50 -07:00
Yi Cheng
72713e815b
[gcs] Remove use_gcs_for_bootstrap in other python modules. 2022-03-15 12:23:10 -07:00
Siyuan (Ryans) Zhuang
761f927720
[Lint] Cleanup incorrectly formatted strings (Part 2: Tune) (#23129) 2022-03-15 12:17:47 -07:00
Archit Kulkarni
fc182006ec
[Doc] Add missing runtime context namespace doc (#23120)
The public field RuntimeContext.namespace didn't have a docstring, so it wasn't showing up at all in the docs. This PR adds a basic docstring.
2022-03-15 11:46:09 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows users to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
2022-03-15 11:07:34 -07:00
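The seeding approach the commit describes could be sketched roughly as below. This is a hypothetical helper, not Ray Train's actual `enable_reproducibility`; the PyTorch-specific calls recommended by the linked guide are shown as comments since they require a GPU-capable PyTorch install to demonstrate:

```python
import random

import numpy as np


def enable_reproducibility(seed: int = 0) -> None:
    """Seed the common sources of randomness (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # With PyTorch available, the randomness guide additionally recommends:
    # torch.manual_seed(seed)
    # torch.use_deterministic_algorithms(True)
    # torch.backends.cudnn.benchmark = False


enable_reproducibility(42)
a = np.random.rand(3)
enable_reproducibility(42)
b = np.random.rand(3)
assert (a == b).all()  # identical seeds give identical draws
```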
xwjiang2010
99d5288bbd
[tune] Better error msg for grpc resource exhausted error. (#22806) 2022-03-15 16:01:40 +00:00
shrekris-anyscale
bf1bd293f4
[serve] Make deployments in Application use only import paths (#23027)
`Application` stores a group of deployments and can write them to a YAML config. However, this requires the deployments to use import paths as their `func_or_class`. This change makes all deployments in an `Application` store only import paths as the `func_or_class`.

This change also adds a utility function to get a deployment's import path. This utility function is used in the DeploymentNode for Pipelines.
2022-03-15 10:48:35 -05:00
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Guyang Song
f65971756d
[dashboard agent] Catch agent port conflict (#23024) 2022-03-15 16:09:15 +08:00
Kai Yang
35c7275bfc
[Object Spilling] Handle IO worker failures correctly (#20752)
Currently, when a spill/restore worker fails while its state in the worker pool is idle, the worker pool will not clean up the worker's metadata. Subsequent spill/restore requests will reuse this dead worker, and RPC requests cannot succeed. This results in broken object spilling functionality.

This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.
2022-03-15 12:14:14 +08:00
Jules S. Damji
0246f3532e
[DOC] Added a full example how to access deployments (#22401) 2022-03-14 21:15:52 -05:00
Antoni Baum
447a98eed1
[ML] TensorflowPredictor implementation (#23146)
Implementation for TensorflowPredictor.
2022-03-14 17:02:21 -07:00
Archit Kulkarni
5ecd88e2e0
[runtime env] Keep existing PYTHONPATH when using runtime env (#23144) 2022-03-14 18:59:50 -05:00
Stephanie Wang
7235541393
[Datasets] Use multithreading to submit DatasetPipeline stages (#22912)
Previously DatasetPipeline stages were executed by one actor each, which compromised fault tolerance through lineage reconstruction. This centralizes all task submission at the pipeline coordinator to improve fault tolerance. To preserve pipeline parallelism, the stages are executed by a threadpool. To clean up the threadpool, the pipeline coordinator adds any running threads to a global set that is checked by the threads during `ray.wait`.

Note that this will only provide fault tolerance for split pipes if all pipeline consumers stay alive. It will not work if one of the consumers dies and restarts because next_dataset_if_ready is not idempotent.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-03-14 16:57:02 -07:00
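The centralized-submission idea above can be sketched as follows. All names here are hypothetical and Ray itself is omitted; the sketch only illustrates a coordinator-owned threadpool plus the global thread set the commit mentions:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Global set of running stage threads, checked by the coordinator while
# it waits (per the commit, this is how the threadpool gets cleaned up).
_active_threads = set()


def _run_stage(stage_fn, block):
    _active_threads.add(threading.current_thread())
    try:
        return stage_fn(block)
    finally:
        _active_threads.discard(threading.current_thread())


def submit_stages(stage_fns, blocks):
    # The coordinator centralizes all task submission; executing stages on
    # a threadpool (rather than one actor per stage) preserves parallelism
    # while keeping lineage at a single, reconstructable owner.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for fn in stage_fns:
            blocks = list(pool.map(lambda b, f=fn: _run_stage(f, b), blocks))
    return blocks


out = submit_stages([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
# out == [4, 6, 8]
```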
Edward Oakes
f646d3fc31
[serve] Add unimplemented interfaces for Deployment DAG APIs (#23125)
Adds the following interfaces (without implementation, for discussion / approval):
- `serve.Application`
- `serve.DeploymentNode`
- `serve.DeploymentMethodNode`, `serve.DAGHandle`, and `serve.drivers.PipelineDriver`
- `serve.run` & `serve.build`

In addition to these Python APIs, we will also support the following CLI commands:
- `serve run [--blocking=true] my_file:my_node_or_app # Uses Ray client, blocking by default.`
- `serve build my_file:my_node output_path.yaml`
- `serve deploy [--blocking=false] # Uses REST API, non-blocking by default.`
- `serve status [--watch=false] # Uses REST API, non-blocking by default.`
2022-03-14 18:53:08 -05:00
Amog Kamsetty
154edce2a4
[ml] Don't require preprocessor in TorchPredictor (#23163) 2022-03-14 16:33:22 -07:00
Antoni Baum
6a1e336b24
[tune] Add CV support for XGB/LGBM Tune callbacks (#22882)
Adds the ability for users to specify a custom results post-processing function that will be applied to metrics before they are reported to Tune in the XGBoost/LightGBM integration callbacks, allowing support for `xgb.cv`/`lgbm.cv`. Updates the example to show it in action and in CI.
2022-03-14 21:00:39 +00:00
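A post-processing function of the kind described might look like the sketch below; the name and shape are hypothetical (the real callback API may differ). It collapses per-fold CV metrics into a single mean before they would be reported to Tune:

```python
def mean_cv_metrics(results: dict) -> dict:
    """Collapse per-fold CV metrics (lists) into their mean.

    Hypothetical post-processor of the kind xgb.cv / lgbm.cv callbacks
    would need, applied before metrics are reported to Tune.
    """
    return {name: sum(folds) / len(folds) for name, folds in results.items()}


report = mean_cv_metrics({"valid-logloss": [0.30, 0.20, 0.25]})
assert abs(report["valid-logloss"] - 0.25) < 1e-9
```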
Edward Oakes
5d501e3b28
[serve] Polish help info on the CLI (#23026)
Closes https://github.com/ray-project/ray/issues/23015
2022-03-14 12:38:17 -05:00
Amog Kamsetty
7dcba48034
[ml] TorchPredictor implementation (#23123)
Implementation for TorchPredictor.
2022-03-14 10:28:22 -07:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
SangBin Cho
2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR improves broken k8s tests.

- Use exponential backoff on the unstable HTTP path (getting job status sometimes gets a broken connection from the server; I couldn't find the relevant logs to figure out why this happens, unfortunately).
- Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal the available resources. I fixed the issue by not checking node IP resources.
- K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but not by much.
2022-03-14 03:49:15 -07:00
Amog Kamsetty
86b79b68be
[ml/train] Training Interfaces [2/4]: Update interface for Trainer (#22986) 2022-03-13 18:09:50 -07:00
Scott Graham
f673acb0ad
Scgraham/azure docs (#22296)
Fixes a potential error if a function is not found in the Azure SDK when deploying a Ray cluster on Azure.
Adds to the docs the additional Python package needed to deploy a Ray cluster on Azure.

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-03-13 18:08:08 -07:00
Antoni Baum
5d3fc5a677
[ML] Add XGBoostPredictor & LightGBMPredictor interfaces (#23073)
Adds `XGBoostPredictor` and `LightGBMPredictor` interfaces.
2022-03-13 15:22:52 -07:00
Antoni Baum
f4ffba8a78
[ML] Add TensorflowPredictor interface (#23070)
Adds interface for TensorflowPredictor.
2022-03-13 15:20:03 -07:00
Siyuan (Ryans) Zhuang
9f607c2165
Revert "Revert "[workflow] Convert DAG to workflow (#22925)"" (#23095)
* Revert "Revert "[workflow] Convert DAG to workflow (#22925)" (#23081)"

This reverts commit 28d597e009.

* rename _bind() -> bind()

* rename _apply_recursive() -> apply_recursive()
2022-03-12 02:08:25 -08:00
Chong-Li
f7e1343d39
[GCS] Fix the normal task resources at GCS (#22857)
* Fix the normal task resources at GCS

* Fix comments

* Leave a TODO

* Bring back a UT

* consider object memory

* Fix

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-11 21:54:03 -08:00
jon-chuang
0b54d9c780
[GCS] Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762)
Previously, placement groups had suboptimal bin-packing, resulting in unexpected placement group stalls for users.

The root cause was the lack of sorting of PG bundles by resource priority and size.

This PR implements a naive priority mechanism for bundles that can be improved upon (and even made user-configurable in the future) in the GCS resource scheduler.

The behavior is to schedule "GPU" first, custom resources in int64_t order next, and finally memory, with "CPU" last.
2022-03-11 21:47:07 -08:00
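The priority order described above ("GPU" first, custom resources next, memory, then "CPU" last) could be modeled like this; this is an illustrative Python sketch, not the actual C++ GCS scheduler code, and all names are hypothetical:

```python
# Lower number = higher scheduling priority; unlisted (custom) resources
# fall in between, per the commit's described ordering.
PRIORITY = {"GPU": 0, "memory": 2, "CPU": 3}


def sort_bundles(bundles):
    """Order bundles by their highest-priority resource, larger demands first."""
    def key(bundle):
        # min() picks the bundle's most important (lowest-numbered) resource;
        # negating the amount places bigger demands earlier for bin-packing.
        return min(
            (PRIORITY.get(name, 1), -amount) for name, amount in bundle.items()
        )
    return sorted(bundles, key=key)


bundles = [{"CPU": 4}, {"GPU": 1, "CPU": 2}, {"custom": 8}]
ordered = sort_bundles(bundles)
assert ordered[0] == {"GPU": 1, "CPU": 2}  # GPU bundle is placed first
```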
Jiajun Yao
4016dba3d3
Add usage stats heads up message (#22985) 2022-03-11 21:37:22 -08:00
mwtian
aad6f41593
[Tune] Remove unused autogluon requirement (#16587)
`autogluon` does not support Python 3.9, and Ray does not appear to import it anywhere.
2022-03-11 16:54:23 -08:00
Amog Kamsetty
2294a7ed47
[ml] TorchPredictor interface (#22990) 2022-03-11 16:00:53 -08:00
Siyuan (Ryans) Zhuang
be7ccb7dac
[core][serialization] Fix registering serializer before initializing Ray. (#23031)
* Support registering serializer before initializing Ray.

* add test
2022-03-11 15:13:18 -08:00
Yi Cheng
4f86b5b523
[gcs] Remove use_gcs_for_bootstrap in core (python) and autoscaler (#23050)
This is part of the cleanup for Redisless Ray. This PR removes use_gcs_for_bootstrap in core (Python) and the autoscaler.
2022-03-11 14:36:16 -08:00
Peng Yu
252ba6cecd
Correct documentation in ActorPoolStrategy (#23079) 2022-03-11 13:27:55 -08:00
Simon Mo
2f2fc97bd1
Don't symlink Serve in setup-dev (#23092) 2022-03-11 13:21:00 -08:00
Jian Xiao
e9ae784e62
Make schema() read non-disruptive to iter_datasets() (#23032)
Currently, reading the schema of a DatasetPipeline is disruptive and will invalidate iter_datasets().
2022-03-11 12:01:24 -08:00
Patrick Ames
1d48c8dc75
[Datasets] Support dataset metadata provider callbacks in read APIs. (#22896)
These changes add Dataset Read API support for (1) specifying custom block metadata provider callbacks, and (2) skipping path expansion. When paired with a custom block metadata provider that maintains an in-memory cache of BlockMetadata for each input file path, these changes reduced average S3-based dataset read times for production [Redshift Manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) stored in Amazon's internal data catalog by over 90%.  A simple ParquetDatasource benchmark reading 144MM records across 100 ~70MiB (on-disk) Parquet files stored in S3 showed an ~75% reduction in read latency (from 4.62 seconds to 1.18 seconds on 2 r5n.8xlarge EC2 nodes).
2022-03-11 11:52:56 -08:00
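A caching block-metadata provider in the spirit of this commit might look like the sketch below; `CachingMetadataProvider` and `fetch_fn` are hypothetical names, not the actual Datasets API. The point is that metadata for each input file path is served from an in-memory cache, avoiding repeated per-file storage round trips:

```python
class CachingMetadataProvider:
    """Hypothetical provider that caches per-path block metadata in memory."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn  # e.g. a HEAD request against S3
        self._cache = {}

    def get_metadata(self, path):
        # Only the first lookup for a path hits remote storage.
        if path not in self._cache:
            self._cache[path] = self._fetch(path)
        return self._cache[path]


calls = []
provider = CachingMetadataProvider(lambda p: calls.append(p) or {"size": 1})
provider.get_metadata("a.parquet")
provider.get_metadata("a.parquet")
assert calls == ["a.parquet"]  # second lookup is served from the cache
```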
xwjiang2010
5d776b00e6
[tuner] fix result_grid (#23078) 2022-03-11 11:34:44 -08:00
xwjiang2010
f270d84094
[AIR] switch to a common RunConfig. (#23076) 2022-03-11 10:55:36 -08:00
Stephanie Wang
28d597e009
Revert "[workflow] Convert DAG to workflow (#22925)" (#23081)
This reverts commit 0a9f966e63.
2022-03-11 09:49:08 -08:00
shrekris-anyscale
665bdbff47
[serve] Exclude unset fields from Ray actor options (#23059)
The `schema_to_deployment()` function preserves unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.
2022-03-11 10:45:21 -06:00
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storage as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to a S3 bucket resulted in the following runtimes.

Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with more memory available, it is worth configuring a larger buffer size.

For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.

Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742


Together these changes enable faster large-scale object spilling.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
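A spilling configuration along the lines the commit describes might be assembled as below. The exact keys (`smart_open`, `buffer_size`, a list of `uri` entries for multiple prefixes) are assumptions about the shape of Ray's object spilling config at the time, so treat this as a sketch:

```python
import json

# Hypothetical spilling config following the commit's findings: a 1 MB
# write buffer (the new default) and several S3 prefixes to spread
# throughput across buckets/prefixes.
spilling_config = json.dumps({
    "type": "smart_open",
    "params": {
        "uri": [
            "s3://bucket/spill-1",
            "s3://bucket/spill-2",
            "s3://bucket/spill-3",
        ],
        "buffer_size": 1024 * 1024,  # 1 MB buffer, per the benchmark above
    },
})

# This JSON string would then be passed to Ray at startup, e.g.:
# ray.init(_system_config={"object_spilling_config": spilling_config})
```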
Kai Fricke
61295f8b58
[ml/checkpoint] Fix checkpoint location on remote node (#23068)
Currently breaks tests where the checkpoint is stored on a remote node (e.g. via Ray client), e.g.: https://buildkite.com/ray-project/release-tests-branch/builds/132#6a4936a8-41dd-4fd2-9f02-976855cbd9b7
Instead, we can set the properties manually.
In the future, we need a story on how to refer to checkpoints kept on remote nodes.
2022-03-11 15:38:21 +00:00
Jialing He
0cbbb8c1d0
[runtime env][core] Use Proto message RuntimeEnvInfo between user code and core_worker (#22856) 2022-03-11 22:14:18 +08:00
Kai Fricke
aed17dd346
Revert "Revert "[ml/tune] Expose new checkpoint interface to users (#22741)" (#23006)" (#23009)
This reverts commit 85598d9d10.

Test breakage was unrelated.
2022-03-11 09:51:41 +00:00
Jialing He
0c5440ee72
[runtime env] Deletes the proto cache on RuntimeEnv (#22944)
Mainly the following things:
- This PR deletes the proto cache on RuntimeEnv, ensuring that the user's modification of RuntimeEnv can take effect in the Proto message.
- Validates the whole runtime env when serializing runtime_env.
- Overloads the `__setitem__` method to parse and validate a field whenever it is modified.
2022-03-11 15:37:18 +08:00
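The validate-on-modify idea from the last bullet can be sketched with a plain dict subclass; `ValidatedEnv` is hypothetical and far simpler than the real RuntimeEnv class, but shows why overriding `__setitem__` removes the need for a separate proto cache: every field is checked at the moment it changes.

```python
class ValidatedEnv(dict):
    """Hypothetical dict that parses/validates fields on assignment."""

    # Per-field validators; real RuntimeEnv validates many more fields.
    _validators = {"pip": lambda v: isinstance(v, list)}

    def __setitem__(self, key, value):
        check = self._validators.get(key)
        if check is not None and not check(value):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
        super().__setitem__(key, value)


env = ValidatedEnv()
env["pip"] = ["requests"]   # a list: accepted
try:
    env["pip"] = "requests"  # not a list: rejected at assignment time
except ValueError:
    pass
```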
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Amog Kamsetty
f80602b7d2
[Datasets] Separate pandas to torch conversion in to_torch (#22939)
Separate out the conversion of a pandas dataframe to a torch tensor into a utility function so that the same logic can be used in other places in Ray ML (for example, during inference).
2022-03-10 20:40:01 -08:00