Commit graph

6412 commits

Author SHA1 Message Date
Amog Kamsetty
86b79b68be
[ml/train] Training Interfaces [2/4]: Update interface for Trainer (#22986) 2022-03-13 18:09:50 -07:00
Scott Graham
f673acb0ad
Scgraham/azure docs (#22296)
Fixes a potential error when a function is not found in the Azure SDK while deploying a Ray cluster on Azure.
Adds to the docs an additional Python package needed to deploy a Ray cluster on Azure.

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-03-13 18:08:08 -07:00
Antoni Baum
5d3fc5a677
[ML] Add XGBoostPredictor & LightGBMPredictor interfaces (#23073)
Adds `XGBoostPredictor` and `LightGBMPredictor` interfaces.
2022-03-13 15:22:52 -07:00
Antoni Baum
f4ffba8a78
[ML] Add TensorflowPredictor interface (#23070)
Adds interface for TensorflowPredictor.
2022-03-13 15:20:03 -07:00
Siyuan (Ryans) Zhuang
9f607c2165
Revert "Revert "[workflow] Convert DAG to workflow (#22925)"" (#23095)
* Revert "Revert "[workflow] Convert DAG to workflow (#22925)" (#23081)"

This reverts commit 28d597e009.

* rename _bind() -> bind()

* rename _apply_recursive() -> apply_recursive()
2022-03-12 02:08:25 -08:00
Chong-Li
f7e1343d39
[GCS] Fix the normal task resources at GCS (#22857)
* Fix the normal task resources at GCS

* Fix comments

* Leave a TODO

* Bring back a UT

* consider object memory

* Fix

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-11 21:54:03 -08:00
jon-chuang
0b54d9c780
[GCS] Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762)
Previously, placement groups had suboptimal bin-packing, resulting in unexpected placement group stalls for users.

The root cause is that sorting of PG bundles by resource priority and size was not implemented.

This PR implements a naive priority mechanism for bundles in the GCS resource scheduler that can be improved upon (and even made user-configurable in the future).

The behaviour is to schedule: "GPU" first, custom resources in int64_t order next, and finally, memory and then "CPU" last.
2022-03-11 21:47:07 -08:00
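The ordering described in #22762 can be sketched in a few lines of Python. This is only an illustration, not the GCS scheduler's C++ code: bundles are modeled as plain dicts, `bundle_priority` and `sort_bundles` are hypothetical helpers, and custom resources are compared by their summed amount as a stand-in for the int64_t ordering the commit mentions.

```python
from typing import Dict, List, Tuple

Bundle = Dict[str, float]

# Resource names given a fixed position in the priority order.
PREDEFINED = ("GPU", "memory", "CPU")


def bundle_priority(bundle: Bundle) -> Tuple[float, float, float, float]:
    """Build a sort key: GPU first, then custom resources, then memory, then CPU."""
    # Assumption: custom resources are collapsed into one summed amount.
    custom_total = sum(
        amount for name, amount in bundle.items() if name not in PREDEFINED
    )
    return (
        bundle.get("GPU", 0.0),
        custom_total,
        bundle.get("memory", 0.0),
        bundle.get("CPU", 0.0),
    )


def sort_bundles(bundles: List[Bundle]) -> List[Bundle]:
    """Return bundles with the highest-priority, largest requests first."""
    return sorted(bundles, key=bundle_priority, reverse=True)


bundles = [{"CPU": 4}, {"memory": 100}, {"accelerator_x": 2}, {"GPU": 1, "CPU": 1}]
ordered = sort_bundles(bundles)
```

Placing the scarcest resources first means the bundles that are hardest to fit are packed before the flexible CPU-only ones.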
Jiajun Yao
4016dba3d3
Add usage stats heads up message (#22985) 2022-03-11 21:37:22 -08:00
mwtian
aad6f41593
[Tune] Remove unused autogluon requirement (#16587)
`autogluon` does not support Python 3.9, and Ray does not seem to import it anywhere.
2022-03-11 16:54:23 -08:00
Amog Kamsetty
2294a7ed47
[ml] TorchPredictor interface (#22990) 2022-03-11 16:00:53 -08:00
Siyuan (Ryans) Zhuang
be7ccb7dac
[core][serialization] Fix registering serializer before initializing Ray. (#23031)
* Support registering serializer before initializing Ray.

* add test
2022-03-11 15:13:18 -08:00
Yi Cheng
4f86b5b523
[gcs] Remove use_gcs_for_bootstrap in core (python) and autoscaler (#23050)
This is part of the cleanup for Redisless Ray. This PR removes use_gcs_for_bootstrap in core and autoscaler.
2022-03-11 14:36:16 -08:00
Peng Yu
252ba6cecd
Correct documentation in ActorPoolStrategy (#23079) 2022-03-11 13:27:55 -08:00
Simon Mo
2f2fc97bd1
Don't symlink Serve in setup-dev (#23092) 2022-03-11 13:21:00 -08:00
Jian Xiao
e9ae784e62
Make schema() read non-disruptive to iter_datasets() (#23032)
Currently, reading the schema of a DatasetPipeline is disruptive and invalidates iter_datasets().
2022-03-11 12:01:24 -08:00
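One general way to make a look-ahead read non-disruptive is to buffer the peeked element so the later iteration still sees it. The sketch below shows that idea with a hypothetical `PeekableIterator`; it is not the DatasetPipeline implementation from #23032, and the commit does not state which approach it takes.

```python
from typing import Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class PeekableIterator(Generic[T]):
    """Iterator wrapper whose peek() does not consume the first element."""

    def __init__(self, source: Iterable[T]) -> None:
        self._source: Iterator[T] = iter(source)
        self._buffer: List[T] = []

    def peek(self) -> T:
        # Pull one element from the source and keep it for later iteration.
        if not self._buffer:
            self._buffer.append(next(self._source))
        return self._buffer[0]

    def __iter__(self) -> "PeekableIterator[T]":
        return self

    def __next__(self) -> T:
        # Serve the buffered element first, then continue with the source.
        if self._buffer:
            return self._buffer.pop(0)
        return next(self._source)


windows = PeekableIterator(["window-0", "window-1", "window-2"])
first = windows.peek()      # inspect the first window, e.g. to read a schema
remaining = list(windows)   # iteration still starts at the first window
```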
Patrick Ames
1d48c8dc75
[Datasets] Support dataset metadata provider callbacks in read APIs. (#22896)
These changes add Dataset Read API support for (1) specifying custom block metadata provider callbacks, and (2) skipping path expansion. When paired with a custom block metadata provider that maintains an in-memory cache of BlockMetadata for each input file path, these changes reduced average S3-based dataset read times for production [Redshift Manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) stored in Amazon's internal data catalog by over 90%.  A simple ParquetDatasource benchmark reading 144MM records across 100 ~70MiB (on-disk) Parquet files stored in S3 showed an ~75% reduction in read latency (from 4.62 seconds to 1.18 seconds on 2 r5n.8xlarge EC2 nodes).
2022-03-11 11:52:56 -08:00
xwjiang2010
5d776b00e6
[tuner] fix result_grid (#23078) 2022-03-11 11:34:44 -08:00
xwjiang2010
f270d84094
[AIR] switch to a common RunConfig. (#23076) 2022-03-11 10:55:36 -08:00
Stephanie Wang
28d597e009
Revert "[workflow] Convert DAG to workflow (#22925)" (#23081)
This reverts commit 0a9f966e63.
2022-03-11 09:49:08 -08:00
shrekris-anyscale
665bdbff47
[serve] Exclude unset fields from Ray actor options (#23059)
The `schema_to_deployment()` function preserves unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.
2022-03-11 10:45:21 -06:00
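The idea of dropping unset fields before building actor options can be illustrated with the standard library. `ActorOptionsSchema` and `set_fields_only` are hypothetical names for this sketch, and `None` is assumed to mean "unset"; the Serve code in #23059 may detect unset fields differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ActorOptionsSchema:
    """Hypothetical schema; None marks a field the user did not set."""

    num_cpus: Optional[float] = None
    num_gpus: Optional[float] = None
    memory: Optional[int] = None


def set_fields_only(schema: ActorOptionsSchema) -> Dict[str, Any]:
    """Return only the fields that were explicitly given a value."""
    return {
        f.name: getattr(schema, f.name)
        for f in fields(schema)
        if getattr(schema, f.name) is not None
    }


options = set_fields_only(ActorOptionsSchema(num_cpus=2))
empty_options = set_fields_only(ActorOptionsSchema())
```

With nothing set, the result is an empty dict, so no placeholder default leaks into the options passed downstream.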
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying only the buffer size while spilling to an S3 bucket resulted in the following runtimes.

Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size.

For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.

Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742


Together these changes enable faster large-scale object spilling.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
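The effect of buffering on the number of I/O requests in #22798 can be reproduced locally with Python's `io.BufferedWriter`. This is a sketch under stated assumptions: `CountingRaw` is a hypothetical stand-in for a remote store that only counts write requests, and the sizes are illustrative rather than taken from the benchmark above.

```python
import io
from typing import Tuple


class CountingRaw(io.RawIOBase):
    """Fake remote sink: records how many write requests it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.calls = 0
        self.bytes_written = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        # Each call stands for one request to remote storage.
        self.calls += 1
        self.bytes_written += len(data)
        return len(data)


def count_requests(object_size: int, num_objects: int, buffer_size: int) -> Tuple[int, int]:
    """Write num_objects payloads through a buffer; return (requests, bytes)."""
    raw = CountingRaw()
    payload = b"x" * object_size
    with io.BufferedWriter(raw, buffer_size=buffer_size) as writer:
        for _ in range(num_objects):
            writer.write(payload)
    return raw.calls, raw.bytes_written


small_calls, small_bytes = count_requests(1024, 1000, buffer_size=4 * 1024)
large_calls, large_bytes = count_requests(1024, 1000, buffer_size=1024 * 1024)
```

Both runs deliver the same bytes, but the larger buffer groups them into far fewer requests, which is the mechanism the runtimes above point to.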
Kai Fricke
61295f8b58
[ml/checkpoint] Fix checkpoint location on remote node (#23068)
Currently breaks tests where the checkpoint is stored on a remote node (e.g. via Ray client), e.g.: https://buildkite.com/ray-project/release-tests-branch/builds/132#6a4936a8-41dd-4fd2-9f02-976855cbd9b7
Instead, we can set the properties manually.
In the future, we need a story on how to refer to checkpoints kept on remote nodes.
2022-03-11 15:38:21 +00:00
Jialing He
0cbbb8c1d0
[runtime env][core] Use Proto message RuntimeEnvInfo between user code and core_worker (#22856) 2022-03-11 22:14:18 +08:00
Kai Fricke
aed17dd346
Revert "Revert "[ml/tune] Expose new checkpoint interface to users (#22741)" (#23006)" (#23009)
This reverts commit 85598d9d10.

Test breakage was unrelated.
2022-03-11 09:51:41 +00:00
Jialing He
0c5440ee72
[runtime env] Deletes the proto cache on RuntimeEnv (#22944)
Mainly the following things:
- This PR deletes the proto cache on RuntimeEnv, ensuring that the user's modification of RuntimeEnv can take effect in the Proto message.
- Validates the whole runtime env when serializing runtime_env.
- Overloads `__setitem__` to parse and validate a field when it is modified.
2022-03-11 15:37:18 +08:00
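The pattern described in #22944, validating on `__setitem__` and serializing from current state instead of a cache, can be sketched with a dict subclass. `ValidatedEnv`, its validators, and the JSON output are assumptions for illustration; Ray's RuntimeEnv serializes to a Proto message and supports many more fields.

```python
import json
from typing import Any, Callable, Dict


def _validate_pip(value: Any) -> list:
    if not isinstance(value, list) or not all(isinstance(p, str) for p in value):
        raise TypeError("pip must be a list of strings")
    return list(value)


def _validate_env_vars(value: Any) -> dict:
    if not isinstance(value, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in value.items()
    ):
        raise TypeError("env_vars must map strings to strings")
    return dict(value)


class ValidatedEnv(dict):
    """Hypothetical env dict that validates every field assignment."""

    _VALIDATORS: Dict[str, Callable[[Any], Any]] = {
        "pip": _validate_pip,
        "env_vars": _validate_env_vars,
    }

    def __init__(self, **initial: Any) -> None:
        super().__init__()
        for key, value in initial.items():
            self[key] = value  # route through __setitem__ for validation

    def __setitem__(self, key: str, value: Any) -> None:
        if key not in self._VALIDATORS:
            raise KeyError(f"unknown field: {key}")
        super().__setitem__(key, self._VALIDATORS[key](value))

    def serialize(self) -> str:
        # No cached copy: validate everything and rebuild on each call,
        # so later modifications always show up in the output.
        for key, value in self.items():
            self._VALIDATORS[key](value)
        return json.dumps(self, sort_keys=True)


env = ValidatedEnv(pip=["requests"])
before = env.serialize()
env["env_vars"] = {"MODE": "test"}
after = env.serialize()
```

Note that `dict.update` bypasses `__setitem__` on a subclass, so a fuller version would need to override it as well.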
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Amog Kamsetty
f80602b7d2
[Datasets] Separate pandas to torch conversion in to_torch (#22939)
Separate out the conversion of pandas dataframe to torch tensor in a utility function so that the same logic can be used in other places in Ray ML (for example during inference).
2022-03-10 20:40:01 -08:00
xwjiang2010
4b28bc3f09
[Tuner part1] Add Tuner interface. (#22975) 2022-03-10 19:55:59 -08:00
Siyuan (Ryans) Zhuang
0a9f966e63
[workflow] Convert DAG to workflow (#22925)
* convert DAG to a workflow

* deduplicate

* check duplication of steps

* add test for object refs
2022-03-10 19:40:14 -08:00
Eric Liang
148eaeac2e
[minor] Leave a bit of wiggle room when calculating shared memory max (#23034) 2022-03-10 17:37:26 -08:00
Amog Kamsetty
9bd00f3e1a
[ml/train] Remove ConvertibleToTrainable and move Trainer to ray.ml.trainer (#23030)
As discussed,

- Removes ConvertibleToTrainable interface and makes as_trainable part of the Trainer interface
- Moves Trainer interface to ray.ml.trainer from ray.ml.train.trainer
2022-03-10 15:24:58 -08:00
Edward Oakes
5a18802ad7
[serve] Remove runtime-env arg from serve start (#23017) 2022-03-10 15:15:59 -06:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Amog Kamsetty
a5f41b2c9f
[ml/train] Training Interfaces [1/4]: Ray AIR Trainer interface (#22980) 2022-03-10 13:12:44 -08:00
Guyang Song
3d9f214833
[runtime env] Fix import in subprocess when using pip in runtime_env (#22983)
Fix the issue https://github.com/ray-project/ray/issues/22968
2022-03-10 15:11:41 -06:00
xwjiang2010
b1496d235f
[tune] fix error handling for fail_fast case. (#22982) 2022-03-10 20:10:05 +00:00
Simon Mo
832354ce3f
[Serve] Compatibility bridge between model wrappers and pipeline (#22995) 2022-03-10 11:52:03 -08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Yi Cheng
bb5fa6b851
Remove redis in setup.py (#22979) 2022-03-10 11:05:03 -08:00
Archit Kulkarni
c78bd809ce
[job submission] Support local py_modules in jobs (#22843) 2022-03-10 11:42:25 -06:00
Stephanie Wang
85598d9d10
Revert "[ml/tune] Expose new checkpoint interface to users (#22741)" (#23006)
This reverts commit e9692a2a80.
2022-03-10 17:07:44 +00:00
shrekris-anyscale
1100c98222
[serve] Implement Serve Application object (#22917)
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
2022-03-10 10:28:29 -06:00
Jiajun Yao
2e828cc9e1
Delete dead test_setup_worker.py (#22970)
The tested code is dead so we can remove the code and the test.
2022-03-10 07:20:41 -08:00
Antoni Baum
bf49d37176
[tune] Add Trainable.postprocess_checkpoint (#22973)
Adds postprocess_checkpoint method to Trainable to facilitate the checkpointing of preprocessors in AIR.
2022-03-10 12:14:39 +00:00
Tao Wang
bc14512471
[Hotfix]Fix test_actor failure caused by interface change (#23000) 2022-03-10 19:34:12 +08:00
Kai Fricke
e9692a2a80
[ml/tune] Expose new checkpoint interface to users (#22741)
This PR exposes the new checkpoint interface, implemented in #22691, to end users. It does this by replacing the old external facing TrialCheckpoint class with a merged class that supports the old TrialCheckpoint API (upload, download, save) as well as the new Checkpoint API.

With this PR, users can use the new Checkpoint interface for downstream processing of their Ray Tune results. In a follow-up PR, the new Checkpoint interface will be used internally within Ray Tune and Train for bookkeeping; however, that is not required to unblock the Ray ML use case.
2022-03-10 10:20:24 +00:00
kyle-chen-uber
592656ca28
[horovod] remove deprecated slot concept, use worker instead (#22708)
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7

The major change is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This is currently blocking Uber DL trainers from using Ray Tune.

This commit updates the Horovod RayExecutor init args.

Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-03-10 08:16:42 +00:00
shrekris-anyscale
bc82e2d5c4
[serve] Restore "[serve] Support working_dir in serve run (#22760)" (#22971) 2022-03-09 21:31:23 -08:00
Dmitri Gekhtman
19b4281991
[KubeRay] Pin autoscaler image (#22987)
Sets the autoscaler image to the one from this PR's commit.
#22847
2022-03-09 20:38:37 -08:00
Dmitri Gekhtman
413fe08f87
Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847)
This PR consists of the following clean-up items for KubeRay autoscaler integration:

- Remove the docker/kuberay directory.
- Move the Python files formerly in docker/kuberay to the autoscaler directory.
- Use a rayproject/ray image for the autoscaler.
- Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
- Slightly simplify the code that starts the autoscaler.
- Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.
- By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.
- Add the autoscaler configuration test to the CI.
- Update development documentation to reflect the changes in this PR.
2022-03-09 18:26:57 -08:00