Commit graph

11649 commits

Author SHA1 Message Date
mwtian
aad6f41593
[Tune] Remove unused autogluon requirement (#16587)
`autogluon` does not support Python 3.9, and Ray does not seem to import it anywhere.
2022-03-11 16:54:23 -08:00
SangBin Cho
97383e4c1b
[Nightly test] Fix a broken nightly test due to the wrong config (#23097) 2022-03-11 16:47:06 -08:00
Amog Kamsetty
2294a7ed47
[ml] TorchPredictor interface (#22990) 2022-03-11 16:00:53 -08:00
Siyuan (Ryans) Zhuang
be7ccb7dac
[core][serialization] Fix registering serializer before initializing Ray. (#23031)
* Support registering serializer before initializing Ray.

* add test
2022-03-11 15:13:18 -08:00
Yi Cheng
4f86b5b523
[gcs] Remove use_gcs_for_bootstrap in core (python) and autoscaler (#23050)
This is part of the cleanup for Redisless Ray. This PR removes use_gcs_for_bootstrap in core and autoscaler.
2022-03-11 14:36:16 -08:00
Peng Yu
252ba6cecd
Correct documentation in ActorPoolStrategy (#23079) 2022-03-11 13:27:55 -08:00
Simon Mo
2f2fc97bd1
Don't symlink Serve in setup-dev (#23092) 2022-03-11 13:21:00 -08:00
Jeroen Bédorf
bc21a4593d
[RLlib] Fix crash when kl_coeff is set to 0 (#23063)
Co-authored-by: Jeroen Bédorf <jeroen@minds.ai>
Co-authored-by: Ishant Mrinal Haloi <mrinal.haloi11@gmail.com>
Co-authored-by: Ishant Mrinal <33053278+n30111@users.noreply.github.com>
2022-03-11 12:24:52 -08:00
Jian Xiao
e9ae784e62
Make schema() read non-disruptive to iter_datasets() (#23032)
Currently, reading the schema of a DatasetPipeline is disruptive and invalidates iter_datasets().
2022-03-11 12:01:24 -08:00
Patrick Ames
1d48c8dc75
[Datasets] Support dataset metadata provider callbacks in read APIs. (#22896)
These changes add Dataset Read API support for (1) specifying custom block metadata provider callbacks, and (2) skipping path expansion. When paired with a custom block metadata provider that maintains an in-memory cache of BlockMetadata for each input file path, these changes reduced average S3-based dataset read times for production [Redshift Manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) stored in Amazon's internal data catalog by over 90%.  A simple ParquetDatasource benchmark reading 144MM records across 100 ~70MiB (on-disk) Parquet files stored in S3 showed an ~75% reduction in read latency (from 4.62 seconds to 1.18 seconds on 2 r5n.8xlarge EC2 nodes).
2022-03-11 11:52:56 -08:00
xwjiang2010
5d776b00e6
[tuner] fix result_grid (#23078) 2022-03-11 11:34:44 -08:00
xwjiang2010
f270d84094
[AIR] switch to a common RunConfig. (#23076) 2022-03-11 10:55:36 -08:00
SangBin Cho
2b38fe89e2
[Nightly tests] Migrate rest of core tests (#23085)
Migrate the rest of the core tests.
2022-03-11 10:41:14 -08:00
Kai Fricke
04ea180dfb
[ci/release] Add "tiny" concurrency group, change limits (#23065)
For example, long-running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should therefore add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).
2022-03-11 10:19:38 -08:00
Stephanie Wang
28d597e009
Revert "[workflow] Convert DAG to workflow (#22925)" (#23081)
This reverts commit 0a9f966e63.
2022-03-11 09:49:08 -08:00
shrekris-anyscale
665bdbff47
[serve] Exclude unset fields from Ray actor options (#23059)
The `schema_to_deployment()` function preserves unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.
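
The idea of excluding unset fields can be sketched in plain Python. This is only an illustration: `build_actor_options` and the use of `None` as the "unset" marker are assumptions made for this sketch, not the actual `schema_to_deployment()` implementation.

```python
from typing import Any, Dict, Optional


def build_actor_options(**fields: Optional[Any]) -> Dict[str, Any]:
    """Return only the fields the caller actually set.

    Hypothetical helper: `None` stands for "unset" here, so unset fields are
    dropped instead of being forwarded with a placeholder default.
    """
    return {name: value for name, value in fields.items() if value is not None}


# Only num_cpus was set, so it is the only key that survives.
options = build_actor_options(num_cpus=1, num_gpus=None, resources=None)
print(options)

# With nothing set, the result is an empty dict rather than a dict of Nones.
print(build_actor_options(num_gpus=None))
```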
2022-03-11 10:45:21 -06:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address (#23067)
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to an S3 bucket resulted in the following runtimes.

Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size.

For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.

Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742


Together these changes enable faster large-scale object spilling.
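
Why a larger buffer reduces the number of remote I/O requests can be sketched with the standard library alone. This is an illustration, not Ray's spilling code: `CountingSink` and `spill` are hypothetical names, and the sink is an in-memory stand-in for a remote store that only counts the write requests it receives.

```python
import io


class CountingSink(io.RawIOBase):
    """Stand-in for a remote store (assumption): counts write requests."""

    def __init__(self):
        super().__init__()
        self.requests = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, data):
        # Each call models one remote I/O request.
        self.requests += 1
        self.bytes_written += len(data)
        return len(data)


def spill(sink, chunks, buffer_size):
    """Write all chunks to the sink through a buffer of the given size."""
    writer = io.BufferedWriter(sink, buffer_size=buffer_size)
    for chunk in chunks:
        writer.write(chunk)
    writer.flush()
    writer.detach()  # leave the sink open so its counters can be inspected


chunks = [b"x" * 1024] * 1000  # 1000 objects of 1 KiB each

small = CountingSink()
spill(small, chunks, buffer_size=1024)

large = CountingSink()
spill(large, chunks, buffer_size=1024 * 1024)

# Same payload in both cases, but far fewer requests with the 1MB buffer.
print(small.requests, large.requests)
```

The request counts depend on the buffer implementation, so the sketch only compares them rather than asserting exact values.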

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
Kai Fricke
61295f8b58
[ml/checkpoint] Fix checkpoint location on remote node (#23068)
Currently breaks tests where the checkpoint is stored on a remote node (e.g. via Ray client), e.g.: https://buildkite.com/ray-project/release-tests-branch/builds/132#6a4936a8-41dd-4fd2-9f02-976855cbd9b7
Instead, we can set the properties manually.
In the future, we need a story on how to refer to checkpoints kept on remote nodes.
2022-03-11 15:38:21 +00:00
Jialing He
0cbbb8c1d0
[runtime env][core] Use Proto message RuntimeEnvInfo between user code and core_worker (#22856) 2022-03-11 22:14:18 +08:00
SangBin Cho
965d609627
[Nightly test] Fix a minor syntax issue for core nightly tests (#23069)
Add frequency to smoke tests
Remove unnecessary alerts
2022-03-11 04:58:40 -08:00
Kai Fricke
5b2d58674b
[ci/release] Migrate horovod tests (#22951)
Migrating horovod tests to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/125
2022-03-11 09:53:29 +00:00
Kai Fricke
aed17dd346
Revert "Revert "[ml/tune] Expose new checkpoint interface to users (#22741)" (#23006)" (#23009)
This reverts commit 85598d9d10.

Test breakage was unrelated.
2022-03-11 09:51:41 +00:00
Tao Wang
10c03cb126
Migrating to flat hash map [GCS&util&common] (#22932)
Next step of #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util and common modules.
The placement group part, which exposes user interfaces in Java/Python, is excluded because it is a little more complicated.

Follow-up PRs will migrate the core worker, placement group, and other modules.
2022-03-11 18:35:06 +09:00
Yi Cheng
ec88eb7d1d
[4][resource reporting] Remove ray syncer from gcs_resource_manager (#22832)
This PR is part of resource reporting refactoring. In this PR ray syncer is moved from gcs_resource_manager to gcs_placement_group_scheduler. With this one, gcs_resource_manager is totally decoupled from resource broadcasting.
2022-03-11 01:15:25 -08:00
Jialing He
0c5440ee72
[runtime env] Deletes the proto cache on RuntimeEnv (#22944)
Mainly the following things:
- This PR deletes the proto cache on RuntimeEnv, ensuring that the user's modification of RuntimeEnv can take effect in the Proto message.
- Validates the whole runtime env when serializing runtime_env.
- Overrides `__setitem__` to parse and validate a field whenever it is modified.
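
The pattern described in the list above (validate on every `__setitem__`, serialize from current state instead of a cache) can be sketched as follows. This is a minimal illustration under stated assumptions: `ValidatedEnv`, its two fields, and JSON as the serialized form are hypothetical and are not the actual RuntimeEnv class or its Proto message.

```python
import json


class ValidatedEnv(dict):
    """Hypothetical dict-based env that validates each field as it is set."""

    # Assumed field checks, for illustration only.
    _CHECKS = {
        "pip": lambda v: isinstance(v, list) and all(isinstance(p, str) for p in v),
        "working_dir": lambda v: isinstance(v, str),
    }

    def __init__(self, **fields):
        super().__init__()
        for key, value in fields.items():
            self[key] = value  # route construction through __setitem__

    def __setitem__(self, key, value):
        check = self._CHECKS.get(key)
        if check is None:
            raise KeyError(f"unknown field: {key!r}")
        if not check(value):
            raise TypeError(f"invalid value for {key!r}: {value!r}")
        super().__setitem__(key, value)

    def serialize(self):
        # Built from the current contents on every call (no cache), so later
        # modifications are always reflected in the serialized form.
        return json.dumps(self, sort_keys=True)


env = ValidatedEnv(pip=["requests"])
before = env.serialize()
env["working_dir"] = "/tmp/project"
after = env.serialize()
print(before)
print(after)
```

Note that `dict.update()` does not go through `__setitem__`, so a fuller version would need to override it as well.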
2022-03-11 15:37:18 +08:00
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Amog Kamsetty
f80602b7d2
[Datasets] Separate pandas to torch conversion in to_torch (#22939)
Separate out the conversion of pandas dataframe to torch tensor in a utility function so that the same logic can be used in other places in Ray ML (for example during inference).
2022-03-10 20:40:01 -08:00
xwjiang2010
4b28bc3f09
[Tuner part1] Add Tuner interface. (#22975) 2022-03-10 19:55:59 -08:00
Siyuan (Ryans) Zhuang
0a9f966e63
[workflow] Convert DAG to workflow (#22925)
* convert DAG to a workflow

* deduplicate

* check duplication of steps

* add test for object refs
2022-03-10 19:40:14 -08:00
Eric Liang
148eaeac2e
[minor] Leave a bit of wiggle room when calculating shared memory max (#23034) 2022-03-10 17:37:26 -08:00
Amog Kamsetty
9bd00f3e1a
[ml/train] Remove ConvertibleToTrainable and move Trainer to ray.ml.trainer (#23030)
As discussed,

- Removes ConvertibleToTrainable interface and makes as_trainable part of the Trainer interface
- Moves Trainer interface to ray.ml.trainer from ray.ml.train.trainer
2022-03-10 15:24:58 -08:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
Edward Oakes
5a18802ad7
[serve] Remove runtime-env arg from serve start (#23017) 2022-03-10 15:15:59 -06:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Amog Kamsetty
a5f41b2c9f
[ml/train] Training Interfaces [1/4]: Ray AIR Trainer interface (#22980) 2022-03-10 13:12:44 -08:00
Guyang Song
3d9f214833
[runtime env] Fix import in subprocess when using pip in runtime_env (#22983)
Fix the issue https://github.com/ray-project/ray/issues/22968
2022-03-10 15:11:41 -06:00
Max Pumperla
2b8faae40c
[docs] re/move old core examples (#22802) 2022-03-10 12:17:00 -08:00
xwjiang2010
b1496d235f
[tune] fix error handling for fail_fast case. (#22982) 2022-03-10 20:10:05 +00:00
Simon Mo
832354ce3f
[Serve] Compatibility bridge between model wrappers and pipeline (#22995) 2022-03-10 11:52:03 -08:00
Chen Shen
3ebc4ae289
fix comments and typo (#23008)
Fix comments and typos for scheduler code.
2022-03-10 11:40:31 -08:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
Yi Cheng
9f275c9bb8
[3][resource reporting] Use GCS to report the placement group creation information instead of reporting by raylet (#22597) 2022-03-10 11:08:21 -08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Yi Cheng
bb5fa6b851
Remove redis in setup.py (#22979) 2022-03-10 11:05:03 -08:00
Archit Kulkarni
c78bd809ce
[job submission] Support local py_modules in jobs (#22843) 2022-03-10 11:42:25 -06:00
Stephanie Wang
85598d9d10
Revert "[ml/tune] Expose new checkpoint interface to users (#22741)" (#23006)
This reverts commit e9692a2a80.
2022-03-10 17:07:44 +00:00
SangBin Cho
92b50ff5da
Migrate multi nightly tests (#23005) 2022-03-11 01:32:10 +09:00
shrekris-anyscale
1100c98222
[serve] Implement Serve Application object (#22917)
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
2022-03-10 10:28:29 -06:00
Max Pumperla
d8e862eaba
[docs] templates and contribution guide (fixes #21753) (#23003)
Adding an explicit contributor guide and example templates for our users to help with docs.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-10 15:28:07 +00:00