Commit graph

1912 commits

Author SHA1 Message Date
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying only the buffer size while spilling to an S3 bucket produced the following runtimes.

Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with more available memory, a larger buffer size is worth configuring.

For processes that reach S3's throughput limits, we can remove that bottleneck by spilling across more prefixes/buckets. The effect is less noticeable because the performance gains from a large buffer usually keep us from reaching that bottleneck in the first place. The following runtimes were achieved by spilling 512GB with a 1MB buffer while varying the number of prefixes.

Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742


Together these changes enable faster large-scale object spilling.
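
As a rough sketch of how the buffer size and multiple prefixes described above might be configured, the snippet below builds an object spilling config as a JSON string. The key names (`type`, `params`, `uri`, `buffer_size`) and the `smart_open` spilling type are assumed to match Ray's object spilling documentation and should be checked against the Ray version in use. The bucket paths and the helper name `build_spilling_config` are hypothetical.

```python
import json


def build_spilling_config(uris, buffer_size_bytes=1024 * 1024):
    """Return a JSON spilling config string (hypothetical helper, assumed key names)."""
    if not uris:
        raise ValueError("at least one URI is required")
    return json.dumps(
        {
            "type": "smart_open",
            # A list of URIs spreads spilled objects across several prefixes/buckets.
            "params": {"uri": list(uris)},
            # A larger buffer means fewer remote I/O requests per spilled byte.
            "buffer_size": buffer_size_bytes,
        }
    )


# Placeholder bucket paths, with a 10MB write buffer.
spilling_config = build_spilling_config(
    ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"],
    buffer_size_bytes=10 * 1024 * 1024,
)

# Assumed usage (not executed here):
# ray.init(_system_config={"object_spilling_config": spilling_config})
```

The helper only assembles the config; passing it to `ray.init` is left as a comment because the exact system config keys may differ between releases.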

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Max Pumperla
2b8faae40c
[docs] re/move old core examples (#22802) 2022-03-10 12:17:00 -08:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Max Pumperla
d8e862eaba
[docs] templates and contribution guide (fixes #21753) (#23003)
Adding an explicit contributor guide and example templates for our users to help with docs.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-10 15:28:07 +00:00
Dmitri Gekhtman
413fe08f87
Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847)
This PR consists of the following clean-up items for KubeRay autoscaler integration:

- Remove the docker/kuberay directory.
- Move the Python files formerly in docker/kuberay to the autoscaler directory.
- Use a rayproject/ray image for the autoscaler.
- Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
- Slightly simplify the code that starts the autoscaler.
- Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.
- By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.
- Add the autoscaler configuration test to the CI.
- Update development documentation to reflect the changes in this PR.
2022-03-09 18:26:57 -08:00
Alex Wu
b84aaef38a
Promote python 3.9 support to stable (#22923)
Remove the experimental note from Python 3.9, since it and its core dependencies have been stable for quite some time now.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-03-08 17:24:54 -08:00
Eric Liang
52491c87e2
Make a pass fixing Dataset API issues (#22886) 2022-03-08 13:07:55 -08:00
Max Pumperla
d6bff736f3
[docs] test ray.io snippets (#22822)
Tests all snippets we have on ray.io. There were some minor issues, which I'll fix upstream.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-08 15:50:57 +00:00
Stephanie Wang
cb218d03b9
[core] Enable lineage reconstruction by default (#22816)
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.

Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
2022-03-07 17:40:30 -05:00
Max Pumperla
b609bdf898
[docs] Improve connection between library references and their APIs (#22800)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-04 16:48:03 +01:00
Antoni Baum
283666fe02
[docs] Update XGBoost/LightGBM-Ray docs (#22783)
Brings the docs up to date with XGBoost/LightGBM-Ray readmes.
2022-03-03 18:02:43 +01:00
Archit Kulkarni
e937f1a3c4
[runtime env] [Doc] add more details about runtime env logs (#22480)
Clarifies the logging behavior for runtime envs, and adds the runtime env logs file to the list of log files in the main logging page.
2022-03-02 14:27:28 -08:00
Max Pumperla
d53d0e0f50
[docs] Typo - fixes #22761 (#22763)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:34:46 +01:00
Max Pumperla
7d4296c72f
run code in browser (#22727)
Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand.
If this seems useful we can extend this to other examples gradually.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:27:00 +01:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Eric Liang
e228544d39
Undo revert of windowing dataset by bytes (#22735) 2022-03-01 12:24:04 -08:00
Kenneth
9b67cb5a6f
Add buffering to object spilling (#22618)
This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude.
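
The effect of buffering on the number of underlying writes can be illustrated with a small standalone sketch. It uses Python's `io.BufferedWriter` rather than Ray's spilling code, and `CountingSink` and `count_writes` are hypothetical names introduced only for this illustration.

```python
import io


class CountingSink(io.RawIOBase):
    """Raw stream that discards data but counts write calls (stand-in for a slow disk)."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, data):
        self.write_calls += 1
        self.bytes_written += len(data)
        return len(data)


def count_writes(chunks, buffer_size=None):
    """Write chunks to a CountingSink, optionally through a buffer, and return the sink."""
    sink = CountingSink()
    writer = sink if buffer_size is None else io.BufferedWriter(sink, buffer_size=buffer_size)
    for chunk in chunks:
        writer.write(chunk)
    writer.flush()  # push any remaining buffered bytes down to the sink
    return sink


chunks = [b"x" * 512] * 2000  # many small writes, roughly 1MB in total
unbuffered = count_writes(chunks)
buffered = count_writes(chunks, buffer_size=1024 * 1024)
print(unbuffered.write_calls, buffered.write_calls)
```

Both runs deliver the same bytes, but the buffered run reaches the sink in far fewer calls, which is the property that matters on devices where each write is expensive.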

### Performance Changes
A microbenchmark produced 500KB objects (which were then spilled) and consumed them, to observe changes in object fusing/spilling.

| Run | Produce (s) | Consume (s) | Total (s) |
| -- | -- | -- | -- |
| Baseline (original) | 347.332281 | 355.611272 | 705.560750 |
| Baseline (w/ fix) | 181.815852 | 347.692850 | 532.847759 |
| No fusing (original) | 453.574554 | 525.047998 | 981.620108 |
| No fusing (w/ fix) | 452.614848 | 519.787698 | 975.412639 |

The baseline runs should be notably faster because object fusing reduces the number of I/O requests. With the fix, Ray's defaults cut the produce time in this microbenchmark by about 48% (from roughly 347s to 182s), with negligible impact on runtime when fusing is disabled.
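
The percentages can be rechecked from the table with a few lines of arithmetic. This is only a sanity check on the reported numbers. The helper name `percent_reduction` is introduced here for illustration, and the result suggests that the 48% figure corresponds to the produce phase of the baseline run.

```python
# Timings in seconds, copied from the table above: (produce, consume, total).
runs = {
    "baseline": {
        "original": (347.332281, 355.611272, 705.560750),
        "fixed": (181.815852, 347.692850, 532.847759),
    },
    "no_fusing": {
        "original": (453.574554, 525.047998, 981.620108),
        "fixed": (452.614848, 519.787698, 975.412639),
    },
}


def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


for name, run in runs.items():
    produce = percent_reduction(run["original"][0], run["fixed"][0])
    total = percent_reduction(run["original"][2], run["fixed"][2])
    print(f"{name}: produce time -{produce:.1f}%, total time -{total:.1f}%")
```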

See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-01 10:13:10 -08:00
Stephanie Wang
73f078236f
[doc] Update docs about actor garbage collection (#20763)
Update outdated actor docs about when actors are GCed.
2022-02-28 18:45:29 -08:00
Jiaxin Shan
32829ff9ad
[KubeRay] Provide a new Dockerfile for fast build (#22689)
Adds a new Dockerfile for fast build and development of KubeRay.
2022-02-28 17:09:16 -08:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00
SangBin Cho
ba4f1423c7
Revert "Support creating a DatasetPipeline windowed by bytes (#22577)" (#22695)
This reverts commit b5b4460932.
2022-02-28 11:56:12 -08:00
Jialing He
98a69cbd90
[runtime env][strong-typed API] Combine ParsedRuntimeEnv and RuntimeEnv into ray.runtime.RuntimeEnv (#22522)
Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv`. See #21495 for details.

- The new `RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and the old `RuntimeEnv`.
- The new `RuntimeEnv` will be exposed directly to the user.
- Example:
```python
import ray.runtime_env

# The URIs below are placeholders.
runtime_env = ray.runtime_env.RuntimeEnv(
    working_dir="s3://working_dir.zip",
    pip=["requests"],
    java_jars=["s3://jar1.zip"],
    java_jvm_options=["-Dxxx=xxx"],
)
```
2022-02-28 16:18:10 +08:00
Max Pumperla
372c620f58
[docs] Tune overhaul part II (#22656)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-02-26 23:07:34 -08:00
Eric Liang
b5b4460932
Support creating a DatasetPipeline windowed by bytes (#22577) 2022-02-25 23:31:10 -08:00
Antoni Baum
d5284a740c
[tune] Remove Trainable.update_resources (#22471) 2022-02-25 08:38:34 -08:00
xwjiang2010
d4a1bc7bc7
Revert "[runtime env] runtime env inheritance refactor (#22244)" (#22626)
Breaks train_torch_linear_test.py.
2022-02-25 08:42:30 -06:00
Eric Liang
533a0440a6
Improve actor pool support in Datasets (#22574) 2022-02-24 12:01:36 -08:00
jon-chuang
11500dc12c
[docs] include ray status and ray monitor into ray command line api docs (#22614)
Fixes: https://github.com/ray-project/ray/issues/18527
2022-02-23 20:09:45 -08:00
Amog Kamsetty
80e0d9cea4
[Train] Update docs for ray.train.torch import (#22555)
Update more examples to include the ray.train.torch import line. Follow up to #21969
2022-02-23 19:22:27 -08:00
Edward Oakes
5a21289a34
[runtime_env] Remove get_current_runtime_env from docs (#22594)
We should just encourage people to use the existing `get_runtime_context` API instead of introducing a new one here. Just removing the docs for now while we discuss this.
2022-02-23 16:53:52 -06:00
Archit Kulkarni
87f7bfe4cd
[doc] [job submission] Add k8s instructions and a comment about ports (#22598) 2022-02-23 16:32:37 -06:00
Sven Mika
8e00537b65
[RLlib] SlateQ: framework=tf fixes and SlateQ documentation update (#22543) 2022-02-23 13:03:45 +01:00
mwtian
9a157dfe82
[GCS-Ray] update doc and error message for GCS-Ray (#22528)
Update documentation to reflect that Ray no longer starts Redis by default.
2022-02-22 17:56:30 -08:00
Dmitri Gekhtman
a402e956a4
[KubeRay] Format autoscaling config based on RayCluster CR (#22348)
Closes #21655. At the start of each autoscaler iteration, we read the Ray Cluster CR from K8s and use it to extract the autoscaling config.
2022-02-22 11:06:37 -08:00
Antoni Baum
4a15c6f8f3
[tune] Preparation for deadline schedulers (#22006) 2022-02-22 11:05:28 -08:00
Guyang Song
5783cdb254
[runtime env] runtime env inheritance refactor (#22244)
Runtime Environments has been GA since Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). We already support an [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior, as follows (copied from the doc):
- The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment.
- Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent.

We think this runtime env merging logic is complex and confusing because users cannot know the final runtime env before their jobs run.

This PR refactors runtime environment inheritance and changes its behavior. The new behavior is:
- **If no runtime env option is given when an actor is created, inherit the parent runtime env.**
- **Otherwise, use the given runtime env directly, without merging.**
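
The difference between the old and new inheritance rules can be sketched with plain dictionaries. This is an illustrative model, not Ray's implementation: the function names `merge_env_old` and `resolve_env_new` are hypothetical, and "no runtime env option" is assumed to mean the child passes `None`.

```python
from typing import Optional


def merge_env_old(parent: dict, child: Optional[dict]) -> dict:
    """Old rule: env_vars are merged, every other field is overridden by the child."""
    child = child or {}
    merged = {**parent, **child}
    if "env_vars" in parent or "env_vars" in child:
        merged["env_vars"] = {**parent.get("env_vars", {}), **child.get("env_vars", {})}
    return merged


def resolve_env_new(parent: dict, child: Optional[dict]) -> dict:
    """New rule: inherit the parent only when the child specifies nothing."""
    return dict(parent) if child is None else dict(child)


parent = {"env_vars": {"A": "1"}, "py_modules": ["parent_mod"]}
child = {"env_vars": {"B": "2"}}

old_result = merge_env_old(parent, child)    # parent and child fields combined
new_result = resolve_env_new(parent, child)  # child used as-is
inherited = resolve_env_new(parent, None)    # nothing specified, parent inherited
```

Under the new rule the final environment is always either the parent's or the child's, so it can be known without tracing a merge.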

Add a new API named `ray.runtime_env.get_current_runtime_env()` that returns the parent runtime env as a dict, which you can then modify yourself. For example:
```Actor.options(runtime_env={**ray.runtime_env.get_current_runtime_env(), "X": "Y"})```
This new API can also be used with Ray Client.
2022-02-21 18:13:22 +08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Archit Kulkarni
8c12e30f11
[Doc] Add actor max restarts default value to fault tolerance doc (#22481) 2022-02-18 17:48:22 -06:00
Max Pumperla
9482f03134
[docs] RLlib concepts consolidation, user guide, RL conf prep (#22496) 2022-02-18 09:35:20 -08:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Ian Rodney
c9a4b17f99
[YAMLs] Fix comments about autoscaler round-robining (#22002) 2022-02-17 13:59:05 -08:00
Sven Mika
e03606f0b3
[RLlib] Bandit documentation enhancements. (#22427) 2022-02-17 13:25:50 +01:00
Qing Wang
7c45d1a366
[doc][Java] Add doc page for java concurrency group. (#21600)
Add document page for Java concurrency group.

Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-02-16 17:57:03 +08:00
Simon Mo
495221e7d2
[Doc] Update Serve logo for tune user guide (#22369)
We have deprecated the old logo.
2022-02-15 12:10:08 -06:00
Hao Chen
78597d3089
[train] Minor fixes on Ray Train user guide doc (#22379)
Fixes some typos and format issues.
2022-02-15 10:09:27 -08:00
Jun Gong
b729a9390f
[RLlib] Add example commands for using setup-dev.py with RLlib for improved dev setup stability and developer experience. (#22380) 2022-02-15 12:00:36 +01:00