Commit graph

1938 commits

Author SHA1 Message Date
Archit Kulkarni
16fd099b8b
[runtime env] Change pip_check default from True to False (#23306)
@SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more, I think `False` is a more reasonable default for `pip_check`. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up. Another, less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.

This also matches the default behavior of pip, which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather, the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR").
2022-03-18 12:51:41 -05:00
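Editorial sketch for #23306 above: the snippet below shows, under stated assumptions, what the `pip_check` setting controls. The `runtime_env` layout (a `pip` dict with `packages` and `pip_check` keys) is assumed from the PR titles in this log, the package name is a placeholder, and `environment_is_consistent` is a hypothetical helper that only wraps the `python -m pip check` command quoted in the message.

```python
import subprocess
import sys

# Assumed runtime_env shape; "requests" is only a placeholder package.
runtime_env = {
    "pip": {
        "packages": ["requests"],
        # Field name assumed from the PR title; False skips the `pip check` step.
        "pip_check": False,
    }
}


def environment_is_consistent() -> bool:
    """Hypothetical helper: True if `python -m pip check` exits with code 0."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "check"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# With a running cluster, the dict would be passed along the lines of
# ray.init(runtime_env=runtime_env); that call is left out here.
```

An environment where the helper returns `False` is the case the commit message describes: packages import and work, yet a strict check would reject the environment.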
shrekris-anyscale
86169d2452
[docs] Fix malformatted list in "Advanced Pattern: Fault Tolerance with Actor Checkpointing" (#23319) 2022-03-18 10:50:13 -07:00
Eric Liang
08dc31e747
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
Guyang Song
1ad019aac3
[C++ API][Doc] Add doc and error log to notice C++ API is not supported on Windows (#23272)
The C++ API is not supported on Windows at all for now.
2022-03-18 10:52:57 +08:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
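Editorial sketch for #22749 above: a toy, single-process illustration of the idea described in the message (partition records by a sort key, then locate a record with binary search). It is not Ray's implementation; `SortedPartitionIndex`, `get`, and this `multiget` are hypothetical names, and there are no actors or zero-copy blocks here.

```python
import bisect


class SortedPartitionIndex:
    """Toy stand-in: sorted blocks plus binary search for key lookups."""

    def __init__(self, records, key, num_partitions):
        ordered = sorted(records, key=lambda r: r[key])
        # Ceiling division so every record lands in some block.
        size = max(1, -(-len(ordered) // num_partitions))
        self._blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        self._block_keys = [[r[key] for r in block] for block in self._blocks]
        # Largest key in each block; used to pick the block for a lookup.
        self._upper_bounds = [keys[-1] for keys in self._block_keys]

    def get(self, k):
        """Return the record whose key equals k, or None if absent."""
        b = bisect.bisect_left(self._upper_bounds, k)
        if b == len(self._blocks):
            return None
        keys = self._block_keys[b]
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            return self._blocks[b][i]
        return None

    def multiget(self, ks):
        """Look up several keys; missing keys map to None."""
        return [self.get(k) for k in ks]


records = [{"id": i, "value": i * i} for i in range(100)]
index = SortedPartitionIndex(records, key="id", num_partitions=4)
print(index.get(7), index.multiget([0, 99, -1]))
```

Each lookup costs one binary search over the block boundaries and one inside the chosen block, which is why the sorted layout makes random access cheap.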
Archit Kulkarni
684a1821d3
[Doc] [runtime_env] Add limitation about single-file py_modules to doc (#23248)
Until #23151 is fixed, this PR adds it as a known limitation in the documentation.
2022-03-17 16:23:46 -05:00
Simon Mo
f400b4333a
[Serve] Remove legacy pipeline codebase (#23172) 2022-03-17 13:27:16 -07:00
Jian Xiao
8c9e3f6c2e
Move the third-party data integrations (non-Dataset stuff) out of the user guides which is for Dataset (#23162)
Improve documentation of Ray Dataset.
2022-03-17 11:27:40 -07:00
Eric Liang
c8f207f746
[docs] Core docs refactor (#23216)
This PR makes a number of major overhauls to the Ray core docs:

- Add a key-concepts section for {Tasks, Actors, Objects, Placement Groups, Env Deps}.
- Re-org the user guide to align with key concepts.
- Rewrite the walkthrough to link to mini-walkthroughs in the key concept sections.
- Minor tweaks and additional transition material.
2022-03-17 11:26:17 -07:00
Balaji Veeramani
83986a4d83
[Train] Add support for automatic mixed precision (#22227)
Closes #20643

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-19.us-west-2.compute.internal>
2022-03-16 20:53:02 -07:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap is the case where the local Python module is a wheel (a `.whl` file). This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
Max Pumperla
71c57c619b
[docs] RLlib broken links (fixes #23160) (#23226) 2022-03-16 12:38:18 +01:00
Kai Fricke
b80f79a072
[ci/multinode] Improve multi-node tests (#23196)
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
2022-03-16 09:59:50 +00:00
Eric Liang
678d23fe42
Remove beta label from Datasets (#23220) 2022-03-15 23:05:59 -07:00
Jian Xiao
10435d2d8f
Update dask version for Ray 1.12.0 (#23197) 2022-03-15 19:22:19 -07:00
Jiaxin Shan
158ff3394f
[Job submission] Improve job submission docs (#23115)
I was following the job submission docs at https://docs.ray.io/en/latest/cluster/job-submission.html and ran some examples.

I noticed a few minor issues:

1. Some required libraries are not imported in any of the code snippets.
2. The get-job API returns `{'status': 'SUCCEEDED'}` instead of `job_status`, so the code snippet at https://docs.ray.io/en/latest/cluster/job-submission.html#rest-api doesn't work.
2022-03-15 21:20:33 -05:00
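Editorial sketch for #23115 above: a minimal polling loop that reads the `status` key, matching the `{'status': 'SUCCEEDED'}` response shape quoted in the message. The `fetch` callable stands in for whatever performs the real HTTP request, the helper name `wait_for_job` is hypothetical, and the set of terminal states is an assumption.

```python
import time

# Assumed terminal states; only SUCCEEDED is confirmed by the message above.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(fetch, timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """Poll fetch() until its 'status' field is terminal, then return it.

    fetch is any zero-argument callable returning a dict such as
    {'status': 'RUNNING'}; in practice it would wrap the REST call.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch()["status"]  # key is 'status', not 'job_status'
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} seconds")
        sleep(poll_s)


# Usage with a fake endpoint that finishes on the third poll.
fake_responses = iter([{"status": "PENDING"}, {"status": "RUNNING"}, {"status": "SUCCEEDED"}])
final_status = wait_for_job(lambda: next(fake_responses), sleep=lambda _: None)
print(final_status)
```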
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00
Eric Liang
ca1100397e
Update paper links to include exoshuffle and remove whitepaper (moved to docs) (#23099) 2022-03-15 13:12:01 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows users to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
2022-03-15 11:07:34 -07:00
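Editorial sketch for #22851 above: the kind of seeding steps the linked PyTorch randomness guide recommends. This is not the code added by the PR; `seed_everything` is a hypothetical helper, and NumPy and PyTorch are treated as optional so the sketch also runs where they are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common sources of randomness (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; the stdlib seed above still applies
    torch.manual_seed(seed)
    # Ask PyTorch to raise an error on operations without a deterministic variant.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which may pick different kernels per run.
    torch.backends.cudnn.benchmark = False


seed_everything(0)
first = [random.random() for _ in range(3)]
seed_everything(0)
second = [random.random() for _ in range(3)]
print(first == second)
```

Deterministic settings can slow training down, which is presumably why such a feature is opt-in rather than the default.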
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Max Pumperla
ad30123339
[docs] fix includes for md files (#23180)
Included content for md files, such as our central getting started page, didn't render. Fixed here.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-15 11:09:18 +00:00
Pamphile Roy
81b17669a4
[core][docs] Document port/IP binding and slurm concerns (#22663)
Using Ray on a SLURM system is documented, but the docs are missing some networking pitfalls. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later).

I did not put any real recommendation on this last point since `--address` did not work. I got a "cannot resolve" error after setting an internal IP, although it's reachable.
2022-03-15 01:43:46 -07:00
Jules S. Damji
0246f3532e
[DOC] Added a full example how to access deployments (#22401) 2022-03-14 21:15:52 -05:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
Jiaxin Shan
8823ca48b4
[Workflow] Improve workflow docs (#23114)
* [Workflow] Improve workflow docs

* Update doc/source/workflows/concepts.rst

Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
2022-03-13 18:55:45 -07:00
Scott Graham
f673acb0ad
Scgraham/azure docs (#22296)
Fixes a potential error if a function is not found in the Azure SDK when deploying a Ray cluster on Azure.
Adds an additional Python package needed to deploy a Ray cluster on Azure to the docs.

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-03-13 18:08:08 -07:00
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to an S3 bucket resulted in the following runtimes.

| Buffer Size | Runtime (s) |
| -- | -- |
| Default | 3221.865916 |
| 256KB | 1758.885839 |
| 1MB | 748.226089 |
| 10MB | 526.406466 |
| 100MB | 494.830513 |

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size.

For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.

| Prefixes | Runtime (s) |
| -- | -- |
| 1 | 748.226089 |
| 3 | 527.658646 |
| 10 | 516.010742 |


Together these changes enable faster large-scale object spilling.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
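Editorial sketch for #22798 above: a small, local illustration of why buffering reduces the number of I/O requests. It does not touch S3 or Ray's spilling code; `CountingSink` and `spill` are hypothetical names, and the sink merely counts how many times the underlying "storage" is written to.

```python
import io


class CountingSink(io.RawIOBase):
    """Fake storage backend that counts write requests instead of storing data."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total_bytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.total_bytes += len(b)
        return len(b)


def spill(objects, buffer_size):
    """Write all objects through a buffer; return (request count, bytes written)."""
    sink = CountingSink()
    writer = io.BufferedWriter(sink, buffer_size=buffer_size)
    for obj in objects:
        writer.write(obj)
    writer.flush()
    return sink.calls, sink.total_bytes


objects = [b"x" * 1024 for _ in range(1000)]  # 1000 objects of 1 KiB each
small_calls, small_bytes = spill(objects, buffer_size=4 * 1024)
large_calls, large_bytes = spill(objects, buffer_size=1024 * 1024)
print(small_calls, large_calls)
```

The same bytes reach the sink in both runs; only the number of requests changes, which is the effect the runtime tables above measure against real remote storage.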
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Max Pumperla
2b8faae40c
[docs] re/move old core examples (#22802) 2022-03-10 12:17:00 -08:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Max Pumperla
d8e862eaba
[docs] templates and contribution guide (fixes #21753) (#23003)
Adding an explicit contributor guide and example templates for our users to help with docs.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-10 15:28:07 +00:00
Dmitri Gekhtman
413fe08f87
Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847)
This PR consists of the following clean-up items for KubeRay autoscaler integration:

- Remove the docker/kuberay directory.
- Move the Python files formerly in docker/kuberay to the autoscaler directory.
- Use a rayproject/ray image for the autoscaler.
- Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
- Slightly simplify the code that starts the autoscaler.
- Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.
- By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.
- Add the autoscaler configuration test to the CI.
- Update development documentation to reflect the changes in this PR.
2022-03-09 18:26:57 -08:00
Alex Wu
b84aaef38a
Promote python 3.9 support to stable (#22923)
Remove the experimental note from Python 3.9, since it and its core dependencies have been stable for quite some time now.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-03-08 17:24:54 -08:00
Eric Liang
52491c87e2
Make a pass fixing Dataset API issues (#22886) 2022-03-08 13:07:55 -08:00
Max Pumperla
d6bff736f3
[docs] test ray.io snippets (#22822)
Tests all snippets we have on ray.io. There were some minor issues, which I'll fix upstream.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-08 15:50:57 +00:00
Stephanie Wang
cb218d03b9
[core] Enable lineage reconstruction by default (#22816)
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.

Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
2022-03-07 17:40:30 -05:00
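Editorial sketch for #22816 above: a toy model of lineage reconstruction, in which a lost task output is recomputed from the recorded task and its inputs. It is not Ray's implementation; `LineageStore` and its methods are hypothetical, and the sketch assumes tasks are deterministic.

```python
class LineageStore:
    """Toy object store that remembers how each object was produced."""

    def __init__(self):
        self._values = {}    # object id -> value currently held
        self._lineage = {}   # object id -> (function, dependency ids)
        self.reconstructions = 0

    def submit(self, object_id, fn, *dep_ids):
        """Run fn on its dependencies and record the lineage of the result."""
        self._lineage[object_id] = (fn, dep_ids)
        self._values[object_id] = fn(*(self.get(d) for d in dep_ids))

    def evict(self, object_id):
        """Simulate losing an object, for example when a node fails."""
        self._values.pop(object_id, None)

    def get(self, object_id):
        """Return the value, recomputing it (and lost inputs) when missing."""
        if object_id not in self._values:
            fn, dep_ids = self._lineage[object_id]
            self.reconstructions += 1
            self._values[object_id] = fn(*(self.get(d) for d in dep_ids))
        return self._values[object_id]


store = LineageStore()
store.submit("a", lambda: 2)
store.submit("b", lambda x: x + 1, "a")
store.evict("a")
store.evict("b")
recovered = store.get("b")  # recomputes "a" first, then "b"
print(recovered, store.reconstructions)
```

Recomputing "a" while recovering "b" corresponds to the recursive reconstruction that the message mentions.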
Max Pumperla
b609bdf898
[docs] Improve connection between library references and their APIs (#22800)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-04 16:48:03 +01:00
Antoni Baum
283666fe02
[docs] Update XGBoost/LightGBM-Ray docs (#22783)
Brings the docs up to date with XGBoost/LightGBM-Ray readmes.
2022-03-03 18:02:43 +01:00
Archit Kulkarni
e937f1a3c4
[runtime env] [Doc] add more details about runtime env logs (#22480)
Clarifies the logging behavior for runtime envs, and adds the runtime env logs file to the list of log files in the main logging page.
2022-03-02 14:27:28 -08:00
Max Pumperla
d53d0e0f50
[docs] Typo - fixes #22761 (#22763)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:34:46 +01:00
Max Pumperla
7d4296c72f
run code in browser (#22727)
Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand.
If this seems useful we can extend this to other examples gradually.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:27:00 +01:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Eric Liang
e228544d39
Undo revert of windowing dataset by bytes (#22735) 2022-03-01 12:24:04 -08:00
Kenneth
9b67cb5a6f
Add buffering to object spilling (#22618)
This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude.

### Performance Changes
A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling.

| Run | Produce (s) | Consume (s) | Total (s) |
| -- | -- | -- | -- |
| Baseline (original) | 347.332281 | 355.611272 | 705.560750 |
| Baseline (w/ fix) | 181.815852 | 347.692850 | 532.847759 |
| No fusing (original) | 453.574554 | 525.047998 | 981.620108 |
| No fusing (w/ fix) | 452.614848 | 519.787698 | 975.412639 |

The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to have a 48% time reduction with negligible impact on runtime when fusing is disabled.

See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-01 10:13:10 -08:00
Stephanie Wang
73f078236f
[doc] Update docs about actor garbage collection (#20763)
Update outdated actor docs about when actors are GCed.
2022-02-28 18:45:29 -08:00
Jiaxin Shan
32829ff9ad
[KubeRay] Provide a new Dockerfile for fast build (#22689)
Adds a new Dockerfile for fast build and development of KubeRay.
2022-02-28 17:09:16 -08:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00