Commit graph

1953 commits

Author SHA1 Message Date
Eric Liang
38925f60d2
Add a get_if_exists option for simpler creation of named actors (#23344)
Getting or creating a named actor is a common pattern, however it is somewhat esoteric in how to achieve this. Add a utility function and test that it doesn't cause any scary error messages.

Actor.options(name="my_singleton", get_if_exists=True).remote(args)
2022-03-23 22:02:58 -07:00
Jiajun Yao
ce93bfff7e
Fix broken doc link (#23440)
https://github.com/ray-project/ray/blob/master/benchmarks/README.md is moved to a new place.
2022-03-23 18:54:02 -07:00
Chen Shen
48d456d373
[RFC][Doc] add a page describe actor execution order. (#23406)
* add

* task-orders

* fix

* address comments

* add

* address comments
2022-03-23 11:07:18 -07:00
Kai Fricke
668eade515
[docs] Add oracle to linkcheck ignore list (#23422)
This link currently breaks the linter CI.
2022-03-23 17:14:52 +00:00
Max Pumperla
9b1a3f9f9a
[docs] fix nav (#23417)
Algolia search now does not overflow on mobile devices anymore, making the nav scrollable again.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-23 10:38:33 +00:00
mwtian
51feac9868
Clean up dev docs (#23407) 2022-03-22 23:22:56 -07:00
Richard Liaw
1fe110f8f4
[ml] Add a starter page for docstrings (#23312) 2022-03-21 17:20:45 -07:00
Kai Fricke
b64452bc63
[tune] Add multinode sync test (#23229)
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
2022-03-21 17:02:17 +00:00
Michael (Mike) Gelbart
99d60ef18c
[docs] Fix typos in ray docs contributing guide (#23360)
There are a couple typos in the [Ray contributing guide](https://docs.ray.io/en/master/ray-contribute/docs.html). I fixed the typos, added a relevant link, and reworded a sentence.
2022-03-21 10:01:41 -07:00
Jiajun Yao
d3159f201b
[Doc] Add scheduling doc (#23343) 2022-03-20 16:05:06 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
Junwen Yao
8fff665455
[Train] Add torch data prefetch benchmark example (#22974)
Add a benchmark example for the auto pipeline functionality for host to device data transfer.
2022-03-18 13:27:26 -07:00
Jian Xiao
0b1a2a44c0
[Dataset GA doc] Decompose the monolith of Getting Started page (and get them under User Guide) (#23311)
Improve the Dataset documentation for GA.
2022-03-18 11:25:43 -07:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Support set timeout second for timeout of runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
Archit Kulkarni
16fd099b8b
[runtime env] Change pip_check default from True to False (#23306)
@SongGuyang @Catch-Bull @edoakes  I know we discussed this earlier, but after thinking about it some more I think a more reasonable default is for `pip check` to be `False` by default.  My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise.  So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up.  Another less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.

This also matches the default behavior of pip which allows this situation to occur in the first place:  `pip install` doesn't error when there's a dependency conflict; rather the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR")
2022-03-18 12:51:41 -05:00
shrekris-anyscale
86169d2452
[docs] Fix malformatted list in "Advanced Pattern: Fault Tolerance with Actor Checkpointing" (#23319) 2022-03-18 10:50:13 -07:00
Eric Liang
08dc31e747
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
Guyang Song
1ad019aac3
[C++ API][Doc] Add doc and error log to notice C++ API is not supported on Windows (#23272)
We don't support Windows entirely now.

## Checks

- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2022-03-18 10:52:57 +08:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
Archit Kulkarni
684a1821d3
[Doc] [runtime_env] Add limitation about single-file py_modules to doc (#23248)
Until #23151 is fixed, this PR adds it as a known limitation in the documentation.
2022-03-17 16:23:46 -05:00
Simon Mo
f400b4333a
[Serve] Remove legacy pipeline codebase (#23172) 2022-03-17 13:27:16 -07:00
Jian Xiao
8c9e3f6c2e
Move the third-party data integrations (non-Dataset stuff) out of the user guides which is for Dataset (#23162)
Improve documentation of Ray Dataset.
2022-03-17 11:27:40 -07:00
Eric Liang
c8f207f746
[docs] Core docs refactor (#23216)
This PR makes a number of major overhauls to the Ray core docs:

Add a key-concepts section for {Tasks, Actors, Objects, Placement Groups, Env Deps}.
Re-org the user guide to align with key concepts.
Rewrite the walkthrough to link to mini-walkthroughs in the key concept sections.
Minor tweaks and additional transition material.
2022-03-17 11:26:17 -07:00
Balaji Veeramani
83986a4d83
[Train] Add support for automatic mixed precision (#22227)
Closes #20643

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-19.us-west-2.compute.internal>
2022-03-16 20:53:02 -07:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster.  One gap in this is if the local Python module is in the form of a wheel (`.whl` file.)  This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
Max Pumperla
71c57c619b
[docs] RLlib broken links (fixes #23160) (#23226) 2022-03-16 12:38:18 +01:00
Kai Fricke
b80f79a072
[ci/multinode] Improve multi-node tests (#23196)
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
2022-03-16 09:59:50 +00:00
Eric Liang
678d23fe42
Remove beta label from Datasets (#23220) 2022-03-15 23:05:59 -07:00
Jian Xiao
10435d2d8f
Update dask version for Ray 1.12.0 (#23197) 2022-03-15 19:22:19 -07:00
Jiaxin Shan
158ff3394f
[Job submission] Improve job submission docs (#23115)
I am following job submission docs here https://docs.ray.io/en/latest/cluster/job-submission.html  and run some examples.

I notice there're few minor issues.

1. some required libraries are not imported in any code snippets
2.  Get job api returns `{'status': 'SUCCEEDED'}` instead of `job_status` so code snippet here doesn't work https://docs.ray.io/en/latest/cluster/job-submission.html#rest-api
2022-03-15 21:20:33 -05:00
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00
Eric Liang
ca1100397e
Update paper links to include exoshuffle and remove whitepaper (moved to docs) (#23099) 2022-03-15 13:12:01 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows user to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
2022-03-15 11:07:34 -07:00
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Max Pumperla
ad30123339
[docs] fix includes for md files (#23180)
the include of content for md files like our central getting started page didn't render. fixed here.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-15 11:09:18 +00:00
Pamphile Roy
81b17669a4
[core][docs] Document port/IP binding and slurm concerns (#22663)
Using Ray on SLURM system is documented but missing some pitfalls about network. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later).

I did not put any real recommendation on this last point since `--address` did not work. I had cannot resolve issue after setting an internal IP although it's reachable.
2022-03-15 01:43:46 -07:00
Jules S. Damji
0246f3532e
[DOC] Added a full example how to access deployments (#22401) 2022-03-14 21:15:52 -05:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
Jiaxin Shan
8823ca48b4
[Workflow] Improve workflow docs (#23114)
* [Workflow] Improve workflow docs

* Update doc/source/workflows/concepts.rst

Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
2022-03-13 18:55:45 -07:00
Scott Graham
f673acb0ad
Scgraham/azure docs (#22296)
Fixes potential error if function not found in azure sdk when deploying ray cluster on azure
Adds additional python package needed to deploy ray cluster on azure in docs

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-03-13 18:08:08 -07:00
Kenneth
07372927cc
Enable buffering and spilling to multiple remote storages (#22798)
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.

In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to a S3 bucket resulted in the following runtimes.

Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513

Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size.

For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.

Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742


Together these changes enable faster large-scale object spilling.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-11 11:27:02 -05:00
matthewdeng
3a3a7b4be4
[test] add back deleted datasets train test file (#23051) 2022-03-10 21:46:07 -08:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Max Pumperla
2b8faae40c
[docs] re/move old core examples (#22802) 2022-03-10 12:17:00 -08:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Max Pumperla
d8e862eaba
[docs] templates and contribution guide (fixes #21753) (#23003)
Adding an explicit contributor guide and example templates for our users to help with docs.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-10 15:28:07 +00:00
Dmitri Gekhtman
413fe08f87
Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847)
This PR consists of the following clean-up items for KubeRay autoscaler integration:

Remove the docker/kuberay directory

Move the Python files formerly in docker/kuberay to the autoscaler directory.

Use a rayproject/ray image for the autoscaler.

Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.

Slightly simplify the code that starts the autoscaler.

Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.

By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.

Add the autoscaler configuration test to the CI.

Update development documentation to reflect the changes in this PR.
2022-03-09 18:26:57 -08:00
Alex Wu
b84aaef38a
Promote python 3.9 support to stable (#22923)
Remove the experimental note from python 3.9 since it and its core dependencies have been stable for quite some time now.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-03-08 17:24:54 -08:00