Commit graph

13009 commits

Author SHA1 Message Date
mwtian
65d7a610ab
[Core] Push message to driver when a Raylet dies (#25516)
Currently when Raylets die, it is hard to figure out:

if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well.
reason of Raylet's death.
With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.
2022-06-09 05:54:34 -07:00
Jian Xiao
ce103b4ffa
Eagerly clears object memory before Python GC kicks in when consuming DatasetPipeline (#25461) 2022-06-09 00:37:56 -07:00
Amog Kamsetty
1316a2d05e
[AIR/Train] Move ray.air.train to ray.train (#25570) 2022-06-08 21:34:18 -07:00
Dmitri Gekhtman
836b08597f
[kuberay][autoscaler] Use new autoscaling fields from the KubeRay operator (#25386)
This PR incorporates recent autoscaler config changes from KubeRay.
2022-06-08 20:09:43 -07:00
matthewdeng
ba0a2a022a
[datasets] add Dataset.randomize_block_order (#25568)
This exposes a low-cost way to perform a pseudo global shuffle.

For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset in which single nodes (1) must send a lot of data over the network, and (2) perform lots of disk reads if the dataset is spilled to disk.

This allows the workload to be spread across the nodes on which the dataset blocks are on.
2022-06-08 18:39:15 -07:00
M Waleed Kadous
9e2e84bc1c
[docs] Add an example for simple highly parallelizable tasks. (#24885)
It's important to show how Ray can be used for easily parallelizable independent tasks. I put this together to demonstrate how to di this.
2022-06-08 18:10:37 -07:00
Clark Zinzow
6987ab5966
[Datasets] [Hotfix] Fix stats construction for from_* APIs. (#25601)
Stats construction on the from_arrow and from_numpy (and from_pandas with Pandas block support disabled) is currently broken since we weren't resolving the block metadata before passing it to the stats, causing future ds.stats() calls to fail. This PR fixes this and adds some test coverage.

Drivebys:

- Adds stats for from_pandas() zero-copy path (metadata fetch only).
- Changes "from_numpy" stats stage name to "from_numpy_refs", to be consistent with stats for other from_*() APIs.
2022-06-08 18:04:40 -07:00
shrekris-anyscale
f3c2bd6718
[Serve] Make REST API deployments inherit top-level runtime_env (#25502) 2022-06-08 15:58:00 -07:00
Antoni Baum
7616435ed0
[Docs] Capitalize Ray AIR (#25597) 2022-06-08 14:37:53 -07:00
Kai Fricke
aa142eb377
[RLlib; CI] Add team:rllib tag for Bazel. (#25589)
Currently, team:ml spans all ML (Tune, Train, AIR) tests and rllib tests. rllib tests are much more flaky and it would be good to split them up in the flaky test tracker. This PR changes Rllib-tests from team:ml to team:rllib to enable this separation.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-08 22:25:59 +01:00
Archit Kulkarni
6d2806f951
[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562)
The current inheritance behavior for runtime_envs enables the following workflow for Jobs:  A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.

There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.

This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:

- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
2022-06-08 13:54:06 -07:00
Jian Xiao
50c854b1ad
Fix hyperlink in rst doc (#25427)
Hyperlink not working

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-08 13:46:23 -07:00
Antoni Baum
16733c2271
[AIR] Delayed type checking for Preprocessors (#25587)
Breaks the hard dependency on Preprocessor imports for type hints in AIR. Preparation for move of Preprocessors to `ray.data`.

Trainer still has a hard dependency due to an `isinstance` check.
2022-06-08 13:15:54 -07:00
Dmitri Gekhtman
5cc2e15a1f
[CI][minor] Disallow filters if command isn't specified (#25593)
Trivial "developer experience" tweak to the ci repro script:
disallow filtering commands if we're not running the commands.
2022-06-08 20:52:51 +01:00
Hanming Lu
d3e5bf97b5
more informative GCPNodeProvider create_node return (#25416)
More informative return value for GCPNodeProvider create_node
2022-06-08 12:34:09 -07:00
Amog Kamsetty
3a728c4e35
[Train] Mark Trainer interfaces as Deprecated (#25573)
Marks Trainer interfaces as Deprecated. This PR does not make any changes to the docs.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-08 12:30:32 -07:00
Stephanie Wang
6274bb354c
[tests] Deflake test_reconstruction.py::test_basic_reconstruction_actor_task[False] (#25456)
This test was flaky because actor tasks can fail if submitted when the actor process is failed or restarting. This PR changes the test to be more stressful so that the error is easier to reproduce and changes the max_retries parameter to -1 so that the actor task will succeed.

Related issue number

Closes #24942.
2022-06-08 11:21:57 -07:00
Artur Niederfahrenhorst
9226643433
[RLlib] Issue 4965: Fixes PyTorch grad clipping logic and adds grad clipping to QMIX. (#25584) 2022-06-08 19:40:57 +02:00
Sihan Wang
a9e7836e8c
[Serve] Skip flaky test_autoscaling_policy on windows (#25526) 2022-06-08 10:33:40 -07:00
Clark Zinzow
9dc0bb3d5e
[Datasets] Unrevert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031)" (#25531)
Unreverts #24812, skipping the memory releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory releasing tests, once we find a more reliable way to test them.

* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)"

This reverts commit fb2933a78f.

* Skip shuffle memory release test.
2022-06-08 10:33:25 -07:00
Amog Kamsetty
1be32e5977
[AIR] Add _predict_arrow interface for Predictor (#25579)
* add interface

* update docstring
2022-06-08 10:27:29 -07:00
Pamphile Roy
0bbc3379bd
Fix SciPy pinning (#25148)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-06-08 10:26:59 -07:00
Amog Kamsetty
80ae651f25
[Train] Clean up ray.train package (#25566) 2022-06-08 10:22:36 -07:00
Sven Mika
388fb98c79
[RLlib] CRR Tests fixes. (#25586) 2022-06-08 19:18:55 +02:00
Kai Fricke
c3b608f757
[tune] Fix cloud tests, mark as stable (#25583)
#25063 broke release tests, but they've been consistently stable before. This PR fixes the tests and marks tune cloud tests as stable.
2022-06-08 17:47:54 +01:00
Archit Kulkarni
3296345557
Add warning about entrpoint command in quotes (#25519) 2022-06-08 09:38:55 -07:00
ZhuSenlin
98685f415e
[Core] fix raylet stuck when forking worker (#25348)
The raylet may stuck when forking worker  because of the reading of fd, which is similar to this issue (https://github.com/gperftools/gperftools/issues/178).

At present, the decouple field is only set when running the CI test, so we can make the read function execute only when decouple is set, so it will not affect the production job. 

Child process(worker):
![image](https://user-images.githubusercontent.com/2016670/171430414-a0ebaed4-1aef-4faa-9e00-5e4d993a9db0.png)

Parent process(raylet):
![image](https://user-images.githubusercontent.com/2016670/171430507-2266b88d-2ad3-4493-b6a8-358c1613a561.png)

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-06-08 22:38:22 +08:00
Kai Fricke
8affbc7be6
[tune/train] Consolidate checkpoint manager 3: Ray Tune (#24430)
**Update**: This PR is now part 3 of a three PR group to consolidate the checkpoints.

1. Part 1 adds the common checkpoint management class #24771 
2. Part 2 adds the integration for Ray Train #24772
3. This PR builds on #24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class.

Old PR description:

This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core.

- This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core
- This PR prepares Tune to move to a CheckpointStrategy object
- In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather on runtime for Ray Train)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-08 12:05:34 +01:00
kourosh hakhamaneshi
4cdd508f70
[RLlib] Added CRR implementation. (#25499) 2022-06-08 11:42:02 +02:00
Lixin Wei
00dbff507f
[Build] Upgrade the tool for generating compile_commands.json (#25430) 2022-06-08 15:00:49 +08:00
shrekris-anyscale
d75fd5d9f3
Make shrekris-anyscale a codeowner of Serve (docs) (#25576) 2022-06-07 18:25:46 -07:00
xwjiang2010
29a063afdf
[air] add feast example (#25417) 2022-06-07 14:55:42 -07:00
Amog Kamsetty
e0a63f770f
[Data/AIR] Move TensorExtension to ray.air for use in other packages (#25517)
Moves Tensor extensions to ray.air to facilitate their use in other Ray libraries (AIR, Serve).
2022-06-07 14:53:22 -07:00
xwjiang2010
76b34d4a03
[air] add to_air_checkpoint method for inference only workload. (#25444)
Follow up on our last discussion for supporting piecemeal fashion air users.
Only did for tensorflow for now, want to collect some feedback on API naming, package structure etc and I will add others.
2022-06-07 14:50:39 -07:00
Sebastián Ramírez
3257994e80
♻️ Refactor types to detect invalid extra arguments (#25541)
Currently, each function decorated with `@ray.remote` is marked with type annotations as a `RemoteFunction` class (only used for type annotations, autocompletion, inline errors, etc). The current class takes several *type parameters*. And then it uses those parameters in the extended `func.remote()` method.

But with the current type annotations, it marks any of the unused type parameters as `None`. This means that calling the `.remote()` method would check the first (actual) arguments and the rest are marked as `None`, but that means that for type annotations it considers "correct" to pass extra `None` arguments, while actually, that would not be valid. So, this doesn't show an error, but it should:

<img width="371" alt="Screenshot 2022-06-07 at 05 38 48" src="https://user-images.githubusercontent.com/1326112/172360355-9b344220-7824-4b5c-87da-038f5b53fe04.png">

...those 2 extra `None` values should be marked as invalid.

After this PR, those invalid extra arguments would be marked as invalid:

<img width="588" alt="Screenshot 2022-06-07 at 05 42 10" src="https://user-images.githubusercontent.com/1326112/172360956-424b40d4-8197-4663-8298-617a1df37658.png">

And:

<img width="687" alt="Screenshot 2022-06-07 at 05 42 50" src="https://user-images.githubusercontent.com/1326112/172361140-eb93c675-f5d6-4e0c-b9b2-83c4801bb450.png">

## More context

I also tried the new `TypeVarTuple`, it might simplify these type annotations in the future, but it's not currently supported by mypy yet, it's a very recent addition to the language (and `typing_extensions`) so it's probably too early to adopt it.
2022-06-07 14:34:34 -07:00
Antoni Baum
3876fcdbe8
[CI] Add bazel py_test checking for Serve (#25509) 2022-06-07 10:54:10 -07:00
Jun Gong
9b65d5535d
[RLlib] Introduce basic connectors library. (#25311) 2022-06-07 19:18:14 +02:00
Amog Kamsetty
4e887fe776
[Tune] Remove docstring for private _StatusReporter (#25520)
Remove outdated docstrings for _StatusReporter.

In response to https://discuss.ray.io/t/how-to-use-ray-tune-function-runner-statusreporter-with-tune-with-parameters/6400/2
2022-06-07 10:11:29 -07:00
Simon Mo
7471b1fa41
[Serve] [AIR] ModelWrapper improvements and docs (#25003)
* batching collation code and tests

* wip notebook for np and dataframe

* finish content

* reset ray-more-libs changes

* add comments

* run through

* Apply suggestions from code review

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>

* rename package

* lint

* richard's comment

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
2022-06-07 08:53:10 -07:00
John B Nelson
e913352bdc
[Doc] Remove trailing period symbol install instruction (#25543) 2022-06-07 08:08:04 -07:00
Kai Fricke
45bf925ef0
[train/serve] Fix torch tune serve test (#25547)
#24772 broke the smoke test as it was not run on CI - this PR hotfixes this
2022-06-07 15:54:37 +01:00
Kai Fricke
984b9a5e6c
[tune/train] Consolidate checkpoint manager 2: Ray Train (#24772)
This is a follow-up from #24771 which moves the Ray Train implementation to use the new common checkpoint manager class.
2022-06-07 13:51:42 +01:00
Rohan Potdar
a9d8da0100
[RLlib]: Doubly Robust Off-Policy Evaluation. (#25056) 2022-06-07 12:52:19 +02:00
Artur Niederfahrenhorst
429d0f0eee
[RLlib] Fix multi agent environment checks for observations that contain only some agents' obs each step. (#25506) 2022-06-07 10:33:35 +02:00
Artur Niederfahrenhorst
35bd397181
[RLlib] Better default values for training_intensity and target_network_update_freq for R2D2. (#25510) 2022-06-07 10:29:56 +02:00
Zhe Zhang
6793426a9d
[Docs; RLlib] Remove $ from rllib pip install instructions (#25358) 2022-06-07 08:57:17 +02:00
Philipp Moritz
ec02e78b01
[docs] Use better method to mock ObjectRef (#25535)
Actually fix #25498
2022-06-06 23:50:52 -07:00
Eric Liang
c1afbcb6f4
[air] Enforce API stability annotations for AIR module (#25485) 2022-06-06 22:52:21 -07:00
Eric Liang
78688a0903
Enable streaming ingest in AIR (#25428)
This adds the following options to DatasetConfig, which can be used to enable streaming ingest.

```
    # Whether the dataset should be streamed into memory using pipelined reads.
    # When enabled, get_dataset_shard() returns DatasetPipeline instead of Dataset.
    # The amount of memory to use is controlled by `stream_window_size`.
    # False by default for all datasets.
    use_stream_api: Optional[bool] = None

    # Configure the streaming window size in bytes. A typical value is something like
    # 20% of object store memory. If set to -1, then an infinite window size will be
    # used (similar to bulk ingest). This only has an effect if use_stream_api is set.
    # Set to 1.0 GiB by default.
    stream_window_size: Optional[float] = None

    # Whether to enable global shuffle (per pipeline window in streaming mode). Note
    # that this is an expensive all-to-all operation, and most likely you want to use
    # local shuffle instead.
    # False by default for all datasets.
    global_shuffle: Optional[bool] = None
```
2022-06-06 17:42:15 -07:00
Yi Cheng
aabe9e73ef
Revert "[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027)" (#25530)
This reverts commit 9a510f92cf.
2022-06-06 16:41:42 -07:00