12942 commits

Author SHA1 Message Date
Eric Liang
b52cd964cb
[docs] Move the workflows (alpha) library to the more libraries section for now (#25704) 2022-06-11 19:47:45 -07:00
Philipp Moritz
d8ec5929b6
Exclude Bazel build files from Ray wheels (#25679)
Including the Bazel build files in the wheel leads to problems if the Ray wheels are brought in as a dependency from another Bazel workspace, since that workspace will not recurse into the directories of the wheel that contain BUILD files -- this can lead to dropped files.

This only happens for macOS wheels; on Linux wheels the BUILD files were already excluded.
2022-06-11 16:05:59 -07:00
Kai Fricke
736c7b13c4
[CI] Fix team to rllib (from ml) for some replay buffer API tests. (#25702) 2022-06-11 18:05:16 +02:00
Sven Mika
130b7eeaba
[RLlib] Trainer to Algorithm renaming. (#25539) 2022-06-11 15:10:39 +02:00
Yi Cheng
0c527b4502
[1/2][serve] Use GcsClient to replace the kv client to use timeout. (#25633)
Timeout is only introduced in GcsClient because the Ray client does not define timeouts well for its API, and making them work end to end would take a lot of effort. For built-in components, we should use GcsClient directly.

This PR uses GcsClient to replace the old KV client in order to integrate GCS HA with Ray Serve.
2022-06-10 23:41:49 -07:00
Eric Liang
d36fd77548
[air] Allow fusing task and actor stages if they have compatible resource types (#25683) 2022-06-10 19:04:27 -07:00
Clark Zinzow
4fb92dd2f1
[Datasets] Fix __array__ protocol on TensorArrayElement and TensorArray. (#25647)
This PR fixes two issues with the __array__ protocol on the tensor extension:

1. The __array__ protocol on TensorArrayElement was missing the dtype parameter, causing np.asarray(tae, dtype=some_dtype) calls to fail. This PR adds support for the dtype argument.
2. TensorArray and TensorArrayElement didn't support NumPy's scalar casting semantics for single-element tensors. This PR adds support for these scalar casting semantics.
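
The two fixes above can be illustrated with a minimal sketch. `TensorElement` is a hypothetical stand-in, not the actual `TensorArrayElement` class from this PR, and the sketch assumes NumPy is installed. It shows an `__array__` method that accepts `dtype`, plus scalar casting for single-element tensors.

```python
import numpy as np


class TensorElement:
    """Hypothetical stand-in for a single element of a tensor column."""

    def __init__(self, values):
        self._tensor = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy passes `dtype` when the caller writes np.asarray(obj, dtype=...).
        # `copy` is accepted only so that newer NumPy versions can pass it;
        # it is ignored in this sketch.
        if dtype is None:
            return self._tensor
        return self._tensor.astype(dtype)

    def __float__(self):
        # .item() raises ValueError unless the tensor holds exactly one element.
        return float(self._tensor.item())

    def __int__(self):
        return int(self._tensor.item())


element = TensorElement([1, 2, 3])
as_float32 = np.asarray(element, dtype=np.float32)

# A single-element tensor can be cast to a Python scalar.
single = float(TensorElement([[2.5]]))
```

Without the `dtype` parameter in the signature, the `np.asarray(element, dtype=np.float32)` call would fail, which is the first issue the message describes.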
2022-06-10 16:42:16 -07:00
Richard Liaw
1dd714e0fa
[rfc][doc] Add clarity to stability guidelines (#25611) 2022-06-10 15:19:21 -07:00
Avnish Narayan
d0f975e00f
[RLlib] Fix broken link replay buffer docs. (#25666) 2022-06-10 21:18:59 +02:00
mwtian
dcfed617e5
[Core] fix gRPC handlers' unlimited active calls configuration (#25626)
Ray's gRPC server wrapper configures a max active call setting for each handler. When the max active call setting is -1, the handler is supposed to handle an unlimited number of requests concurrently. In practice, however, handlers configured with unlimited active calls are often observed to handle at most 100 requests concurrently.

This is a result of the existing logic:

At a high level, each gRPC method is associated with a number of ServerCall objects (acting as "tags") in the gRPC completion queue. When there is no tag for a method, the gRPC server thread cannot poll requests for that method from the completion queue. After a request is polled from the completion queue, it is processed by the polling gRPC server thread, then queued to an event loop.
When a handler is in the "unlimited" mode, it creates a new ServerCall object (tag) before actual processing. The problem is that new ServerCalls are created on the event loop instead of the gRPC server thread. When the event loop runs a callback from the gRPC server, the callback creates a new ServerCall object, and can run the gRPC handler to completion if the handler does not have any async step. So overall, the event loop will not run more callbacks than the initial number of ServerCalls, which is 100 in the "unlimited" mode.
The solution is to create a new ServerCall in the gRPC server thread, before sending the ServerCall to the event loop.

Running some nightly tests to verify the fix does not introduce instabilities: https://buildkite.com/ray-project/release-tests-branch/builds/652

Also, looking into adding gRPC server / client stress tests with a large number of concurrent requests.
2022-06-10 11:28:41 -07:00
Jiao
6b9b1f135b
[Deployment Graph] Move files out of pipeline folder (#25630) 2022-06-10 10:39:03 -07:00
Sihan Wang
2546fbf99d
[Serve] Autoscaling for deployment graph (#25424) 2022-06-10 10:21:49 -07:00
mwtian
1483c4553c
use smaller instance for scheduling tests (#25635)
m5.16xlarge instances have 64 CPUs and 256GB of memory, which is overkill for scheduling tests that do not involve much computation. Use the smaller m5.4xlarge instance type to save cost and make allocating instances easier.
2022-06-10 17:09:35 +00:00
Simon Mo
271c7d73ac
[AIR][Serve] Add support for multi-modal array input (#25609) 2022-06-10 09:19:42 -07:00
Sven Mika
7c39aa5fac
[RLlib] Trainer.training_iteration -> Trainer.training_step; Iterations vs reportings: Clarification of terms. (#25076) 2022-06-10 17:09:18 +02:00
Artur Niederfahrenhorst
94d6c212df
[RLlib] Replay Buffer API documentation. (#24683) 2022-06-10 16:47:51 +02:00
Artur Niederfahrenhorst
c3645928ca
[RLlib] Fix no gradient clipping happening in QMix. (#25656) 2022-06-10 13:51:26 +02:00
Avnish Narayan
730df43656
[RLlib] Issue 25503: Replace torch.range with torch.arange. (#25640) 2022-06-10 13:21:54 +02:00
kourosh hakhamaneshi
b3a351925d
[RLlib] Added meaningful error for multi-agent failure of SampleCollector in case no agent steps in episode. (#25596) 2022-06-10 12:30:43 +02:00
Artur Niederfahrenhorst
8af9ef8fee
[RLlib] Discussion 6432: Automatic train_batch_size calculation fix. (#25621) 2022-06-10 12:15:57 +02:00
Jian Xiao
67b2eca6a2
Fix a few type annotations that may confuse people (#25645) 2022-06-09 23:15:21 -07:00
shrekris-anyscale
5586b89b1c
[Serve] Improve logs for new Serve REST API (#25610) 2022-06-09 17:04:09 -07:00
Antoni Baum
445400d727
[CI] Print a summary of broken links in LinkCheck (#25634) 2022-06-09 17:03:53 -07:00
Amog Kamsetty
2614c24e47
[AIR] Add predict_pandas implementation (#25534)
Implements conversion utilities and a default predict implementation for Predictor.

Depends on #25517
2022-06-09 16:55:58 -07:00
matthewdeng
88524d8b57
[air] add CustomStatefulPreprocessor (#25497) 2022-06-09 16:54:46 -07:00
Archit Kulkarni
6f3de2af86
[Serve] Fix outdated Serve warning message for sync handle (#25453) 2022-06-09 14:50:48 -07:00
Simon Mo
ef1b565699
[CI] Pin starlette and fastapi version (#25604) 2022-06-09 13:55:18 -07:00
Jimmy Yao
2511e66d7e
[Datasets] [AIR] Fixes label tensor squeezing in to_tf() (#25553) 2022-06-09 12:32:13 -07:00
Eric Liang
a058a98c5d
[docs] Try to clarify some advantages of bulk ingest in the AIR ingest docs (#25616) 2022-06-09 11:47:22 -07:00
Kai Fricke
f17ced04dd
[air/tune] Exclude in remote storage upload (#25544)
This adds an exclude option to upload_to_uri() which will be needed for refactoring the Tune syncing/sync client structure.
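
As a rough illustration of what an exclude option can look like, the sketch below filters a file list before upload. The commit message does not specify the real signature or matching rules of `upload_to_uri()`, so `filter_excluded` and its glob-style patterns are assumptions made for this example only.

```python
import fnmatch
from typing import Iterable, List, Optional


def filter_excluded(paths: Iterable[str],
                    exclude: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: drop paths that match any glob pattern in `exclude`."""
    if not exclude:
        return list(paths)
    return [
        path for path in paths
        if not any(fnmatch.fnmatch(path, pattern) for pattern in exclude)
    ]


# Illustrative file names, not taken from the PR.
files = ["checkpoint_000001/model.pt", "result.json", "events.out.tfevents.1"]
kept = filter_excluded(files, exclude=["*.tfevents.*", "*.json"])
```

Only the paths in `kept` would then be handed to the upload step.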
2022-06-09 20:12:53 +02:00
Robert
a92a06860f
[Datasets] Allow for len(Dataset) (#25152)
Small QOL change that allows for len(Dataset) to be used rather than calling Dataset.count()
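
A minimal sketch of the idea, using a hypothetical `MiniDataset` class rather than Ray's actual `Dataset`: `__len__` simply delegates to `count()`.

```python
class MiniDataset:
    """Hypothetical stand-in for a dataset stored as a list of blocks."""

    def __init__(self, blocks):
        self._blocks = blocks

    def count(self):
        # Total number of rows across all blocks.
        return sum(len(block) for block in self._blocks)

    def __len__(self):
        # Lets callers write len(ds) instead of ds.count().
        return self.count()


ds = MiniDataset([[1, 2], [3], [4, 5, 6]])
num_rows = len(ds)
```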
2022-06-09 10:36:41 -07:00
matthewdeng
eff72f9a72
[train] fix transformers example for multi-gpu (#24832)
Accelerate depends on this environment variable being set for proper GPU device placement.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-09 09:17:35 -07:00
Artur Niederfahrenhorst
7495e9c89c
[RLlib] Dreamer Policy sub-classing schema. (#25585) 2022-06-09 17:14:15 +02:00
mwtian
65d7a610ab
[Core] Push message to driver when a Raylet dies (#25516)
Currently, when Raylets die, it is hard to figure out:

- whether a Raylet died at all in a cluster. Usually we have to check the nodes where a number of workers died and see if the Raylet has died as well.
- the reason for the Raylet's death.

With this PR, if a Raylet dies for a reason other than SIGTERM, the dashboard agent will report the failure along with the last 20 lines of the Raylet log.
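
Collecting the last 20 lines of a log can be sketched as follows. This is a generic Python approach, not the dashboard agent's actual code, and `tail_lines` is a hypothetical name.

```python
import io
from collections import deque


def tail_lines(stream, n=20):
    """Return the last `n` lines of a text stream, keeping at most `n` in memory."""
    return [line.rstrip("\n") for line in deque(stream, maxlen=n)]


# An in-memory stream stands in for a Raylet log file in this sketch.
fake_log = io.StringIO("".join(f"log line {i}\n" for i in range(50)))
last_lines = tail_lines(fake_log)
```

A bounded `deque` discards older lines as it reads, so the whole log never has to be held in memory.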
2022-06-09 05:54:34 -07:00
Jian Xiao
ce103b4ffa
Eagerly clears object memory before Python GC kicks in when consuming DatasetPipeline (#25461) 2022-06-09 00:37:56 -07:00
Amog Kamsetty
1316a2d05e
[AIR/Train] Move ray.air.train to ray.train (#25570) 2022-06-08 21:34:18 -07:00
Dmitri Gekhtman
836b08597f
[kuberay][autoscaler] Use new autoscaling fields from the KubeRay operator (#25386)
This PR incorporates recent autoscaler config changes from KubeRay.
2022-06-08 20:09:43 -07:00
matthewdeng
ba0a2a022a
[datasets] add Dataset.randomize_block_order (#25568)
This exposes a low-cost way to perform a pseudo global shuffle.

For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset in which single nodes (1) must send a lot of data over the network, and (2) perform lots of disk reads if the dataset is spilled to disk.

This allows the workload to be spread across the nodes that hold the dataset blocks.
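
The idea can be sketched in plain Python, with lists standing in for blocks. This is not the actual `Dataset.randomize_block_order` implementation. Only the order of the blocks changes, while the rows inside each block stay put, which is why the operation is cheap compared to a full shuffle.

```python
import random


def randomize_block_order(blocks, seed=None):
    """Return the blocks in a random order without touching rows inside them."""
    rng = random.Random(seed)
    shuffled = list(blocks)  # shallow copy: only block references are moved
    rng.shuffle(shuffled)
    return shuffled


blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
reordered = randomize_block_order(blocks, seed=0)
```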
2022-06-08 18:39:15 -07:00
M Waleed Kadous
9e2e84bc1c
[docs] Add an example for simple highly parallelizable tasks. (#24885)
It's important to show how Ray can be used for easily parallelizable independent tasks. I put this together to demonstrate how to do this.
2022-06-08 18:10:37 -07:00
Clark Zinzow
6987ab5966
[Datasets] [Hotfix] Fix stats construction for from_* APIs. (#25601)
Stats construction for the from_arrow and from_numpy APIs (and from_pandas with Pandas block support disabled) is currently broken because we weren't resolving the block metadata before passing it to the stats, causing future ds.stats() calls to fail. This PR fixes this and adds some test coverage.

Drivebys:

- Adds stats for from_pandas() zero-copy path (metadata fetch only).
- Changes "from_numpy" stats stage name to "from_numpy_refs", to be consistent with stats for other from_*() APIs.
2022-06-08 18:04:40 -07:00
shrekris-anyscale
f3c2bd6718
[Serve] Make REST API deployments inherit top-level runtime_env (#25502) 2022-06-08 15:58:00 -07:00
Antoni Baum
7616435ed0
[Docs] Capitalize Ray AIR (#25597) 2022-06-08 14:37:53 -07:00
Kai Fricke
aa142eb377
[RLlib; CI] Add team:rllib tag for Bazel. (#25589)
Currently, team:ml spans all ML (Tune, Train, AIR) tests and RLlib tests. RLlib tests are much flakier, and it would be good to split them up in the flaky test tracker. This PR changes RLlib tests from team:ml to team:rllib to enable this separation.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-08 22:25:59 +01:00
Archit Kulkarni
6d2806f951
[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562)
The current inheritance behavior for runtime_envs enables the following workflow for Jobs: a working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.

There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.

This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:

- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
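
The difference between the two behaviors discussed above can be modeled with plain dictionaries. This is a simplified sketch, not Ray's actual merging logic. The function names and the example values are hypothetical.

```python
def inherit_runtime_env(parent, child):
    """Current behavior (simplified): child fields are layered over the parent's."""
    merged = dict(parent)
    merged.update(child)
    return merged


def overwrite_runtime_env(parent, child):
    """Alternative under discussion: the child env fully replaces the parent's."""
    return dict(child)


# Illustrative values only.
driver_env = {"working_dir": "./my_project"}
task_env = {"pip": ["requests"]}

inherited = inherit_runtime_env(driver_env, task_env)
overwritten = overwrite_runtime_env(driver_env, task_env)
```

With inheritance, the task still sees the driver's `working_dir`. With a full overwrite, it would be lost unless the task env sets it explicitly.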
2022-06-08 13:54:06 -07:00
Jian Xiao
50c854b1ad
Fix hyperlink in rst doc (#25427)
Hyperlink not working

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-08 13:46:23 -07:00
Antoni Baum
16733c2271
[AIR] Delayed type checking for Preprocessors (#25587)
Breaks the hard dependency on Preprocessor imports for type hints in AIR. Preparation for move of Preprocessors to `ray.data`.

Trainer still has a hard dependency due to an `isinstance` check.
2022-06-08 13:15:54 -07:00
Dmitri Gekhtman
5cc2e15a1f
[CI][minor] Disallow filters if command isn't specified (#25593)
Trivial "developer experience" tweak to the CI repro script: disallow filtering commands if we're not running the commands.
2022-06-08 20:52:51 +01:00
Hanming Lu
d3e5bf97b5
more informative GCPNodeProvider create_node return (#25416)
More informative return value for GCPNodeProvider create_node
2022-06-08 12:34:09 -07:00
Amog Kamsetty
3a728c4e35
[Train] Mark Trainer interfaces as Deprecated (#25573)
Marks Trainer interfaces as Deprecated. This PR does not make any changes to the docs.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-08 12:30:32 -07:00
Stephanie Wang
6274bb354c
[tests] Deflake test_reconstruction.py::test_basic_reconstruction_actor_task[False] (#25456)
This test was flaky because actor tasks can fail if they are submitted while the actor process has failed or is restarting. This PR makes the test more stressful so that the error is easier to reproduce, and changes the max_retries parameter to -1 so that the actor task will succeed.
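
The effect of setting the retry count to -1 can be sketched with a plain-Python retry loop. This is a model of the semantics only, assuming (as the message implies) that -1 means "retry indefinitely". It is not Ray's retry code, and `call_with_retries` and `make_flaky` are hypothetical helpers.

```python
import itertools


def call_with_retries(fn, max_retries):
    """Call fn, retrying on RuntimeError; max_retries=-1 retries forever."""
    attempts = itertools.count() if max_retries < 0 else range(max_retries + 1)
    last_error = None
    for _ in attempts:
        try:
            return fn()
        except RuntimeError as err:
            last_error = err
    raise last_error


def make_flaky(failures):
    """Return a function that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def flaky():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("actor is restarting")
        return "ok"

    return flaky


# With unlimited retries the call eventually succeeds.
result = call_with_retries(make_flaky(3), max_retries=-1)
```

With a bounded retry count smaller than the number of failures, the same call would raise instead, which matches the flakiness the message describes.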

Related issue number

Closes #24942.
2022-06-08 11:21:57 -07:00