Commit graph

6128 commits

Author SHA1 Message Date
Sven Mika
3f03ef8ba8
[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. (#21356) 2022-02-03 09:32:09 +01:00
Max Pumperla
092598774a
[Docs] Executable notebook tutorial (#22030)
We're introducing [MyST Notebooks](https://myst-nb.readthedocs.io/en/latest/index.html) here and demonstrating how they work by rewriting (and extending) the RLlib Serve tutorial. Benefits:

- [x] Write notebooks in markdown. Can be converted into other formats e.g. with `jupytext`
- [x] Tutorials like this have a binderhub link added to the top nav (launch button).
- [x] Notebooks get executed when docs are built, so it's impossible to have stale docs.
- [x] But locally those builds are cached so that you don't have to wait too long.
- [x] The notebook cell outputs can be shown, hidden or removed.  In particular, we can now avoid adding expected code output as comments in our scripts (which might get outdated).

We're also clarifying  #22022. 

Old tutorial: [here](https://docs.ray.io/en/latest/serve/tutorials/rllib.html)
New tutorial (preview): [here](https://ray--22030.org.readthedocs.build/en/22030/serve/tutorials/rllib.html)

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-02-03 08:13:04 +00:00
Eric Liang
c8f93dfdec
[data] Misc: add_column() and allow specifying decoding error handling in from_text() (#21967)
This adds some utility functions to make it easier to manipulate structured data in Datasets. While in principle you can already do this with map_batches, this makes it a little easier to test things out for development.
2022-02-02 20:47:17 -08:00
birgerbr
826a8bb06c
[core] Use FileLock on ports_by_node.json (#18909)
The new code uses a file-lock before reading and writing to `ports_by_node.json`.
Without it, multiple nodes may write to ports_by_node.json at the same time.
2022-02-02 14:57:19 -06:00
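The locking pattern described in #18909 can be sketched as follows. This is a minimal illustration, not the code from the PR: it uses the standard library's `fcntl.flock` (POSIX only) instead of the FileLock class named in the commit title, and the helper name `update_ports_file` and the `.lock` suffix are hypothetical.

```python
import fcntl
import json
import os
import tempfile


def update_ports_file(path, node_id, ports):
    """Merge `ports` for `node_id` into the JSON file at `path` under an exclusive lock."""
    lock_path = path + ".lock"  # hypothetical lock-file naming
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            data = {}
            if os.path.exists(path):
                with open(path) as f:
                    data = json.load(f)
            data[node_id] = ports
            with open(path, "w") as f:
                json.dump(data, f)
            return data
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Demo: two updates against the same file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "ports_by_node.json")
update_ports_file(demo_path, "node-a", {"metrics": 5000})
final_state = update_ports_file(demo_path, "node-b", {"metrics": 5001})
```

Holding the lock across the whole read-modify-write cycle is the point: locking only the write would still let two writers read the same stale contents.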
Archit Kulkarni
78f882dbbc
[runtime env] Local uri caching for working_dir, py_modules and conda (#20273)
Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them.  In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache.  When the cache exceeds its size limit (default 10 GB), it deletes unused URIs until the cache is back under the size limit or there are no more unused URIs.

Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit

- Adds unit tests for caching and integration tests for working_dir caching
2022-02-02 14:53:03 -06:00
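The caching behavior described in #20273 can be sketched roughly as below. This is a toy model under stated assumptions: the class name `URICacheSketch`, its method names, and the `delete_fn` callback are hypothetical and are not taken from the PR.

```python
class URICacheSketch:
    """Toy cache: a URI whose reference count hits zero is only marked unused.
    Unused URIs are deleted when the total size exceeds the limit."""

    def __init__(self, max_total_size_bytes, delete_fn=lambda uri: None):
        self.max_total_size_bytes = max_total_size_bytes
        self.delete_fn = delete_fn  # hypothetical hook that removes local files
        self.sizes = {}  # uri -> size in bytes
        self.unused = set()

    def total_size(self):
        return sum(self.sizes.values())

    def add(self, uri, size_bytes):
        """Register a URI that is currently in use."""
        self.sizes[uri] = size_bytes
        self.unused.discard(uri)
        self._evict_if_needed()

    def mark_unused(self, uri):
        """Called when the reference count for `uri` drops to zero."""
        if uri in self.sizes:
            self.unused.add(uri)
            self._evict_if_needed()

    def _evict_if_needed(self):
        # Delete unused URIs until under the limit or none are left.
        while self.unused and self.total_size() > self.max_total_size_bytes:
            uri = self.unused.pop()
            self.delete_fn(uri)
            del self.sizes[uri]


# Demo with a 100-byte limit.
deleted = []
cache = URICacheSketch(100, delete_fn=deleted.append)
cache.add("a", 60)
cache.add("b", 60)      # over the limit, but nothing is unused, so nothing is deleted
cache.mark_unused("a")  # "a" is evicted to get back under the limit
cache.mark_unused("b")  # under the limit, so "b" stays cached
```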
Chris K. W
c95abe75a9
[client] Consistent ray.init return value (#21355)
Proposal document: https://docs.google.com/document/d/1ln7_fUST18GOz4jJnI_zN00hfczXY48V5Ajy6fCmJCE/edit# 

This PR changes the return value of ray.init when not in client mode to be a RayContext, which acts as a context manager and has the same public fields as ClientContext, as well as a disconnect method (which calls shutdown under the hood).

To prevent breaking scripts that rely on accessing through dict methods, RayContext also subclasses collections.abc.Mapping (can be treated as an immutable dict). This behavior will be removed in 2.0, so deprecation warnings are raised when __getitem__ is used. To make migration simple, an additional dict field address_info is added with the same values as the original return value.
2022-02-02 19:39:03 +02:00
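The backward-compatibility approach described in #21355 can be illustrated with a small stand-in class. This is a sketch only: `ContextSketch` is a hypothetical name, it does not call into Ray, and its `disconnect` just flips a flag where the real method would shut Ray down.

```python
import warnings
from collections.abc import Mapping


class ContextSketch(Mapping):
    """Stand-in for a context object that still supports dict-style access."""

    def __init__(self, address_info):
        self.address_info = dict(address_info)  # new, preferred access path
        self.connected = True

    def __getitem__(self, key):
        # Dict-style access keeps working but is flagged as deprecated.
        warnings.warn(
            "Dict-style access is deprecated; use .address_info instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.address_info[key]

    def __iter__(self):
        return iter(self.address_info)

    def __len__(self):
        return len(self.address_info)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.disconnect()

    def disconnect(self):
        self.connected = False  # placeholder for the real shutdown call


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with ContextSketch({"node_id": "abc"}) as ctx:
        legacy_value = ctx["node_id"]              # old style, warns
        new_value = ctx.address_info["node_id"]    # new style, silent
```

Subclassing `Mapping` gives read-only dict behavior (`keys()`, `items()`, iteration) without exposing mutation methods, which matches the "immutable dict" wording in the commit message.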
SangBin Cho
9531887590
[Placement Group] Fix infeasible placement group not scheduled after node is added (#21993)
The handling of existing infeasible placement groups in the placement group manager did not work properly. It is unclear how this feature was added when it cannot pass this simple test case.

This is what happened:

(1) PG is not schedulable because it is infeasible
(2) New node is added
(3) After a new node is added, placement group manager tries rescheduling all infeasible pgs.
(4) When the new node was added, we did not report its resources (oddly, resources are reported through a separate RPC). So when (3) happened, the PG was still unschedulable.

This PR fixes the issue by adding the resource information when the new node is added. 

Note that in the long term, we'd like to have a separate resource path from (4). This won't be addressed in this PR.
2022-02-02 06:44:42 -08:00
Yi Cheng
588d540b68
[client] Fix ray client object ref releasing in wrong context. (#22025) 2022-02-01 22:42:39 -08:00
Eric Liang
54fe2f80bb
[data] Always convert arrow batches to pandas batches when user specifies batch_format="native" (#21566)
With the addition of https://github.com/ray-project/ray/pull/20988, the native format becomes ambiguous. This PR proposes to auto-promote arrow to pandas blocks when the user specifies "native" format, to avoid uncertainty.
2022-02-01 21:26:37 -08:00
Eric Liang
cc74037b2e
Report only memory usage of pinned object copies to improve scaledown (#22020)
Report only memory used by primary copies of objects, since secondary copies are not evicted even if not needed on a node. Counting those secondary copies prevents downscaling until all references to a shared object are removed.
Closes https://github.com/ray-project/ray/issues/21870
2022-02-01 21:14:28 -08:00
Zyiqin-Miranda
8237c6228f
[autoscaler] Add AWS Autoscaler CloudWatch Alarm support (#21523)
These changes add a set of improvements to enable automatic creation and update of CloudWatch alarms when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:

- Set up alarms against Ray CloudWatch metrics to get notified about increased load or service outages.
- Update their CloudWatch alarm JSON configuration files during `ray up` execution time.

Notes:

- This PR is a follow-up PR for #20266, which adds CloudWatch alarm support.
2022-02-01 18:09:53 -08:00
Will Frey
429b7b9512
[Serve] Update ray.serve.deployment overloaded signature (#21743) 2022-02-01 16:20:58 -08:00
shrekris-anyscale
8d43a6bac7
[Serve] [runtime env] Replace os.rename with shutil.move in remove_dir_from_filepaths() (#22018)
Currently, the `remove_dir_from_filepaths()` function uses `os.rename()` when shifting directories and files. This change replaces [`os.rename()`](https://docs.python.org/3/library/os.html#os.rename) with [`shutil.move()`](https://docs.python.org/3/library/shutil.html#shutil.move) to support these operations even when the directory's parent and the temporary directory are located on separate file systems.
2022-02-01 14:33:53 -06:00
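The difference motivating #22018 can be shown with a short standalone example. It is a sketch that moves a directory between two temporary directories; the directory and file names are made up for illustration, and on most machines both temporary directories sit on the same file system, so the cross-file-system case is described in comments rather than reproduced.

```python
import os
import shutil
import tempfile

# Build a source directory containing one file.
src_root = tempfile.mkdtemp()
dst_root = tempfile.mkdtemp()
src = os.path.join(src_root, "payload")
os.mkdir(src)
with open(os.path.join(src, "data.txt"), "w") as f:
    f.write("hello")

dst = os.path.join(dst_root, "payload")

# os.rename(src, dst) can fail with OSError when src and dst are on
# different file systems. shutil.move uses os.rename when that works and
# otherwise copies the tree to dst and then removes src.
moved_to = shutil.move(src, dst)
```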
SangBin Cho
0d179dabcd
[Test] Fix broken lint (#22026)
Fix the broken lint in the master. Details: https://buildkite.com/ray-project/ray-builders-branch/builds/5784#3c2cc53e-cf55-46f6-ab2d-d028d88d3d54
2022-02-01 11:03:32 -08:00
Archit Kulkarni
01ee9adbe8
[Serve] [Doc] Improve model composition snippet (#21961) 2022-02-01 10:28:36 -08:00
Balaji Veeramani
7dcb0b6af6
[Train] Decorate get_device with PublicAPI (#22024)
* Decorate `get_device` with `PublicAPI`

* Add documentation

* Update api.rst
2022-02-01 08:18:47 -08:00
Kai Fricke
e508e9f75a
[tune] Support functools.partial names and treat as function in registry (#21518)
Currently, tune trainables with functools.partial will raise the following warnings:

INFO registry.py:66 -- Detected unknown callable for trainable. Converting to class.
WARNING experiment.py:295 -- No name detected on trainable. Using DEFAULT.

This PR propagates function names for functions wrapped with partial and treats them as regular functions when wrapping.
2022-02-01 12:04:24 +00:00
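The name problem behind #21518 comes from `functools.partial` objects not carrying a `__name__` attribute. A minimal sketch of recovering the wrapped function's name is shown below; the helper `trainable_name` is hypothetical and is not the registry code from the PR.

```python
import functools


def trainable_name(trainable, default="DEFAULT"):
    """Return a readable name, unwrapping functools.partial if present."""
    while isinstance(trainable, functools.partial):
        trainable = trainable.func  # the underlying callable
    return getattr(trainable, "__name__", default)


def train_fn(config, data=None):
    return config, data


class CallableWithoutName:
    def __call__(self, config):
        return config


wrapped = functools.partial(train_fn, data=[1, 2, 3])
name_from_partial = trainable_name(wrapped)
name_from_function = trainable_name(train_fn)
name_from_instance = trainable_name(CallableWithoutName())
```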
SangBin Cho
19672688b0
[Test] Change test_placement_group.py to large test (#21997)
We recently added tests to this file, and it seems to occasionally exceed 300 seconds timeout (before adding tests, it took about 260~270 seconds, so it is natural).

This promotes the test to large so that we can avoid this issue. (Let me know if you think it would be better to shard the test further.)
2022-01-31 22:37:35 -08:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

See https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details.
2022-01-31 22:34:40 -08:00
Simon Mo
e3cf47d731
[Serve] Remove shard_key, http_method, and http_headers (#21590) 2022-01-31 22:27:12 -08:00
Clark Zinzow
b3fd3c6828
[Datasets] Fix spread resource prefix tasks with no CPU requested. (#22017)
When applying the `_spread_resouce_prefix` hack, don't make the CPU resource a required resource when `num_cpus=0` is requested.
2022-01-31 18:30:47 -08:00
Clark Zinzow
00e1ac3a3c
[Datasets] Tie _DesignatedBlockOwner lifetime to context creator (#22007)
Instead of using a detached lifetime, tie the lifetime of `_DesignatedBlockOwner` to the lifetime of the context creator. Also, only create a `_DesignatedBlockOwner` if dynamic block splitting is enabled.
2022-01-31 17:06:01 -08:00
Clark Zinzow
03024b8951
[Datasets] Add .iter_batches() test for batch size larger than dataset. (#22000) 2022-01-31 14:09:48 -08:00
Kai Yang
2038cc96c6
Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" (#21894)
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the arrow block format implementation for now. They'll be implemented in a later PR.
2022-01-31 12:09:51 -08:00
Eric Liang
45e03bd497
[data] Optimize dataset metadata read/write in Ray client (#21939) 2022-01-31 01:41:45 -08:00
Eric Liang
b73a007ccd
Flag off RAY_legacy_scheduler_warnings (#21965) 2022-01-30 17:12:45 -08:00
Eric Liang
fe167c94b1
Deflake occasional deadlock in test_dataset.py::test_basic_actors[True] (#21970) 2022-01-30 17:11:54 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Eric Liang
95877be8ee
[data] Serialize parquet piece metadata in batches to reduce overheads 2022-01-29 14:30:50 -08:00
Jiajun Yao
a3ea4343b3
Remove work pipelining (#21964) 2022-01-29 11:31:45 -08:00
Junwen Yao
eb8adc6105
[train] add a utility function to turn off TF autosharding (#21887)
This PR adds a utility function to turn off TF autosharding as a temporary solution.

Closes #19324.
2022-01-28 16:09:06 -08:00
Mehul Raheja
fe1bf0261a
[autoscaler] Support cache_stopped_nodes on Azure (#21747)
* basic reuse functionality without valid node filtering

* Filtering, logging, and formatting for cache_stopped_nodes on Azure

* Updated formatter version
2022-01-28 15:20:50 -08:00
Josh
4ab83345d0
[autoscaler] Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
We regularly run tasks where we know our expected resource requirements at launch, so we call request_resources with the required number of CPUs. The machines don't scale back down as our tasks finish; they just sit idle. This costs more in AWS hosting than necessary. The suggested fix is to not call request_resources and instead set a high upscaling_speed to instantly scale up to the required resources.
2022-01-28 11:34:11 -08:00
Jialing He
6cb2dffcc0
[Bug][UT] fix python case test_object_assign_owner never run (#21945) 2022-01-28 11:08:25 -08:00
Ian Rodney
75daf87aa0
[GCP] Add roles/iam.roleViewer (#21907)
Allows bootstrap_gcp to be called from the Head Node. This is the case with Tune's DockerSyncClient.
2022-01-28 10:20:51 -08:00
chenk008
51393abc16
[Core]delete shim pid flag (#21853)
Now we have `startup-token` to identify registering worker, so the shim pid flag is not needed any more.
2022-01-28 21:33:26 +08:00
Gagandeep Singh
069c499def
Unskipped tests for Windows (#21890)
This is the third unskipping PR.
2022-01-27 23:06:44 -08:00
Dmitri Gekhtman
1fee0159b4
[test][k8s] Minor adjustment to manual K8s tests (#21924)
This PR is a minor adjustment to the K8s release tests.

- Replace tasks with actors in scale test for reduced flakiness.
- Use an up-to-date Ray client API.
2022-01-27 20:07:14 -08:00
iasoon
b0700e676b
[serve] add root_path setting (#21090)
Support hosting a serve instance under a path prefix.

Some clean-up should still be done for the different overlapping HttpOptions that now exist (host, port, root_path, root_url).
2022-01-27 16:36:22 -06:00
Sriram Sankar
b7391a1c39
[autoscaler] Optimize finding the node id (#21885)
This is a simple refactoring change and my first PR in ray-project. This change moves an if statement outside of a loop. This way the check is not repeated for each iteration.
2022-01-27 10:51:59 -08:00
Victor Yap
8be5f016af
Add NVIDIA_TESLA_A100 to accelerator types (#21558)
Adds Nvidia's A100 to the list of accelerator types. AWS offers this in the p4d.24xlarge instance type.
2022-01-27 10:47:09 -08:00
Kai Fricke
8dcd4a99ef
[tune/wandb] Use resume=False per default (#21892)
The WandbLoggingCallback is run on the driver side, with the experiment directory as the cwd. Using resume=True will pick up state from other trials (as the file name is global) and lead to warning messages. We should therefore default to resume=False when using the callback.
This PR also incorporates changes from #20966.

Co-authored by: Queimo <queimo@gmx.net>
Co-authored by: Karim <karim.ben.hicham@rwth-aachen.de>
2022-01-27 07:58:01 +00:00
Yi Cheng
e6bbafc17a
[function table] Make sure FunctionsToRun are executed properly on all workers (#21867)
This PR fixes the issue that FunctionsToRun is sometimes not executed. We isolated the Functions/Actors in the function table, but not the FunctionsToRun, so some functions could be missed during import.
2022-01-26 21:58:43 -08:00
SangBin Cho
d363c37078
[Core] Stop Ray stop from killing redis that's not started by Ray (#21805)
Currently, the `ray stop` logic is fragile: it kills a Redis server even when that server was not started by Ray. This PR fixes the issue by checking the executable name of redis-server more carefully (if the redis-server was created by Ray, its path should contain a Ray-specific path copied while wheels are built).

I originally tried to obtain the ppid and kill a redis-server only when it was created from the same parent, but it turns out all processes started by ray start have no ppid.

While the best solution is to have some "process manager" through which we can detect a redis server started by us, I think there's no need to put a lot of effort in here right now since Redis will be removed soon. We will eventually move in a better direction (process manager) to handle this sort of issue.
2022-01-26 18:12:38 -08:00
Dmitri Gekhtman
757b5a88ea
[autoscaler] Cap min and max workers for manually managed on-prem clusters. (#21710)
Closes https://github.com/ray-project/ray/issues/19636 by capping min and max workers for manually managed on-prem clusters to the number of user-specified worker ips.

See https://github.com/ray-project/ray/issues/19636#issuecomment-1016664169 for additional context.
2022-01-26 18:03:55 -08:00
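The capping rule from #21710 amounts to a small piece of arithmetic. The sketch below assumes a hypothetical helper `cap_worker_counts` and made-up IP addresses; it is not the autoscaler's actual config-processing code.

```python
def cap_worker_counts(min_workers, max_workers, worker_ips):
    """Cap min/max workers at the number of user-specified worker IPs."""
    available = len(worker_ips)
    capped_max = min(max_workers, available)
    # Keep min_workers from exceeding the (possibly lowered) maximum.
    capped_min = min(min_workers, capped_max)
    return capped_min, capped_max


two_ips = ["10.0.0.1", "10.0.0.2"]  # illustrative addresses
over_requested = cap_worker_counts(5, 10, two_ips)
within_limits = cap_worker_counts(0, 1, two_ips)
```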
Simon Mo
ac6709f0ba
[Serve] Fix uvicorn duplicate header issue (#21884) 2022-01-26 14:43:18 -08:00
xwjiang2010
80af046b54
[tune] deflake testBadParams5. (#21898)
The test is timing out during actor creation and ends up not testing the code that is only triggered after a training result is returned to the driver.
Change to use a simpler Trainable.
2022-01-26 19:38:15 +00:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note that doing this naively created circular dependencies, because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Wilson Wang
30a4761592
Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790)
This patch fixes two issues.

1. log_monitor.py can crash when GCS is temporarily unavailable. Added retry logic in gcs_pubsub.py.
2. The signal handler can raise another exception during exception handling.
2022-01-25 14:07:26 -08:00
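The kind of retry logic mentioned in #21790 can be sketched generically as below. This is an assumption-laden illustration: `call_with_retry`, its parameters, and the `flaky_connect` stand-in are hypothetical and do not reflect the actual code in gcs_pubsub.py.

```python
import time


def call_with_retry(fn, retries=3, delay_s=0.0, retry_on=(ConnectionError,)):
    """Call `fn`, retrying up to `retries` times on the given exception types."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(delay_s)


attempts = []


def flaky_connect():
    """Hypothetical stand-in for a call that fails while the service is down."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("service temporarily unavailable")
    return "connected"


result = call_with_retry(flaky_connect, retries=5)
```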