Commit graph

11229 commits

Author SHA1 Message Date
Archit Kulkarni
01ee9adbe8
[Serve] [Doc] Improve model composition snippet (#21961) 2022-02-01 10:28:36 -08:00
Balaji Veeramani
7dcb0b6af6
[Train] Decorate get_device with PublicAPI (#22024)
* Decorate `get_device` with `PublicAPI`

* Add documentation

* Update api.rst
2022-02-01 08:18:47 -08:00
Kai Fricke
b51b5afaea
[ci/gpu] Move ML dependency install to Dockerfile (#21711)
Instead of installing dependencies in each Buildkite job, let's move this to the Dockerfile instead.
This will update GPU tests to always use Python 3.7.
2022-02-01 12:04:55 +00:00
Kai Fricke
e508e9f75a
[tune] Support functools.partial names and treat as function in registry (#21518)
Currently, tune trainables with functools.partial will raise the following warnings:

INFO registry.py:66 -- Detected unknown callable for trainable. Converting to class.
WARNING experiment.py:295 -- No name detected on trainable. Using DEFAULT.

This PR propagates function names for functions wrapped with partial and treats them as regular functions when wrapping.
2022-02-01 12:04:24 +00:00
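To illustrate the name propagation described in the commit above, here is a minimal sketch. `callable_name` is a hypothetical helper written for this example and is not Tune's actual registry code; the `"DEFAULT"` fallback simply mirrors the warning text quoted in the commit message.

```python
import functools


def callable_name(fn, default="DEFAULT"):
    """Return a readable name for a callable, unwrapping functools.partial.

    Hypothetical helper for illustration only; not Tune's registry code.
    """
    # A partial object has no __name__ of its own; the wrapped function does.
    while isinstance(fn, functools.partial):
        fn = fn.func
    return getattr(fn, "__name__", default)


def train(config, data=None):
    """Stand-in trainable function."""
    return config, data


wrapped = functools.partial(train, data=[1, 2, 3])
print(callable_name(wrapped))
```

Unwrapping through `.func` is what lets a partial-wrapped trainable be named after, and treated like, the plain function it wraps.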
SangBin Cho
19672688b0
[Test] Change test_placement_group.py to large test (#21997)
We recently added tests to this file, and it occasionally exceeds the 300-second timeout. Before the new tests were added it already took about 260~270 seconds, so this is expected.

This promotes the test to a large test so that we can avoid this issue. (Let me know if you think sharding the test further would be better.)
2022-01-31 22:37:35 -08:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

See https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details.
2022-01-31 22:34:40 -08:00
SangBin Cho
fd20cf3239
[Nightly Test] Add more metadata to test result (#21990)
Add columns for error code, commit URL, stable, session URL, and runtime.
2022-01-31 22:33:30 -08:00
Simon Mo
e3cf47d731
[Serve] Remove shard_key, http_method, and http_headers (#21590) 2022-01-31 22:27:12 -08:00
Chen Shen
4b528a7255
[resource-reporting 4/n] Separate cluster resource manager from cluster resource scheduler (#21992)
As discussed, we need to separate the cluster resource management logic from scheduling logic. In this PR, we create the cluster_resource_manager to handle the resource management; and the cluster resource scheduler is only responsible for scheduling.

* more clean up

* refactor

* address comments
2022-01-31 21:16:58 -08:00
Clark Zinzow
b3fd3c6828
[Datasets] Fix spread resource prefix tasks with no CPU requested. (#22017)
When applying the `_spread_resource_prefix` hack, don't make the CPU resource a required resource when `num_cpus=0` is requested.
2022-01-31 18:30:47 -08:00
Clark Zinzow
00e1ac3a3c
[Datasets] Tie _DesignatedBlockOwner lifetime to context creator (#22007)
Instead of using a detached lifetime, tie the lifetime of `_DesignatedBlockOwner` to the lifetime of the context creator. Also, only create a `_DesignatedBlockOwner` if dynamic block splitting is enabled.
2022-01-31 17:06:01 -08:00
SangBin Cho
2db71f72cc
[Doc] Remove the legacy doc (#21996) 2022-01-31 15:26:19 -08:00
Clark Zinzow
03024b8951
[Datasets] Add .iter_batches() test for batch size larger than dataset. (#22000) 2022-01-31 14:09:48 -08:00
Yi Cheng
0659d4a472
[nightly] Limit many drivers iteration to 4000 iterations (#21958)
Due to faster running of many drivers, we limit the iteration to 4k for the test.
2022-01-31 13:26:02 -08:00
Kai Yang
2038cc96c6
Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" (#21894)
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the arrow block format implementation for now. They'll be implemented in a later PR.
2022-01-31 12:09:51 -08:00
Eric Liang
45e03bd497
[data] Optimize dataset metadata read/write in Ray client (#21939) 2022-01-31 01:41:45 -08:00
Eric Liang
b73a007ccd
Flag off RAY_legacy_scheduler_warnings (#21965) 2022-01-30 17:12:45 -08:00
Eric Liang
fe167c94b1
Deflake occasional deadlock in test_dataset.py::test_basic_actors[True] (#21970) 2022-01-30 17:11:54 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Eric Liang
95877be8ee
[data] Serialize parquet piece metadata in batches to reduce overheads 2022-01-29 14:30:50 -08:00
DK.Pino
91171a194f
[Core] Extract a common method to get predefined resource index #21895 2022-01-29 14:18:09 -08:00
Jiajun Yao
a3ea4343b3
Remove work pipelining (#21964) 2022-01-29 11:31:45 -08:00
Chen Shen
2939f153a1
address remaining comments (#21960) 2022-01-28 18:09:45 -08:00
Junwen Yao
eb8adc6105
[train] add a utility function to turn off TF autosharding (#21887)
This PR adds a utility function to turn off TF autosharding as a temporary solution.

Closes #19324.
2022-01-28 16:09:06 -08:00
Mehul Raheja
fe1bf0261a
[autoscaler] Support cache_stopped_nodes on Azure (#21747)
* basic reuse functionality without valid node filtering

* Filtering, logging, and formatting for cache_stopped_nodes on Azure

* Updated formatter version
2022-01-28 15:20:50 -08:00
Yi Cheng
570f67798a
[nightly] Move scheduling tests into one suite (#21959)
For future convenience, we are moving scheduling-related tests into one suite for easier monitoring and benchmarking.
2022-01-28 13:32:34 -08:00
Chen Shen
bfe3e5f4a8
add check on shape (#21947) 2022-01-28 12:27:43 -08:00
Archit Kulkarni
1f58ee3731
[1.10.0 Release] Add release logs for 1.10.0 (#21908)
* Copy logs from 1.9.0

* Replace 1.9.0 data with 1.10.0 data

* update with non-smoke-test results
2022-01-28 11:59:03 -08:00
Josh
4ab83345d0
[autoscaler] Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
We regularly run tasks where we know our expected resource requirements at launch, so we call request_resources with the required number of CPUs. The machines don't scale back down as our tasks finish; they just sit idle. This costs more in AWS hosting than necessary. The suggested fix is to not call request_resources and instead use a high upscaling_speed to scale up to the required resources instantly.
2022-01-28 11:34:11 -08:00
Jialing He
6cb2dffcc0
[Bug][UT] fix python case test_object_assign_owner never run (#21945) 2022-01-28 11:08:25 -08:00
Ian Rodney
75daf87aa0
[GCP] Add roles/iam.roleViewer (#21907)
Allows bootstrap_gcp to be called from the Head Node. This is the case with Tune's DockerSyncClient.
2022-01-28 10:20:51 -08:00
chenk008
51393abc16
[Core]delete shim pid flag (#21853)
Now that we have `startup-token` to identify the registering worker, the shim pid flag is no longer needed.
2022-01-28 21:33:26 +08:00
Sven Mika
7fc1683bab
[RLlib] Some more bandit cleanup/tests. (#21932) 2022-01-28 12:03:26 +01:00
Chen Shen
0ff8bfacec
[resource-reporting 3/n] further clean up LocalResourceManager (#21927)
* clean up

* address comments
2022-01-28 01:50:54 -08:00
Gagandeep Singh
069c499def
Unskipped tests for Windows (#21890)
This is the third unskipping PR.
2022-01-27 23:06:44 -08:00
Dmitri Gekhtman
1fee0159b4
[test][k8s] Minor adjustment to manual K8s tests (#21924)
This PR is a minor adjustment to the K8s release tests.

* Replace tasks with actors in the scale test for reduced flakiness.
* Use an up-to-date Ray client API.
2022-01-27 20:07:14 -08:00
Guyang Song
937bf6933c
[event] redefine "SetCustomFields" to "UpdateCustomFields" (#21930)
In some cases, we need to add custom fields in different code paths. `SetCustomFields` overwrites all existing items, which causes custom fields to be lost. This PR redefines `SetCustomFields` as `UpdateCustomFields`. `UpdateCustomFields` keeps existing items and merges in new ones; if a key already exists, its value is replaced.
2022-01-28 11:54:44 +08:00
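The difference between the two semantics in the commit above can be sketched as follows. This is a Python illustration using plain dictionaries; the actual change is in Ray's event code, and the function names below are hypothetical stand-ins, not that API.

```python
def set_custom_fields(existing, new):
    """Old semantics (sketch): the new items replace everything already set."""
    return dict(new)


def update_custom_fields(existing, new):
    """New semantics (sketch): keep existing items, merge in new ones,
    and replace the value when a key already exists."""
    merged = dict(existing)
    merged.update(new)
    return merged


fields = {"job_id": "01", "node": "a"}
extra = {"node": "b", "task": "t1"}

print(set_custom_fields(fields, extra))
print(update_custom_fields(fields, extra))
```

With the set semantics, `job_id` is lost as soon as a second code path adds its own fields; with the update semantics it survives, and only the clashing `node` key is overwritten.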
Amog Kamsetty
bd726aab02
[Release] Disable caching for ray_lightning (#21886)
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_

Add an echo timestamp to the post build commands of the ray lightning release tests to trigger a cluster env rebuild and get the latest versions of ray lightning. Without this, the cluster env gets cached so an outdated version is installed on the cluster that is different than the one on the driver, resulting in the below failures.

Closes #21871
Closes #21863

Also reinstalls the dependencies in the post build commands so old versions are not cached in the Docker images
2022-01-27 17:56:32 -08:00
mwtian
97f7e3d0e6
[e2e] do not terminate in serve_failure smoke test (#21925)
When the script terminates, it will also terminate its cluster including dashboard, which will prevent subsequent job submissions. Other long running e2e tests do not terminate in smoke test mode, so make `serve_failure` behave the same.
2022-01-27 15:36:46 -08:00
Clark Zinzow
09fab70991
[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937)
Fixes bug in Datasets locality-aware splitting example.
2022-01-27 14:46:04 -08:00
iasoon
b0700e676b
[serve] add root_path setting (#21090)
Support hosting a serve instance under a path prefix.

Some clean-up should still be done for the different overlapping HttpOptions that now exist (host, port, root_path, root_url).
2022-01-27 16:36:22 -06:00
mwtian
559eefd06f
[Doc] update dask version for Ray 1.11.0 (#21933)
This is needed for release 1.11.0.
2022-01-27 13:15:01 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention, `Ray Data`, whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00
Sven Mika
ee41800c16
[RLlib] Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #02. (#21649) 2022-01-27 22:07:05 +01:00
Jun Gong
8ebc50f844
[RLlib] Issue 21334: Fix APPO when kl_loss is enabled. (#21855) 2022-01-27 20:08:58 +01:00
Sriram Sankar
b7391a1c39
[autoscaler] Optimize finding the node id (#21885)
This is a simple refactoring change and my first PR in ray-project. This change moves an if statement outside of a loop. This way the check is not repeated for each iteration.
2022-01-27 10:51:59 -08:00
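The kind of refactoring the commit above describes, moving a loop-invariant check out of a loop, might look like the sketch below. The function names, the node dictionaries, and the `use_internal_ip` flag are all assumptions made for illustration; this is not the autoscaler's actual code.

```python
def find_node_id_before(nodes, target_ip, use_internal_ip):
    """Check repeated on every iteration (illustrative 'before' shape)."""
    for node in nodes:
        if use_internal_ip:  # same answer each time around the loop
            ip = node["internal_ip"]
        else:
            ip = node["external_ip"]
        if ip == target_ip:
            return node["id"]
    return None


def find_node_id_after(nodes, target_ip, use_internal_ip):
    """Check hoisted out of the loop (illustrative 'after' shape)."""
    key = "internal_ip" if use_internal_ip else "external_ip"
    for node in nodes:
        if node[key] == target_ip:
            return node["id"]
    return None


nodes = [
    {"id": "n1", "internal_ip": "10.0.0.1", "external_ip": "1.2.3.4"},
    {"id": "n2", "internal_ip": "10.0.0.2", "external_ip": "5.6.7.8"},
]
print(find_node_id_after(nodes, "10.0.0.2", use_internal_ip=True))
```

Both versions return the same result; the second simply decides once which field to compare instead of deciding again for every node.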
Victor Yap
8be5f016af
Add NVIDIA_TESLA_A100 to accelerator types (#21558)
Adds Nvidia's A100 to the list of accelerator types. AWS offers this in the p4d.24xlarge instance type.
2022-01-27 10:47:09 -08:00
Jiajun Yao
cea80b1a5b
Don't advertise cpus on gpu nodes for pipelined ingestion tests (#21899)
* Don't advertise cpus on gpu nodes for pipelined ingestion tests

* Don't advertise cpus on gpu nodes for pipelined ingestion tests

* Don't advertise cpus on gpu nodes for pipelined ingestion tests
2022-01-27 09:17:01 -08:00
Sven Mika
893536ebd9
[RLlib] Move bandits into main agents folder; Make RecSim adapter more accessible; (#21773) 2022-01-27 13:58:12 +01:00
Sven Mika
371fbb17e4
[RLlib] Make policies_to_train more flexible via callable option. (#20735) 2022-01-27 12:17:34 +01:00
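As a rough sketch of what a "callable option" for `policies_to_train` could mean, the helper below accepts either a collection of policy IDs or a function that decides per policy. `should_train` and the `(policy_id, batch)` callable signature are assumptions for illustration, not RLlib's confirmed implementation.

```python
def should_train(policies_to_train, policy_id, batch=None):
    """Decide whether a policy is trained (hypothetical helper).

    policies_to_train may be None (train everything), a collection of
    policy IDs, or a callable taking (policy_id, batch) and returning bool.
    """
    if policies_to_train is None:
        return True
    if callable(policies_to_train):
        return bool(policies_to_train(policy_id, batch))
    return policy_id in policies_to_train


# Fixed list: only the listed policy is trained.
print(should_train(["main"], "main"), should_train(["main"], "opponent"))

# Callable: train any policy whose ID starts with "main".
print(should_train(lambda pid, batch: pid.startswith("main"), "main_v2"))
```

A callable makes it possible to express rules that a fixed list cannot, such as matching policies that are added while training is running.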