Commit graph

11103 commits

Author SHA1 Message Date
Chen Shen
bfe3e5f4a8
add check on shape (#21947) 2022-01-28 12:27:43 -08:00
Archit Kulkarni
1f58ee3731
[1.10.0 Release] Add release logs for 1.10.0 (#21908)
* Copy logs from 1.9.0

* Replace 1.9.0 data with 1.10.0 data

* update with non-smoke-test results
2022-01-28 11:59:03 -08:00
Josh
4ab83345d0
[autoscaler] Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
We regularly run tasks where we know our expected resource requirements at launch, so we call request_resources with the required number of CPUs. The machines don't scale back down as our tasks finish and just sit idle, which costs more in AWS hosting than necessary. The suggested fix is to not call request_resources and instead use a high upscaling_speed to scale up instantly to the required resources.
2022-01-28 11:34:11 -08:00
Jialing He
6cb2dffcc0
[Bug][UT] fix python case test_object_assign_owner never run (#21945) 2022-01-28 11:08:25 -08:00
Ian Rodney
75daf87aa0
[GCP] Add roles/iam.roleViewer (#21907)
Allows bootstrap_gcp to be called from the Head Node. This is the case with Tune's DockerSyncClient.
2022-01-28 10:20:51 -08:00
chenk008
51393abc16
[Core]delete shim pid flag (#21853)
We now have `startup-token` to identify the registering worker, so the shim pid flag is no longer needed.
2022-01-28 21:33:26 +08:00
Sven Mika
7fc1683bab
[RLlib] Some more bandit cleanup/tests. (#21932) 2022-01-28 12:03:26 +01:00
Chen Shen
0ff8bfacec
[resource-reporting 3/n] further clean up LocalResourceManager (#21927)
* clean up

* address comments
2022-01-28 01:50:54 -08:00
Gagandeep Singh
069c499def
Unskipped tests for Windows (#21890)
This is the third unskipping PR.
2022-01-27 23:06:44 -08:00
Dmitri Gekhtman
1fee0159b4
[test][k8s] Minor adjustment to manual K8s tests (#21924)
This PR is a minor adjustment to the K8s release tests.

- Replace tasks with actors in scale test for reduced flakiness.
- Use an up-to-date Ray client API.
2022-01-27 20:07:14 -08:00
Guyang Song
937bf6933c
[event] redefine "SetCustomFields" to "UpdateCustomFields" (#21930)
In some cases, we need to add custom fields in different code paths. `SetCustomFields` overwrites all the existing items, which leads to custom fields being lost. This PR redefines `SetCustomFields` as `UpdateCustomFields`. `UpdateCustomFields` keeps existing items and merges in new items. If a key already exists, its value is replaced.
2022-01-28 11:54:44 +08:00
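The difference between the old "set" and the new "update" semantics described in the commit above can be sketched as follows. This is a minimal illustration only: `EventContext`, `set_custom_fields`, and `update_custom_fields` are hypothetical Python names chosen for the example, not Ray's actual event API.

```python
class EventContext:
    """Hypothetical stand-in for an event context; not Ray's actual class."""

    def __init__(self):
        self.custom_fields = {}

    def set_custom_fields(self, fields):
        # "Set" semantics: replace the whole mapping, dropping earlier entries.
        self.custom_fields = dict(fields)

    def update_custom_fields(self, fields):
        # "Update" semantics: keep existing entries, merge in new ones,
        # and overwrite the value when a key already exists.
        self.custom_fields.update(fields)


ctx = EventContext()
ctx.update_custom_fields({"job_id": "01", "node": "a"})
ctx.update_custom_fields({"node": "b", "task": "t1"})
# custom_fields now holds job_id, node (overwritten), and task.
```

With "set" semantics the second call would have discarded `job_id`, which is the loss the commit message describes.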
Amog Kamsetty
bd726aab02
[Release] Disable caching for ray_lightning (#21886)
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_

Add an echo timestamp to the post build commands of the ray lightning release tests to trigger a cluster env rebuild and get the latest versions of ray lightning. Without this, the cluster env gets cached so an outdated version is installed on the cluster that is different than the one on the driver, resulting in the below failures.

Closes #21871
Closes #21863

Also reinstalls the dependencies in the post build commands so old versions are not cached in the Docker images.
2022-01-27 17:56:32 -08:00
mwtian
97f7e3d0e6
[e2e] do not terminate in serve_failure smoke test (#21925)
When the script terminates, it will also terminate its cluster including dashboard, which will prevent subsequent job submissions. Other long running e2e tests do not terminate in smoke test mode, so make `serve_failure` behave the same.
2022-01-27 15:36:46 -08:00
Clark Zinzow
09fab70991
[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937)
Fixes bug in Datasets locality-aware splitting example.
2022-01-27 14:46:04 -08:00
iasoon
b0700e676b
[serve] add root_path setting (#21090)
Support hosting a serve instance under a path prefix.

Some clean-up should still be done for the different overlapping HttpOptions that now exist (host, port, root_path, root_url).
2022-01-27 16:36:22 -06:00
mwtian
559eefd06f
[Doc] update dask version for Ray 1.11.0 (#21933)
This is needed for release 1.11.0.
2022-01-27 13:15:01 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention using `Ray Data` whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00
Sven Mika
ee41800c16
[RLlib] Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #02. (#21649) 2022-01-27 22:07:05 +01:00
Jun Gong
8ebc50f844
[RLlib] Issue 21334: Fix APPO when kl_loss is enabled. (#21855) 2022-01-27 20:08:58 +01:00
Sriram Sankar
b7391a1c39
[autoscaler] Optimize finding the node id (#21885)
This is a simple refactoring change and my first PR in ray-project. This change moves an if statement outside of a loop. This way the check is not repeated for each iteration.
2022-01-27 10:51:59 -08:00
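The refactoring pattern named in the commit above, moving a loop-invariant check out of the loop, can be sketched as follows. The function `find_node_id`, its parameters, and the node dictionary layout are all hypothetical and only illustrate the pattern; they are not the autoscaler's actual code.

```python
def find_node_id(nodes, ip, use_internal_ip):
    """Return the id of the node with the given IP, or None.

    `nodes` maps node id -> {"internal_ip": ..., "external_ip": ...}
    (a hypothetical layout used only for this sketch).
    """
    # The flag does not change between iterations, so pick the field once
    # instead of re-checking `use_internal_ip` for every node.
    field = "internal_ip" if use_internal_ip else "external_ip"
    for node_id, node in nodes.items():
        if node[field] == ip:
            return node_id
    return None
```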
Victor Yap
8be5f016af
Add NVIDIA_TESLA_A100 to accelerator types (#21558)
Adds Nvidia's A100 to the list of accelerator types. AWS offers this in the p4d.24xlarge instance type.
2022-01-27 10:47:09 -08:00
Jiajun Yao
cea80b1a5b
Don't advertise cpus on gpu nodes for pipelined ingestion tests (#21899)
* Don't advertise cpus on gpu nodes for pipelined ingestion tests
2022-01-27 09:17:01 -08:00
Sven Mika
893536ebd9
[RLlib] Move bandits into main agents folder; Make RecSim adapter more accessible; (#21773) 2022-01-27 13:58:12 +01:00
Sven Mika
371fbb17e4
[RLlib] Make policies_to_train more flexible via callable option. (#20735) 2022-01-27 12:17:34 +01:00
Kai Fricke
8dcd4a99ef
[tune/wandb] Use resume=False per default (#21892)
The WandbLoggingCallback is run on the driver side, with the experiment directory as the cwd. Using resume=True will pick up state from other trials (as the file name is global), and thus lead to warning messages. Thus, we should default to resume=False when using the callback.
This PR also incorporates changes from #20966.

Co-authored by: Queimo <queimo@gmx.net>
Co-authored by: Karim <karim.ben.hicham@rwth-aachen.de>
2022-01-27 07:58:01 +00:00
mwtian
634f897cb6
[e2e] improve output dir handling (#21906)
Try to clear the result dir before running the e2e.py script, to avoid failures where the directory already exists, or a file cannot be overwritten due to a permission issue.
2022-01-26 23:56:08 -08:00
Chen Shen
bdf9fa337d
[resource-reporting 2/n]separate local resource manager from cluster_resource_scheduler (#21772)
* add

* fix test

* fix more tests
2022-01-26 22:53:05 -08:00
Yi Cheng
7d2237bc9f
[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919) 2022-01-26 22:48:17 -08:00
Yi Cheng
e6bbafc17a
[function table] Make sure FunctionsToRun are executed properly on all workers (#21867)
This PR fixes the issue that FunctionsToRun is sometimes not executed. We isolated the Functions/Actors in the function table, but not the FunctionsToRun, so some functions were occasionally missed during import.
2022-01-26 21:58:43 -08:00
Yi Cheng
3560211ab5
[nightly] Temporarily stops the two pipelines for scheduling until with good setup. (#21922)
Right now these two tests always run out of time. We disable them for now and will re-enable them with good parameters after solid testing.
2022-01-26 20:15:59 -08:00
SangBin Cho
d363c37078
[Core] Stop Ray stop from killing redis that's not started by Ray (#21805)
Currently, the `ray stop` logic is fragile: it kills Redis servers that were not started by Ray. This PR fixes the issue by checking the executable name of redis-server more carefully (if the redis-server was created by Ray, its path should contain the Ray-specific path copied while wheels are built).

I originally tried to obtain the ppid and kill a redis-server only when it was created from the same parent, but it turns out all processes started by `ray start` have no ppid.

While the best solution is to have some "process manager" that lets us detect Redis servers started by us, I think there's no need to put much effort in here right now since Redis will be removed soon. We will eventually move in a better direction (process manager) to handle this sort of issue.
2022-01-26 18:12:38 -08:00
Dmitri Gekhtman
757b5a88ea
[autoscaler] Cap min and max workers for manually managed on-prem clusters. (#21710)
Closes https://github.com/ray-project/ray/issues/19636 by capping min and max workers for manually managed on-prem clusters to the number of user-specified worker ips.

See https://github.com/ray-project/ray/issues/19636#issuecomment-1016664169 for additional context.
2022-01-26 18:03:55 -08:00
Max Pumperla
b34099e764
[docs] landing page (fixes #21750) (#21859)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-26 17:14:25 -08:00
Simon Mo
ac6709f0ba
[Serve] Fix uvicorn duplicate header issue (#21884) 2022-01-26 14:43:18 -08:00
Kai Fricke
3b73a62dad
[ci/release] Increase long running timeout, fix artifacts copy (#21905)
With the new job-based file copy, fetching results takes longer. We thus have to increase the long running update test check times in order not to run into bogus release test failures.
Also fixes artifact uploading issues.
2022-01-26 21:25:03 +00:00
Jiajun Yao
f4e8784890
Remove work stealing (#21878)
This feature is never used so this PR removes it to make the codebase simpler.
Pipelining task submission is still there and will be removed separately.
2022-01-26 13:16:21 -08:00
Clark Zinzow
411bb308dc
[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889) 2022-01-26 12:05:27 -08:00
xwjiang2010
80af046b54
[tune] deflake testBadParams5. (#21898)
The test times out during actor creation and ends up not testing the code, which is only triggered after a training result is returned to the driver.
This PR changes it to use a simpler Trainable.
2022-01-26 19:38:15 +00:00
Archit Kulkarni
11e2a07752
[release] Fix broken pip_download_test.sh script for non-M1 Macs (#21542)
Fixes a typo that caused the script to exit early without running any sanity checks when not using an M1 Mac.
2022-01-26 10:38:52 -08:00
Jun Gong
099c170ab4
[RLlib] Dataset Reader/Writer for RLlib (#21808) 2022-01-26 16:00:46 +01:00
Jun Gong
55f3bcfb2d
[RLlib] Add a logstd term to MARWIL's loss func to encourage exploration. (#21493) 2022-01-26 16:00:17 +01:00
mwtian
1674a17e6f
[e2e] use alternative copy tree function to tolerate output directory that already exists (#21869)
Many release tests have error messages when copying results with `shutil.copytree()`. e.g.
https://buildkite.com/ray-project/periodic-ci/builds/2511#131c0d22-61a3-4dcf-b80a-de37b68ec591/139-450

This PR tries to make the copying process tolerate existing destination directory. There is logic to remove the destination directory, but I'm not sure why it failed.

This error should not be failing the tests though.
2022-01-26 05:10:22 -08:00
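One way to make a tree copy tolerate an existing destination, as the commit above aims to, is sketched below. The commit does not say which alternative function it uses; this sketch assumes Python 3.8+ and uses the standard `dirs_exist_ok` flag of `shutil.copytree`. The helper name `copy_results` is hypothetical.

```python
import shutil


def copy_results(src_dir, dst_dir):
    """Copy src_dir into dst_dir, tolerating an existing destination.

    Sketch only: relies on `dirs_exist_ok`, available in Python 3.8+.
    Files already present under dst_dir with the same name are overwritten.
    """
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
```

Without `dirs_exist_ok=True`, `shutil.copytree()` raises `FileExistsError` when the destination directory already exists.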
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Lingxuan Zuo
0c33ff718d
Remove generated streaming pb and pom files. (#21851)
Some auto-generated streaming files were never removed. This PR removes them entirely.

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-26 10:05:23 +08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note that doing this naively created circular dependencies, because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Wilson Wang
30a4761592
Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790)
This patch fixes two issues.

1. log_monitor.py can crash when GCS is temporarily unavailable. Added retry logic in gcs_pubsub.py.
2. The signal handler can raise another exception during exception handling.
2022-01-25 14:07:26 -08:00
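The kind of retry logic mentioned in the first item of the commit above can be sketched generically as follows. This is not the code added to gcs_pubsub.py; `call_with_retry` and its parameters are hypothetical, and treating `ConnectionError` as the retryable error is an assumption made for the example.

```python
import time


def call_with_retry(fn, retries=3, delay_s=1.0, exceptions=(ConnectionError,)):
    """Call fn(), retrying on the given exceptions (hypothetical helper).

    Makes up to `retries` attempts and re-raises the last error if all fail.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except exceptions as e:
            last_error = e
            # Sleep between attempts, but not after the final one.
            if attempt < retries - 1:
                time.sleep(delay_s)
    raise last_error
```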
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
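The replacement described in the commit above can be sketched as follows. `tempfile.mkstemp` creates the file and returns an open file descriptor together with the path, which avoids the race inherent in `tempfile.mktemp` (which only returns a name). The helper `write_temp_file` is a hypothetical name for this example.

```python
import os
import tempfile


def write_temp_file(data: bytes) -> str:
    """Write data to a new temporary file and return its path.

    mkstemp creates the file atomically and hands back an open descriptor;
    the caller is responsible for deleting the file afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=".tmp")
    # Wrap the descriptor so it is closed when the block exits.
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```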
shrekris-anyscale
e4370720cc
[Serve] Add "Serve" team tag to untagged release tests (#21861) 2022-01-25 11:46:03 -08:00
Dhruv Nair
3d79815cd0
Comet Integration (#20766)
This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/).

Co-authored-by: Michael Cullan <mjcullan@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-25 11:42:00 -08:00
Clark Zinzow
1971a08b7d
[RFC] [Core] Support disabling log redirection via RAY_LOG_TO_STDERR environment variable. (#21767) 2022-01-25 10:52:53 -08:00