11080 commits

Author SHA1 Message Date
Sven Mika
371fbb17e4
[RLlib] Make policies_to_train more flexible via callable option. (#20735) 2022-01-27 12:17:34 +01:00
Kai Fricke
8dcd4a99ef
[tune/wandb] Use resume=False per default (#21892)
The WandbLoggingCallback runs on the driver side, with the experiment directory as the cwd. Using resume=True picks up state from other trials (as the file name is global) and thus leads to warning messages. We should therefore default to resume=False when using the callback.
This PR also incorporates changes from #20966.

Co-authored by: Queimo <queimo@gmx.net>
Co-authored by: Karim <karim.ben.hicham@rwth-aachen.de>
2022-01-27 07:58:01 +00:00
mwtian
634f897cb6
[e2e] improve output dir handling (#21906)
Try to clear the result dir before running the e2e.py script, to avoid failures where the directory already exists or a file cannot be overwritten due to a permission issue.
2022-01-26 23:56:08 -08:00
Chen Shen
bdf9fa337d
[resource-reporting 2/n] separate local resource manager from cluster_resource_scheduler (#21772)
* add

* fix test

* fix more tests
2022-01-26 22:53:05 -08:00
Yi Cheng
7d2237bc9f
[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919) 2022-01-26 22:48:17 -08:00
Yi Cheng
e6bbafc17a
[function table] Make sure FunctionsToRun are executed properly on all workers (#21867)
This PR fixes an issue where FunctionsToRun is sometimes not executed. We isolated the Functions/Actors in the function table, but not the FunctionsToRun, so some functions are sometimes missed during import.
2022-01-26 21:58:43 -08:00
Yi Cheng
3560211ab5
[nightly] Temporarily stop the two scheduling pipelines until they have a good setup. (#21922)
Right now these two tests always time out. We disable them for now, and after solid testing we'll re-enable them with good parameters.
2022-01-26 20:15:59 -08:00
SangBin Cho
d363c37078
[Core] Stop Ray stop from killing redis that's not started by Ray (#21805)
Currently, the `ray stop` logic is fragile: it can kill a Redis server that was not started by Ray. This PR fixes the issue by checking the executable name of redis-server more carefully (if it is a redis-server created by Ray, it should contain a Ray-specific path copied while the wheels are built).

I originally tried to obtain the ppid and kill a redis-server only when it was created from the same parent, but it turns out that all processes started by `ray start` have no ppid.

While the best solution is to have some "process manager" through which we can detect a Redis server started by us, I think there's no need to put a lot of effort in here right now since Redis will be removed soon. We will eventually move in a better direction (a process manager) to handle this sort of issue.
2022-01-26 18:12:38 -08:00
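The executable-name check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `is_ray_redis` and the path fragment in `RAY_PATH_MARKER` are assumptions made for the example.

```python
from typing import Sequence

# Assumed marker: a path fragment expected to appear only in the
# redis-server binary that ships inside the Ray wheels (hypothetical value).
RAY_PATH_MARKER = "ray/core/src/ray/thirdparty/redis"


def is_ray_redis(cmdline: Sequence[str], marker: str = RAY_PATH_MARKER) -> bool:
    """Return True only if the process looks like a redis-server bundled with Ray."""
    if not cmdline:
        return False
    executable = cmdline[0]
    # Both conditions must hold: it is a redis-server, and it lives under
    # a Ray-specific path rather than, say, /usr/bin.
    return "redis-server" in executable and marker in executable
```

A caller would only terminate a process when `is_ray_redis(...)` returns True, so a system-wide `/usr/bin/redis-server` is left alone.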
Dmitri Gekhtman
757b5a88ea
[autoscaler] Cap min and max workers for manually managed on-prem clusters. (#21710)
Closes https://github.com/ray-project/ray/issues/19636 by capping min and max workers for manually managed on-prem clusters to the number of user-specified worker ips.

See https://github.com/ray-project/ray/issues/19636#issuecomment-1016664169 for additional context.
2022-01-26 18:03:55 -08:00
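As a rough sketch of the capping rule in the commit above, assuming the cluster config exposes a list of worker IPs (the function and parameter names here are hypothetical, not the autoscaler's own):

```python
from typing import Sequence, Tuple


def cap_workers(
    min_workers: int, max_workers: int, worker_ips: Sequence[str]
) -> Tuple[int, int]:
    """Cap min/max workers to the number of user-specified worker IPs."""
    limit = len(worker_ips)
    capped_max = min(max_workers, limit)
    # min_workers can never exceed the (capped) maximum.
    capped_min = min(min_workers, capped_max)
    return capped_min, capped_max
```

For example, with three worker IPs a request for `min_workers=2, max_workers=10` becomes `(2, 3)`.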
Max Pumperla
b34099e764
[docs] landing page (fixes #21750) (#21859)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-26 17:14:25 -08:00
Simon Mo
ac6709f0ba
[Serve] Fix uvicorn duplicate header issue (#21884) 2022-01-26 14:43:18 -08:00
Kai Fricke
3b73a62dad
[ci/release] Increase long running timeout, fix artifacts copy (#21905)
With the new job-based file copy, fetching results takes longer. We thus have to increase the long running update test check times in order not to run into bogus release test failures.
Also fixes artifact uploading issues.
2022-01-26 21:25:03 +00:00
Jiajun Yao
f4e8784890
Remove work stealing (#21878)
This feature is never used so this PR removes it to make the codebase simpler.
Pipelining task submission is still there and will be removed separately.
2022-01-26 13:16:21 -08:00
Clark Zinzow
411bb308dc
[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889) 2022-01-26 12:05:27 -08:00
xwjiang2010
80af046b54
[tune] deflake testBadParams5. (#21898)
The test times out during actor creation and ends up not testing the code, which is only triggered after a training result is returned to the driver.
Change to use a simpler Trainable.
2022-01-26 19:38:15 +00:00
Archit Kulkarni
11e2a07752
[release] Fix broken pip_download_test.sh script for non-M1 Macs (#21542)
Fixes a typo that caused the script to exit early without running any sanity checks when not using an M1 Mac.
2022-01-26 10:38:52 -08:00
Jun Gong
099c170ab4
[RLlib] Dataset Reader/Writer for RLlib (#21808) 2022-01-26 16:00:46 +01:00
Jun Gong
55f3bcfb2d
[RLlib] Add a logstd term to MARWIL's loss func to encourage exploration. (#21493) 2022-01-26 16:00:17 +01:00
mwtian
1674a17e6f
[e2e] use alternative copy tree function to tolerate output directory that already exists (#21869)
Many release tests have error messages when copying results with `shutil.copytree()`. e.g.
https://buildkite.com/ray-project/periodic-ci/builds/2511#131c0d22-61a3-4dcf-b80a-de37b68ec591/139-450

This PR tries to make the copying process tolerate an existing destination directory. There is logic to remove the destination directory, but I'm not sure why it failed.

This error should not be failing the tests though.
2022-01-26 05:10:22 -08:00
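One way to make a tree copy tolerate an existing destination, as the commit above describes, is the `dirs_exist_ok` flag of `shutil.copytree` (available in Python 3.8+). This is a sketch of the idea only; the PR itself may use a different copy function, and `copy_tree_tolerant` is a name made up for the example.

```python
import shutil
from pathlib import Path


def copy_tree_tolerant(src: str, dst: str) -> None:
    """Copy src into dst, merging into dst if it already exists."""
    # Without dirs_exist_ok=True, copytree raises FileExistsError
    # when the destination directory is already present.
    shutil.copytree(Path(src), Path(dst), dirs_exist_ok=True)
```

Files already in the destination are kept unless a file of the same relative path exists in the source, in which case it is overwritten.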
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with a minimal Ray installation.

Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things very easy. Please see below for the reasoning.
2022-01-26 04:03:54 -08:00
Lingxuan Zuo
0c33ff718d
Remove generated streaming pb and pom files. (#21851)
There are some auto-generated streaming files that were not removed. This PR removes them entirely.

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-26 10:05:23 +08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note that naively doing this created circular dependencies, because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Wilson Wang
30a4761592
Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790)
This patch fixes two issues.

1. log_monitor.py can crash when GCS is temporarily unavailable. Added retry logic in gcs_pubsub.py.
2. The signal handler can raise another exception during exception handling.
2022-01-25 14:07:26 -08:00
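The retry behaviour mentioned in item 1 of the commit above could look something like the generic sketch below. It does not reproduce the logic in gcs_pubsub.py; the helper name, the retry count, and the choice of `ConnectionError` as the retryable exception are all assumptions for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 5,
    delay_s: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Call fn, retrying on transient errors; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay_s)
    raise ValueError("retries must be at least 1")
```

Wrapping a connection call this way lets a monitor survive a short outage instead of crashing on the first failed request.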
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
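For context on the commit above: `tempfile.mktemp` only returns a name, so another process can create the file before the caller does, whereas `tempfile.mkstemp` creates the file atomically and returns an open descriptor. A minimal sketch of the safer pattern (the helper name `write_temp_file` is made up for this example):

```python
import os
import tempfile


def write_temp_file(data: bytes, suffix: str = "") -> str:
    """Create a temp file securely, write data to it, and return its path."""
    # mkstemp creates the file and returns (file descriptor, absolute path).
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Wrap the descriptor so it is closed once writing finishes.
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```

The caller is responsible for deleting the file afterwards, for example with `os.remove(path)`.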
shrekris-anyscale
e4370720cc
[Serve] Add "Serve" team tag to untagged release tests (#21861) 2022-01-25 11:46:03 -08:00
Dhruv Nair
3d79815cd0
Comet Integration (#20766)
This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/).

Co-authored-by: Michael Cullan <mjcullan@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-25 11:42:00 -08:00
Clark Zinzow
1971a08b7d
[RFC] [Core] Support disabling log redirection via RAY_LOG_TO_STDERR environment variable. (#21767) 2022-01-25 10:52:53 -08:00
Gagandeep Singh
395297a9bd
Unskip tests for Windows in test_output (#21775) 2022-01-25 09:25:01 -08:00
Matti Picus
d3d1e8559c
enable passing metric tests on windows (#21755)
Resubmitting #21705, which was merged and then reverted. It seems the Sphinx build somehow broke in the meantime; it is not clear how that is connected to this PR.

Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
2022-01-25 09:20:16 -08:00
Sven Mika
d5bfb7b7da
[RLlib] Preparatory PR for multi-agent multi-GPU learner (alpha-star style) #03 (#21652) 2022-01-25 14:16:58 +01:00
SangBin Cho
b2cd123522
[Runtime Env] Suppress the log messages when RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0 (#21806)
There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. Basically, if users specify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, this will disable runtime env logs.

Note that in the log monitor RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1 by default. This is temporary, and I'd like to make this 0 by default after improving runtime error failure messages. 

Once we disable log msgs by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`
2022-01-25 00:42:52 -08:00
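A toggle like the one in the commit above is typically read with a default of "enabled". The sketch below shows that pattern; only the variable name and its default of 1 come from the commit message, while the function name and the exact parsing are assumptions rather than Ray's actual implementation.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED"


def runtime_env_logs_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when the variable is explicitly set to "0"."""
    environ = os.environ if environ is None else environ
    # Default to "1" (enabled), matching the default stated in the commit message.
    return environ.get(ENV_VAR, "1") != "0"
```

Passing a mapping explicitly makes the helper easy to test without mutating the real process environment.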
Gagandeep Singh
290f3172ad
Unskipped tests for Windows in test_client.py (#21824)
All the tests in `test_client.py` pass on Windows without issues, so unskipping them here.
2022-01-24 22:51:54 -08:00
Lixin Wei
bc55a958c4
[Core] Support UTF-8 Actor Creation Exceptions (#21807)
Currently, if an actor throws an exception containing non-ASCII characters, the actor does not die and remains alive.

This is because the following exception occurred during handling the user's exception:
```
  File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
  File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
    worker.core_worker.push_error(job_id, error_type, message, time.time())
  File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```

This PR fixes this issue.
2022-01-24 20:27:43 -08:00
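The `UnicodeEncodeError` in the traceback above is what Python raises when a string with non-ASCII characters is encoded with the `ascii` codec. The snippet below reproduces that failure mode in isolation and shows that UTF-8 encoding handles the same message; it is an illustration only, and the helper names are not taken from the Ray codebase.

```python
def ascii_encoding_fails(message: str) -> bool:
    """Return True if encoding the message as ASCII raises UnicodeEncodeError."""
    try:
        message.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def encode_error_message(message: str) -> bytes:
    """Encode an error message as UTF-8, which covers all Unicode text."""
    return message.encode("utf-8")
```

A message such as `"错误: boom"` fails the ASCII path but round-trips cleanly through `encode_error_message(...).decode("utf-8")`.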
Guyang Song
089f49f554
[doc] fix doc of container-based runtime env (#21815) 2022-01-25 12:23:15 +08:00
isaac-vidas
236fe58259
[Doc] Update requests calls to ray job submission api (#21802) 2022-01-24 17:44:31 -08:00
Max Pumperla
7953c9ca57
[docs] integrate algolia docsearch, move to sphinx panels (#21814) 2022-01-24 17:00:41 -08:00
Andrew A. Naguib
f026376556
[Tune] PTL replace deprecated running_sanity_check with sanity_checking (#21831)
`running_sanity_check` was deprecated and removed in https://github.com/PyTorchLightning/pytorch-lightning/pull/9209 in favor of `sanity_checking`
2022-01-24 16:14:05 -08:00
Siyuan (Ryans) Zhuang
99b287d236
[workflow] Fix workflow recovery issue due to a bug of dynamic output (#21571)
* Fix workflow recovery issue due to a bug of dynamic output

* add tests
2022-01-24 15:34:57 -08:00
DK.Pino
c2199a50e3
[Placement Group] Fix remove pg flaky when worker startup slow (#20474)
Currently, when we destroy a created placement group, we kill all workers related to it. However, we only kill the workers that are running at that time. If a worker starts up very slowly and its placement group is destroyed before the worker finishes starting, that worker is leaked.
2022-01-24 15:30:04 -08:00
SangBin Cho
7d4287a6ab
[Test] Move long running tests to run everyday (#21813)
Long running tests are cheap and low overhead (they use a small number of nodes). We should promote them to run every day so we can catch regressions quickly.
2022-01-24 15:10:27 -08:00
SangBin Cho
ac5f38d7fd
[Test] Fix dask on ray test on K8s (#21816)
Fix the dask on ray large-scale test on K8s. Basically, chmod requires root access, which we don't have by default in the k8s cluster. I don't think we need chmod (I verified the test passes without it).
2022-01-24 15:09:22 -08:00
mwtian
a10d05ce27
[Bootstrap] fix log format (#21826) 2022-01-24 15:06:41 -08:00
Yi Cheng
57afb2f75a
[gcs/ha] Skip raydp test when it's gcs bootstrap mode (#21771)
RayDP needs to be updated to work with redisless ray.
To be more specific, this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)) needs to be updated to use `node.address`.

We should update this after the release with the feature being turned on by default.
2022-01-24 14:43:31 -08:00
shrekris-anyscale
03d93ba7ee
Add a new End-to-End tutorial in Serve that walks users through deploying a model (#20765)
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) walking users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial walking users through deploying an entire Hugging Face model using Serve, providing a better understanding of how to deploy an actual model via Serve.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2022-01-24 16:36:04 -06:00
Sven Mika
c288b97e5f
[RLlib] Issue 21629: Video recorder env wrapper not working. Added test case. (#21670) 2022-01-24 19:38:21 +01:00
SangBin Cho
2010f13175
Fix dashboard test bug (#21742)
Currently `wait_until_succeeded_without_exception` is used in the dashboard tests, and it returns True/False. Unfortunately, a lot of code doesn't assert on this method's return value (which means things are not actually tested).
2022-01-24 11:38:51 -06:00
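The pitfall described in the commit above can be shown with a simplified stand-in for the helper. The `wait_until_succeeded` function below is a hypothetical reduction written for this example, not the real `wait_until_succeeded_without_exception` from Ray's test utilities.

```python
from typing import Callable


def wait_until_succeeded(func: Callable[[], None], retries: int = 3) -> bool:
    """Hypothetical stand-in: return True if func eventually runs without raising."""
    for _ in range(retries):
        try:
            func()
            return True
        except Exception:
            continue
    return False


def always_fails() -> None:
    raise RuntimeError("endpoint not ready")


# Bug pattern: the return value is discarded, so the test passes even
# though the check never succeeded.
wait_until_succeeded(always_fails)

# Fixed pattern: assert on the result so a failure actually fails the test.
assert wait_until_succeeded(lambda: None)
```

Because the helper swallows exceptions by design, the boolean result is the only signal of failure, which is why every call site needs an `assert`.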
Antoni Baum
850eb88cde
[tune] Fix analysis without registered trainable (#21475)
This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.
2022-01-24 08:27:08 -08:00
Guyang Song
08b8f3065b
add runtime env code owners (#21803) 2022-01-24 19:25:16 +08:00
Guyang Song
f8e41215b3
[1/n][cross-language runtime env] runtime env protobuf refactor (#21551)
We need to support runtime env for Java, C++, and cross-language. This PR only refactors the protobuf.
Related issue #21731
2022-01-24 19:24:59 +08:00
SangBin Cho
6b4aac7a08
Promote unstable tests to stable (#21811)
Promote tests that have passed 100% of the time over the last week to stable.
2022-01-24 02:10:37 -08:00