Commit graph

11059 commits

Author SHA1 Message Date
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Wilson Wang
30a4761592
Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790)
This patch fixed two issues.

1. log_monitor.py can crash when gcs is not temporarily available. Added retry logic in gcs_pubsub.py.
2. it is possible that the signal handler can raise another exception during exception handling.
2022-01-25 14:07:26 -08:00
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
shrekris-anyscale
e4370720cc
[Serve] Add "Serve" team tag to untagged release tests (#21861) 2022-01-25 11:46:03 -08:00
Dhruv Nair
3d79815cd0
Comet Integration (#20766)
This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/).

Co-authored-by: Michael Cullan <mjcullan@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-25 11:42:00 -08:00
Clark Zinzow
1971a08b7d
[RFC] [Core] Support disabling log redirection via RAY_LOG_TO_STDERR environment variable. (#21767) 2022-01-25 10:52:53 -08:00
Gagandeep Singh
395297a9bd
Unskip tests for Windows in test_output (#21775) 2022-01-25 09:25:01 -08:00
Matti Picus
d3d1e8559c
enable passing metric tests on windows (#21755)
Resubmitting #21705 which was merged then reverted. It seems somehow sphinx building broke in the meantime, not clear how it is connected to this PR.

Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
2022-01-25 09:20:16 -08:00
Sven Mika
d5bfb7b7da
[RLlib] Preparatory PR for multi-agent multi-GPU learner (alpha-star style) #03 (#21652) 2022-01-25 14:16:58 +01:00
SangBin Cho
b2cd123522
[Runtime Env] Suppress the log messages when RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0 (#21806)
There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. Basically if users specify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED =0`, this will disable runtime env logs. 

Note that in the log monitor RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1 by default. This is temporary, and I'd like to make this 0 by default after improving runtime error failure messages. 

Once we disable log msgs by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`
2022-01-25 00:42:52 -08:00
Gagandeep Singh
290f3172ad
Unskipped tests for Windows in test_client.py (#21824)
All the tests in `test_client.py` pass on Windows without issues, so unskipping them here.
2022-01-24 22:51:54 -08:00
Lixin Wei
bc55a958c4
[Core] Support UTF-8 Actor Creation Exceptions (#21807)
Now if an actor throws an exception containing non-ASCII characters, the actor won't die and will be alive.

This is because the following exception occurred during handling the user's exception:
```
  File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
  File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
    worker.core_worker.push_error(job_id, error_type, message, time.time())
  File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```

This PR fixes this issue.
2022-01-24 20:27:43 -08:00
Guyang Song
089f49f554
[doc] fix doc of container-based runtime env (#21815) 2022-01-25 12:23:15 +08:00
isaac-vidas
236fe58259
[Doc] Update requests calls to ray job submission api (#21802) 2022-01-24 17:44:31 -08:00
Max Pumperla
7953c9ca57
[docs] integrate algolia docsearch, move to sphinx panels (#21814) 2022-01-24 17:00:41 -08:00
Andrew A. Naguib
f026376556
[Tune] PTL replace deprecated running_sanity_check with sanity_checking (#21831)
`running_sanity_check` was deprecated and removed in https://github.com/PyTorchLightning/pytorch-lightning/pull/9209 in favor of `sanity_checking`
2022-01-24 16:14:05 -08:00
Siyuan (Ryans) Zhuang
99b287d236
[workflow] Fix workflow recovery issue due to a bug of dynamic output (#21571)
* Fix workflow recovery issue due to a bug of dynamic output

* add tests
2022-01-24 15:34:57 -08:00
DK.Pino
c2199a50e3
[Placement Group] Fix remove pg flaky when worker startup slow (#20474)
Currently, when we destroy the created placement group, we will kill all workers that are related to this placement group, however, we only killed the running worker at this time, if there is a worker which startup very slow and the related placement group was already destroyed before the worker startup successfully, then there will be a worker leak.
2022-01-24 15:30:04 -08:00
SangBin Cho
7d4287a6ab
[Test] Move long running tests to run everyday (#21813)
Long running tests are cheap and low overhead (small number of node usage). We should just promote this to run every day so we can catch regressions quickly.
2022-01-24 15:10:27 -08:00
SangBin Cho
ac5f38d7fd
[Test] Fix dask on ray test on K8s (#21816)
Fix dash on ray large scale test on K8s. Basically, chmod requires a root access, which we don't have it by default in the k8s cluster. We don't need chmod I think (I verified the test passes without it).
2022-01-24 15:09:22 -08:00
mwtian
a10d05ce27
[Bootstrap] fix log format (#21826) 2022-01-24 15:06:41 -08:00
Yi Cheng
57afb2f75a
[gcs/ha] Skip raydb test when it's gcs bootstrap mode (#21771)
RayDP needs to be updated to work with redisless ray.
To be more specific this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)
) needs to be updated to using `node.address`

We should update this after the release with the feature being turned on by default.
2022-01-24 14:43:31 -08:00
shrekris-anyscale
03d93ba7ee
Add a new End-to-End tutorial in Serve that walks users through deploying a model (#20765)
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) walking users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial walking users through deploying an entire Hugging Face model using Serve, providing a better understanding of how to deploy an actual model via Serve.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2022-01-24 16:36:04 -06:00
Sven Mika
c288b97e5f
[RLlib] Issue 21629: Video recorder env wrapper not working. Added test case. (#21670) 2022-01-24 19:38:21 +01:00
SangBin Cho
2010f13175
Fix dashboard test bug (#21742)
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there are lots of code that doesn't assert on this method (which means things are not actually tested).
2022-01-24 11:38:51 -06:00
Antoni Baum
850eb88cde
[tune] Fix analysis without registered trainable (#21475)
This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.
2022-01-24 08:27:08 -08:00
Guyang Song
08b8f3065b
add runtime env code owners (#21803) 2022-01-24 19:25:16 +08:00
Guyang Song
f8e41215b3
[1/n][cross-language runtime env] runtime env protobuf refactor (#21551)
We need to support runtime env for java、c++ and cross-language. This PR only do a refactor of protobuf.
Related issue #21731
2022-01-24 19:24:59 +08:00
SangBin Cho
6b4aac7a08
Promote unstable tests to stable (#21811)
Promote tests that have passed 100% last 1 week to stable
2022-01-24 02:10:37 -08:00
Chen Shen
a60251f47a
[Core] Fix 16GB mac perf issue by limit the plasma store size to 2GB (#21224)
* add changes

* as title

* fix

* max to min

* fix tests
2022-01-24 01:52:59 -08:00
Shawn
6603ad450a
[Java] print hang test case name (#21804)
* print hang test case name

* use getFullTestName
2022-01-23 23:56:44 -08:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
SangBin Cho
babc03edf2
Add a threaded actor k8s test (#21739)
Add threaded actor flaky test to k8s.
2022-01-23 20:12:57 -08:00
DK.Pino
8cd7a5c438
[Placement Group] Make placement group commit resource rpc request batched (#21240)
This is one part of this refactor, #20715 , make the commit resource RPC requests batched per node.
2022-01-23 06:16:09 -08:00
Jiao
5d382cfeb3
[nit] remove decorator in test_cli.py (#21792)
Full context see https://github.com/ray-project/ray/issues/21791

pytest work for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
2022-01-23 06:05:05 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
Gagandeep Singh
2da2ac52ce
Unskipped test_worker_stdout (#21708) 2022-01-22 02:43:03 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support the ability to specify a default lifetime for actors which are not specified lifetime when creating. This is a job level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a non-detached actor.
```

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
Gagandeep Singh
b00385f9a2
Using a deterministic approach to check connections in test_job_timestamps (#21693) 2022-01-21 20:18:05 -08:00
Archit Kulkarni
c4bf68a083
[runtime env] Short-circuit protobuf serialization/deserialization for empty runtime envs (#21788)
* Fix check for empty runtime_env ("{}", not "")

* also check for ""
2022-01-22 13:07:03 +09:00
xwjiang2010
0abcd5eea5
[tune] only sync up and sync down checkpoint folder for cloud checkpoint. (#21658)
By default, ~/ray_results/exp_name/trial_name/checkpoint_name.
Instead of the whole trial checkpoint (~/ray_results/exp_name/trial_name/) directory.
Stuff like progress.csv, result.json, params.pkl, params.json, events.out etc are coming from driver process.
This could also enable us to de-couple sync up and delete - they don't have to wait for each other to finish.
2022-01-21 17:56:05 -08:00
mwtian
e8ce01c525
[Dashboard] offload blocking work to a thread pool (#21762)
Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.
2022-01-21 17:55:11 -08:00
matthewdeng
8119b62640
[train] refactor callback logdir and results preprocessors (#21468)
* [train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors

* simplify profiler

* read on get_and_clear_profile_traces

* refactor callbacks

* remove var

* Update python/ray/train/callbacks/logging.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* Update python/ray/train/callbacks/results_prepocessors/keys.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* address comments; add tests

* fix test

* address comments

* docs

* address comments'

* fix test

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-21 17:23:34 -08:00
matthewdeng
165a025641
[train] update worker batch size docs (#21761)
Making it explicit how the user should think about batch size for PyTorch in a distributed setting, similar to what's already done for TensorFlow.

![image](https://user-images.githubusercontent.com/3967392/150421340-df73f574-8531-4626-88a6-b80442ea6b7f.png)
2022-01-21 17:22:47 -08:00
Junwen Yao
216c4bf9a6
[Serve] warn when serve.start() with different options (#21562) 2022-01-21 15:51:26 -08:00
Max Pumperla
f9b71a8bf6
[docs] new structure (#21776)
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way:

- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
2022-01-21 15:42:05 -08:00
shrekris-anyscale
45eebdd6e3
[Serve] Handle name collisions when unzipping directories in unzip_package (#21723)
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.

This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.

This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
2022-01-21 15:27:28 -06:00
Adam Golinski
2954bf9a48
[docs][tune] Fix typo in schedulers.rst (#21777)
Fix typo in schedulers.rst
2022-01-21 13:21:01 -08:00
Chen Shen
c401e94d2e
[resource-reporting] refactor resource scheduler 1/n #21732 2022-01-21 12:39:21 -08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00