Resubmitting #21705, which was merged and then reverted. It seems the Sphinx build broke in the meantime; it's unclear how that is connected to this PR.
Here is the original description:
>Part of the effort to enable tests on Windows, this enables `test_metrics` and `test_metric_agents`, which pass locally.
There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. Basically, if users set `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, runtime env logs are disabled.
Note that in the log monitor, `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1` by default. This is temporary, and I'd like to make it 0 by default after improving runtime env failure messages.
Once we disable these log messages by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`.
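As a sketch, a user would disable the logs like this (assuming the variable is read at startup, so it must be set before initializing Ray):
```python
import os

# Assumption for illustration: set the variable before Ray starts so the
# log monitor picks it up.
os.environ["RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED"] = "0"

import ray

ray.init()  # runtime env setup logs are no longer streamed to the driver
```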
Currently, if an actor throws an exception containing non-ASCII characters, the actor won't die and will stay alive.
This is because the following exception occurs while handling the user's exception:
```
File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
worker.core_worker.push_error(job_id, error_type, message, time.time())
File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```
This PR fixes this issue.
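For illustration, a minimal hypothetical repro (one way to hit this path is an exception raised in an actor constructor, whose failure should kill the actor):
```python
import ray

ray.init()

@ray.remote
class Worker:
    def __init__(self):
        # A non-ASCII message here previously hit a UnicodeEncodeError inside
        # push_error_to_driver, and the failed actor stayed alive.
        raise ValueError("初始化失败: неверная конфигурация")

    def ping(self):
        return "pong"

w = Worker.remote()
try:
    ray.get(w.ping.remote())  # with the fix, the constructor failure surfaces
except ray.exceptions.RayActorError:
    pass  # the actor died during __init__, as expected
```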
Currently, when we destroy a placement group, we kill all workers related to it. However, we only kill the workers that are already running; if a worker starts up very slowly and the related placement group is destroyed before the worker finishes starting, that worker is leaked.
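A sketch of the triggering scenario, using the placement group API of that era (the timing is what matters; the task itself is illustrative):
```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

@ray.remote(num_cpus=1)
def task():
    return "ok"

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

# Scheduling a task starts a worker; if that worker is slow to start and the
# group is removed before startup finishes, the worker used to be leaked.
ref = task.options(placement_group=pg).remote()
remove_placement_group(pg)
```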
Long-running tests are cheap and low overhead (they use a small number of nodes). We should just promote them to run every day so we can catch regressions quickly.
Fix dash on the Ray large-scale test on K8s. Basically, `chmod` requires root access, which we don't have by default in the K8s cluster. I don't think we need the `chmod` (I verified the test passes without it).
RayDP needs to be updated to work with redisless Ray.
To be more specific, this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)) needs to be updated to use `node.address`.
We should update this after the release with the feature turned on by default.
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) that walks users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial that walks users through deploying an entire Hugging Face model with Serve, giving users a better understanding of how to deploy an actual model via Serve.
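For flavor, a Serve deployment of this shape might look like the following sketch (the model choice and endpoint behavior are illustrative, sketched against the `serve.run` API, and are not the tutorial's exact code):
```python
import ray
from ray import serve
from transformers import pipeline

@serve.deployment
class Summarizer:
    def __init__(self):
        # Model choice is an assumption for illustration.
        self.model = pipeline("summarization", model="t5-small")

    async def __call__(self, request):
        text = await request.json()
        return self.model(text, max_length=60)[0]["summary_text"]

serve.run(Summarizer.bind())
```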
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there is lots of code that doesn't assert on its return value (which means those code paths are not actually tested).
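A self-contained sketch of the anti-pattern; the helper below mimics the dashboard utility, and its exact signature is an assumption:
```python
import time

def wait_until_succeeded_without_exception(fn, exceptions, timeout_s=5.0):
    """Mimic of the dashboard helper: returns True/False, never raises."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            fn()
            return True
        except exceptions:
            time.sleep(0.1)
    return False

def never_ready():
    raise RuntimeError("not ready")

def ready():
    return None

# Bug pattern: the boolean result is dropped, so this line "passes" silently
# even though the condition never became true.
wait_until_succeeded_without_exception(never_ready, (RuntimeError,), timeout_s=0.3)

# Fixed pattern: assert on the result so a timeout actually fails the test.
assert wait_until_succeeded_without_exception(ready, (RuntimeError,), timeout_s=0.3)
```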
This PR fixes issues with loading `ExperimentAnalysis` from a path or pickle when the trainable used in the trials is not registered. Chiefly, it ensures that the `stub` attribute set in `load_trials_from_experiment_checkpoint` doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in `ExperimentAnalysis` are turned into stubs if they aren't already. A test has also been added.
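A hypothetical sketch of the stubbing idea (Tune's real `Trial` class is far more involved; the attributes here are illustrative):
```python
import pickle

class Trial:
    def __init__(self, trainable):
        self.trainable = trainable  # may be unregistered and unpicklable
        self.stub = False

    def __getstate__(self):
        # Drop the trainable reference and mark the trial as a stub, so the
        # pickle round-trips even when the trainable isn't registered.
        state = self.__dict__.copy()
        state["trainable"] = None
        state["stub"] = True
        return state

trial = Trial(trainable=lambda config: None)
restored = pickle.loads(pickle.dumps(trial))
assert restored.stub and restored.trainable is None
```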
For full context, see https://github.com/ray-project/ray/issues/21791.
pytest works for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
Support specifying a default lifetime for actors whose lifetime is not given at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
# Assuming A is an actor class defined with @ray.remote.
a1 = A.options(lifetime="non_detached").remote()  # a1 is a non-detached actor.
a2 = A.remote()  # a2 is a detached actor (it inherits the job default).
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
By default, sync only the checkpoint directory (~/ray_results/exp_name/trial_name/checkpoint_name) instead of the whole trial directory (~/ray_results/exp_name/trial_name/).
Files like progress.csv, result.json, params.pkl, params.json, and events.out come from the driver process.
This could also enable us to decouple sync-up and delete - they don't have to wait for each other to finish.
Currently, the GCS KV client only has a blocking API. Calling it from the dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2 min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing an async GCS KV API will be done in the future.
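A minimal sketch of the offloading pattern, with a stand-in for the blocking KV call (not the dashboard's actual code):
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_kv_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gcs_kv")
_fake_store = {"job:1": b"metadata"}  # stand-in for the real GCS KV backend

def blocking_kv_get(key: str) -> bytes:
    # Stand-in for the blocking GCS KV client call (can take seconds).
    return _fake_store[key]

async def kv_get(key: str) -> bytes:
    # Offload the blocking call so the dashboard event loop keeps serving
    # other requests while the KV operation is in flight.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_kv_executor, blocking_kv_get, key)

print(asyncio.run(kv_get("job:1")))
```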
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following ways:
- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip archive working directories and remove their top-level directory (tld). However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from the zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.
This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.
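A rough sketch of the extract-first approach, assuming a zip with a single top-level directory (the PR's actual helpers, `unzip_package` and `remove_dir_from_filepaths`, differ in their details):
```python
import os
import uuid
import zipfile

def unzip_and_remove_top_level_dir(zip_path: str, dest: str) -> None:
    # Extract everything first, rather than moving files one at a time.
    with zipfile.ZipFile(zip_path) as zf:
        top_level = zf.namelist()[0].split("/")[0]
        zf.extractall(dest)
    # Rename the tld to a unique temporary name first, so a child that shares
    # the tld's name no longer collides with it when moved up a level.
    tmp = os.path.join(dest, uuid.uuid4().hex)
    os.rename(os.path.join(dest, top_level), tmp)
    for child in os.listdir(tmp):
        os.rename(os.path.join(tmp, child), os.path.join(dest, child))
    os.rmdir(tmp)
```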
This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
`test_traceback.py` was taking ~55s to finish recently, and since today it has started to time out at 60s more frequently. All test cases do succeed, so increase its test timeout for now. We will separately look into whether there is any performance regression.
This is the second-to-last PR in the effort to improve the `ActorDiedError` exception.
This propagates actor death cause metadata to the Ray error object. In this way, we can raise a better actor-died error exception.
After this PR is merged, I will add more metadata to each error message and write documentation that explains when each error happens.
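As a hypothetical illustration of what propagating the death cause enables at the API surface (the actor and exit mechanism are made up for the example):
```python
import os

import ray

@ray.remote
class Dies:
    def crash(self):
        os._exit(1)  # simulate an unexpected worker process death

ray.init()
actor = Dies.remote()
try:
    ray.get(actor.crash.remote())
except ray.exceptions.RayActorError as e:
    # With death cause metadata propagated, the message can say *why* the
    # actor died (e.g., the worker process exited) instead of a generic
    # "actor died" error.
    print(e)
```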
TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/cpp cases
Follow up PRs
- Not allowing nullptr for RayErrorInfo input.
This fixes the event stats segfault. I made a consistent repro script that produces 3~5 segfaults per run of `placement_group_mini_integration_test` and verified this fixes the issue by running it more than 5 times.
Note that when GCS HA is on, the same repro uncovered a different shutdown bug in pubsub (resubscription seems to happen, and idempotency is currently not supported). I will file a separate issue so that @mwtian can handle it.
## Root cause
The problem was that we were calling `gcs_client_->Disconnect()` from the task execution event loop. Note that `gcs_client_` uses `io_service` as its main event loop, meaning `Disconnect` was being called from a different thread.
If you look at the `Disconnect` method, it resets lots of pointers. Resetting pointers to attributes that can be used by other threads -> segfault.
This PR fixes the issue by not resetting the pointers there. We do it only after joining the io service thread, because at that point we can guarantee the GCS client won't be used by other threads (resetting `gcs_client_` is probably not necessary, but I added it as a safety check).
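A minimal sketch of the ordering fix, transposed to Python threading for illustration (the real code is C++):
```python
import threading

class Client:
    """Sketch: tear down shared state only after joining the thread using it."""

    def __init__(self):
        self.conn = object()  # shared state the io thread dereferences
        self._stop = threading.Event()
        self._io_thread = threading.Thread(target=self._io_loop)
        self._io_thread.start()

    def _io_loop(self):
        while not self._stop.is_set():
            _ = self.conn  # analogous to gcs_client_ use on the io thread

    def shutdown(self):
        self._stop.set()
        self._io_thread.join()  # the buggy version reset self.conn before this
        self.conn = None        # safe now: no other thread can touch it

Client().shutdown()
```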
Note that the GCS client is currently written in a very complicated way, and we need some refactoring to make the thread-safety and code structure cleaner. I will take on that task once I have bandwidth.
To let other systems or internal projects reuse Ray's deps Bazel functions, we need to change this local access style to global access under the ray-project namespace.
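As a rough Bazel/Starlark sketch of the difference (the repository name `@com_github_ray_project_ray` and the `ray_deps_setup.bzl` path follow common conventions but are assumptions here):
```
# Local, repo-relative load: only resolves from inside the Ray repo itself.
load("//bazel:ray_deps_setup.bzl", "ray_deps_setup")

# Namespaced load: resolves for any project that declares the Ray repository
# (e.g., as @com_github_ray_project_ray) in its WORKSPACE.
load("@com_github_ray_project_ray//bazel:ray_deps_setup.bzl", "ray_deps_setup")
```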
Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>