This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.
Support specifying a default lifetime for actors whose lifetime is not specified at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
@ray.remote
class A:
    pass

ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
a1 = A.options(lifetime="non_detached").remote()  # a1 is a non-detached actor.
a2 = A.remote()  # a2 does not specify a lifetime, so it uses the job default and is a detached actor.
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
By default, sync only ~/ray_results/exp_name/trial_name/checkpoint_name instead of the whole trial directory (~/ray_results/exp_name/trial_name/).
Files like progress.csv, result.json, params.pkl, params.json, and events.out come from the driver process, so they do not need to be synced from the workers.
This could also enable us to decouple sync-up and delete: they don't have to wait for each other to finish.
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.
This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.
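A minimal sketch of the new approach, assuming the archive has already been fully unzipped under `base_dir` with a single top-level directory `tld`; the staging-directory trick is illustrative and not necessarily the exact Ray implementation:

```python
import os

def remove_dir_from_filepaths(base_dir: str, tld: str) -> None:
    # Rename the top-level directory out of the way first, so that a child
    # named exactly like the tld cannot collide with it during the moves.
    staging = os.path.join(base_dir, ".remove_tld_staging")
    os.rename(os.path.join(base_dir, tld), staging)
    # Move every child of the former tld up into base_dir.
    for child in os.listdir(staging):
        os.rename(os.path.join(staging, child), os.path.join(base_dir, child))
    # The staging directory is now empty and can be removed.
    os.rmdir(staging)
```

With this ordering, a working directory like `pkg/pkg/file.txt` first becomes `.remove_tld_staging/pkg/file.txt` and then `pkg/file.txt`, avoiding the name collision described above.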
This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
`test_traceback.py` was taking ~55s to finish recently, and since today it has started to time out at 60s more frequently. All test cases do succeed, so increase its test timeout for now. We will look into whether there is any performance regression separately.
This is the second-to-last PR to improve the `ActorDiedError` exception.
It propagates actor death cause metadata to the Ray error object, so that we can raise a better actor-died error exception.
After this PR is merged, I will add more metadata to each error message and write documentation explaining when each error happens.
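A minimal illustration of where the richer error surfaces, assuming an actor whose process dies mid-call; `RayActorError` is the existing exception type raised today, and the exact message depends on the death cause metadata this PR propagates:

```python
import os
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def crash(self):
        # Simulate an unexpected actor process death.
        os._exit(1)

ray.init()
w = Worker.remote()
try:
    ray.get(w.crash.remote())
except RayActorError as e:
    # With this PR, the message can explain why the actor died.
    print(e)
```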
TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/C++ cases
Follow-up PRs
- Disallow nullptr for `RayErrorInfo` input.
GCS pubsub uses long polling, so the subscriber waits instead of returning None when there is no buffered log. It needs a different heuristic to decide whether the driver is not keeping up with logs from the worker.
Making some minor fixes.
1. Update input `batch_size` to be the global batch size. Introduce `worker_batch_size` so each iteration trains the same global batch size (see the sketch after this list).
2. Update dataset `size` calculation to only refer to the fraction of the data that is trained on each worker. This allows calculations (e.g. training progress, accuracy) to be correct.
3. Add `model.train()` for generality.
4. Remove `smoke-test` flag since it's not really being used.
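A small sketch of the global-to-worker batch size relationship from item 1, assuming the global batch size divides evenly across workers; the function name is illustrative, not the actual training code:

```python
def compute_worker_batch_size(batch_size: int, num_workers: int) -> int:
    # `batch_size` is the global batch size; each worker processes an equal
    # shard so that one iteration across all workers consumes exactly
    # `batch_size` samples.
    assert batch_size % num_workers == 0, "global batch size must divide evenly"
    return batch_size // num_workers

# e.g. batch_size=128 with 4 workers -> each worker trains on 32 samples per step.
```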
- Tolerate gRPC deadline exceeded and transient failures in Python GCS AIO subscribers, making them consistent with Python GCS synchronous subscribers (see the sketch after this list).
- Tolerate any exception in the dashboard when subscribing to logs and error info, consistent with how the dashboard handles gRPC errors when obtaining node stats.
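A hedged sketch of the tolerance pattern from the first bullet, assuming a `grpc.aio` stub with a long-poll RPC; the method and variable names are illustrative, not Ray's actual subscriber code:

```python
import grpc

async def poll_once(stub, request):
    try:
        return await stub.GcsSubscriberPoll(request, timeout=30)
    except grpc.RpcError as e:
        if e.code() in (grpc.StatusCode.DEADLINE_EXCEEDED,
                        grpc.StatusCode.UNAVAILABLE):
            # Deadline exceeded / transient failure: treat it as an empty
            # poll and let the caller retry instead of raising.
            return None
        raise
```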
When a node is dead, the reference table should remove the locations of objects on that node. Otherwise, locality-aware scheduling will schedule tasks to the dead node.
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
When GCS runs as an individual component, its crashes can cause other components to fail.
Here are the two main cases covered in this patch:
1. monitor.py raises an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing GCS address KV entry can cause other components to fail to start.
This patch fixes these two issues and also increases the Redis connection timeout, which was too small.
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
Prior to this PR, sort, groupby, and aggregate defined separate types for extracting values from Dataset records. This was confusing, since the user had to understand the differences between the key types (which were basically identical).
This PR defines a common key type, `KeyFn`, which is simply `Union[None, str, Callable[[T], Any]]`. It is used as `sort(KeyFn...)`, `aggregate(Agg(KeyFn)...)`, and `groupby(KeyFn).agg(Agg(KeyFn), ...)`.
It also unifies the error generation paths into a common `_validate_key_fn` utility. This improves the errors generated when passing explicit `AggregateFn` classes, which previously failed in the workers if invalid.
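A brief illustration of the unified key type in use, assuming a small pandas-backed Dataset and the `Mean` aggregate from `ray.data.aggregate`; this is a sketch of the API shape, not code from the PR:

```python
import pandas as pd
import ray
from ray.data.aggregate import Mean

ds = ray.data.from_pandas(pd.DataFrame({"group": [0, 1] * 5, "value": range(10)}))

sorted_ds = ds.sort("value")                            # KeyFn given as a column name
grouped = ds.groupby("group").aggregate(Mean("value"))  # the same KeyFn type keys groupby and the aggregate
```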
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() ends and Python GCs the client server. After turning on GCS pubsub, the time when the client server is GC'ed changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, to be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
This PR fixes and re-enables the following tests in HA mode:
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
When cleaning up the function table, we use a key prefix to delete the data. But right now the prefix contains binary data, which doesn't work well with Redis key scans that use `*` in the match pattern.
For example, when the job id increases to 41, the cleanup deletes the keys for job 1, which leads to the new worker failing to import the function.
This PR uses the hex representation of the job id to avoid this.
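An illustrative sketch of the idea; the key prefix below is hypothetical and not the actual GCS key layout. Hex-encoding keeps the scan pattern free of raw binary bytes (which may happen to contain glob metacharacters such as `*`, `?`, or `[`), so one job's pattern can no longer match another job's keys:

```python
def function_table_prefix(job_id: bytes) -> str:
    # Hypothetical key prefix built from the hex form of the binary job id.
    return f"RemoteFunction:{job_id.hex()}:"

# Scanning with f"{function_table_prefix(job_id)}*" now only matches keys that
# literally start with this job's hex id.
```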
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.
Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the Arrow block format implementation for now. They'll be implemented in a later PR.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Fix a dlmalloc allocation bug; details in #21310. The change adds a C++ unit test and uses `_check_spilled_mb` in the tests.
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.
This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
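A conceptual sketch of the seeding change; the real logic lives in the C++ core, and the names and hash choice here are illustrative only:

```python
import hashlib

def deterministic_object_id(task_id: bytes, attempt_number: int, put_index: int) -> bytes:
    # Same task and same put/return index, but a different attempt number,
    # yields a different id, so a retry cannot collide with objects created
    # by the previous execution attempt.
    seed = task_id + attempt_number.to_bytes(4, "little") + put_index.to_bytes(4, "little")
    return hashlib.sha256(seed).digest()[:20]
```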
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
In Ray, functions are exported to the function table at runtime, but they are not cleaned up after use. This PR garbage-collects these entries when no job or detached actor references them.
Ideally, we should move the function table import/export feature to the core, so a GCS function manager is introduced; currently it is used for reference counting only.
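A conceptual sketch of the reference-counting behavior described above; this is not the actual GCS function manager, and the class, method, and key-prefix names are illustrative:

```python
from collections import defaultdict

class GcsFunctionManager:
    def __init__(self, kv_store):
        self._kv = kv_store                 # assumed internal KV interface
        self._refs = defaultdict(int)       # job_id -> live jobs / detached actors

    def add_job_reference(self, job_id: str) -> None:
        self._refs[job_id] += 1

    def remove_job_reference(self, job_id: str) -> None:
        self._refs[job_id] -= 1
        if self._refs[job_id] <= 0:
            # Nothing references this job's exported functions any more,
            # so its function-table entries can be garbage collected.
            self._kv.delete_prefix(f"RemoteFunction:{job_id}:")
            del self._refs[job_id]
```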