Skips 404s on node termination for GCP node provider.
Also resets internal "self.nodes_to_terminate" state at the start of an autoscaler iteration -- that's necessary for correct cleanup in the event of failed node termination.
If using the DataParallelTrainer, since we are running the BackendExecutor in a Trainable actor already, we don't need to create a new actor.
However if using Ray Train directly, we still want to run BackendExecutor in an actor for performance with Ray Client.
This PR does some refactoring to support both cases.
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered as a high priority to be fixed soon.
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap in this is if the local Python module is in the form of a wheel (`.whl` file.) This PR adds the missing support for uploading and installing the `.whl` file.
The Ray Dashboard starts Serve in the `"_ray_internal_dashboard"` namespace. However, Serve by default starts in the `"serve"` namespace. This causes surprising behavior when working with the Serve CLI and REST API.
This change make the Ray Dashboard start Serve in the `"serve"` namespace, allowing the REST API to work intuitively with the Python API.
Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel.
Additionally, job submission should also install a matching local ray to match with the job submission server.
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
GCS pubsub has been the default for awhile. There is little chance that we would need to revert back to Redis pubsub in future. This is the step in removing Redis pubsub, by first removing the `enable_gcs_pubsub()` feature guard.
Next move of #22932. This pr replace unordered_map to flat_hash_map in core worker and object manager module.
Also some interfaces, like GetAllReferenceCounts, which expose user interfaces in Java/Python, is exclusive as it's a little bit complicated. We save them to deal with pg together.
The follow-up PRs would be migrating in reference counting, placement group and others.
In this PR, we disabled every 10min FullGC which is not triggered by a global gc event in Java worker. As detail, we added `triggered_by_global_gc` flag to indicate whether the gc event is triggered by a global gc event. If it's triggered by global gc, we still need to do FullGC.
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>