Fix 2.0.0 release blocker bug where Ray State API and Jobs not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides in values printed or returned to the user.
images.githubusercontent.com/6900234/184809934-8d150874-90fe-4b45-a13d-bce1807047de.png">
Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com
Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through environment variable. This can be helpful if the Ray dashboard is hosted externally.
Since usage stats are recorded from the dashboard (which will become API server), it is not collected when the dashboard is not included (include_dashboard=False).
This PR fixes the issues by
change dashboard -> API server (to avoid confusing users that dashboard is still started when include_dashboard=False)
Only load modules that are irrelevant to the dashboard from the API server, so it will have the same impact as no dashboard.
## Why are these changes needed?
As in this https://github.com/ray-project/ray/pull/26405 we added the health check for gcs and raylets.
This PR expose them in the endpoint in dashboard and dashboard agent.
For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not.
For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not.
We think raylet is live if
- GCS is dead
- GCS is alive but GCS think the raylet is dead
If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.
Why are these changes needed?
This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM.
What has been changed?
Added signal handlers for some of the subprocesses:
dashboard (head)
log monitor
ray client server
Changed the --block semantics and prompt messages.
Related issue number
Closes#25518
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
This PR implements ray list tasks and ray list objects APIs.
NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain port.
As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comment about how we may fix them.
Also, suppress exceptions when sync subscribers are disconnected from GCS.
I can push changes in this PR to #21513 as well.
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Dashboard contains resource reporter and actor subscribers. Dashboard agent has resource report publisher. So GCS pubsub needs to support these channel types.
Also refactor GCS AIO subscribers to have each subscriber per channel. This matches the API of GCS sync subscribers, and make subscribing with multiple channels easier.
Using Ray pubsub for publishing and subscribing logs via GCS, from Python worker, log importer, dashboard and unit tests.
This change is guarded behind the RAY_gcs_grpc_based_pubsub feature flag.
## Why are these changes needed?
This is a part of redis removal. This PR remove redis kv in function table.
rpush related code is not updated in this PR.
## Related issue number
## Why are these changes needed?
This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`.
Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work.
Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.
GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.
## Related issue number
## Why are these changes needed?
This is part of redis removal project. In this PR all direct usage of redis got removed except function table.
Function table will be migrated in the next PR
## Related issue number
#19443
## Why are these changes needed?
It's part of redis removal project. This PR focus on using gcs kv in internal kv.
- gcs client is introduced
- internal kv is updated to use gcs rpc client based kv
- related code got updated.
The other PR will update components using redis to use internal kv.
## Related issue number
https://github.com/ray-project/ray/issues/19443
* Dashboard select port; Fix dashboard may hangs when exit
* Add test case
* Fix
* Fix test_stats_collector.py::test_get_all_node_details
* Refine dashboard error messages
* Refine code
* Refine code
* Show last 10 lines of dashboard log if start dashboard failed
* Fix ValueError: too many values to unpack (expected 2) when getsockname
* Fix test_multi_node_3.py::test_calling_start_ray_head may fail
* Fix Windows CI
* Disable dashboard in C++ test
* Refine code
* Fix issue 7084
Co-authored-by: 刘宝 <po.lb@antfin.com>
* Add RAY_NODE_ID environment var to agent
* Node ralated data use node id as key
* ray.init() return node id; Pass test_reporter.py
* Fix lint & CI
* Fix comments
* Minor fixes
* Fix CI
* Add const to ClientID in AgentManager::Options
* Use fstring
* Add comments
* Fix lint
* Add test_multi_nodes_info
Co-authored-by: 刘宝 <po.lb@antfin.com>
* Improve reporter module
* Add test_node_physical_stats to test_reporter.py
* Add test_class_method_route_table to test_dashboard.py
* Add stats_collector module for dashboard
* Subscribe actor table data
* Add log module for dashboard
* Only enable test module in some test cases
* CI run all dashboard tests
* Reduce test timeout to 10s
* Use fstring
* Remove unused code
* Remove blank line
* Fix dashboard tests
* Fix asyncio.create_task not available in py36; Fix lint
* Add format_web_url to ray.test_utils
* Update dashboard/modules/reporter/reporter_head.py
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
* Add DictChangeItem type for Dict change
* Refine logger.exception
* Refine GET /api/launch_profiling
* Remove disable_test_module fixture
* Fix test_basic may fail
Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>