Why are these changes needed?
This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM.
What has been changed?
Added signal handlers for some of the subprocesses:
dashboard (head)
log monitor
ray client server
Changed the --block semantics and prompt messages.
Related issue number
Closes#25518
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
To avoid this error:
(raylet) Traceback (most recent call last):
(raylet) File "/home/iamhatesz/.pyenv/versions/alan-brain-py3.9/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet) gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.
Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
## Why are these changes needed?
This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`.
Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work.
Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.
GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.
## Related issue number
* Dashboard select port; Fix dashboard may hangs when exit
* Add test case
* Fix
* Fix test_stats_collector.py::test_get_all_node_details
* Refine dashboard error messages
* Refine code
* Refine code
* Show last 10 lines of dashboard log if start dashboard failed
* Fix ValueError: too many values to unpack (expected 2) when getsockname
* Fix test_multi_node_3.py::test_calling_start_ray_head may fail
* Fix Windows CI
* Disable dashboard in C++ test
* Refine code
* Fix issue 7084
Co-authored-by: 刘宝 <po.lb@antfin.com>
* In Progress.
* Done.
* Fix the issue.
* Add wait for condition because logs are not written right away now.
* debug string.
* lint.
* Fix flaky test.
* Fix issues.
* Fix test.
* lint.
* Improve reporter module
* Add test_node_physical_stats to test_reporter.py
* Add test_class_method_route_table to test_dashboard.py
* Add stats_collector module for dashboard
* Subscribe actor table data
* Add log module for dashboard
* Only enable test module in some test cases
* CI run all dashboard tests
* Reduce test timeout to 10s
* Use fstring
* Remove unused code
* Remove blank line
* Fix dashboard tests
* Fix asyncio.create_task not available in py36; Fix lint
* Add format_web_url to ray.test_utils
* Update dashboard/modules/reporter/reporter_head.py
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
* Add DictChangeItem type for Dict change
* Refine logger.exception
* Refine GET /api/launch_profiling
* Remove disable_test_module fixture
* Fix test_basic may fail
Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
* Use new dashboard if environment var RAY_USE_NEW_DASHBOARD exists; new dashboard startup
* Make fake client/build/static directory for dashboard
* Add test_dashboard.py for new dashboard
* Travis CI enable new dashboard test
* Update new dashboard
* Agent manager service
* Add agent manager
* Register agent to agent manager
* Add a new line to the end of agent_manager.cc
* Fix merge; Fix lint
* Update dashboard/agent.py
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* Update dashboard/head.py
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* Fix bug
* Add tests for dashboard
* Fix
* Remove const from Process::Kill() & Fix bugs
* Revert error check of execute_after
* Raise exception from DashboardAgent.run
* Add more tests.
* Fix compile on Linux
* Use dict comprehension instead of dict(generator)
* Fix lint
* Fix windows compile
* Fix lint
* Test Windows CI
* Revert "Test Windows CI"
This reverts commit 945e01051ec95cff5fcc1c0bc37045b46e7ad9a6.
* Fix ParseWindowsCommandLine bug
* Update src/ray/util/util.cc
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>