ray/dashboard
Archit Kulkarni 058c239cf1
[runtime env] Test common failure scenarios (#25977)
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)

Improves the following error message cases:
- Return `RuntimeEnvSetupError` on failure to upload `working_dir` or `py_modules`
- Return `RuntimeEnvSetupError` on failure to download files from GCS during runtime env setup

Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)

The approach is to use environment variables to induce failures at specific points in the upload and download paths. The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be. I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
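
A minimal sketch of the environment-variable approach, for illustration only. The variable name `EXAMPLE_FAIL_UPLOAD_FOR_TESTING`, the helper functions, and the dict standing in for the Internal KV are all hypothetical and are not taken from this PR; `RuntimeEnvSetupError` is redefined locally as a stand-in so the snippet is self-contained.

```python
import os

# Hypothetical variable name, chosen for this sketch only.
FAIL_UPLOAD_ENV_VAR = "EXAMPLE_FAIL_UPLOAD_FOR_TESTING"


class RuntimeEnvSetupError(Exception):
    """Local stand-in for the error type named in this commit message."""


def _store_in_kv(key: str, data: bytes, kv: dict) -> None:
    """Simulated low-level upload; fails when the test variable is set to "1"."""
    if os.environ.get(FAIL_UPLOAD_ENV_VAR) == "1":
        raise RuntimeError("Simulated upload failure for testing.")
    kv[key] = data


def upload_package(key: str, data: bytes, kv: dict) -> None:
    """Upload data, surfacing any failure as RuntimeEnvSetupError."""
    try:
        _store_in_kv(key, data, kv)
    except Exception as e:
        raise RuntimeEnvSetupError(f"Failed to upload package {key}: {e}") from e
```

A test could then set the variable (for example with pytest's `monkeypatch.setenv`), call the code under test, and assert that `RuntimeEnvSetupError` is raised. Because the switch is read from the process environment, it works even when the only entry point is a top-level call with no parameter for passing in a fake client.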
2022-08-15 11:35:56 -05:00
client [Dashboard] Fix edge cases for log file names in the dashboard log viewer (#27772) 2022-08-12 09:39:54 -07:00
modules [runtime env] Test common failure scenarios (#25977) 2022-08-15 11:35:56 -05:00
tests Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613) 2022-08-08 06:38:19 -07:00
__init__.py [Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00
agent.py [runtime env] Test common failure scenarios (#25977) 2022-08-15 11:35:56 -05:00
BUILD Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525) 2022-07-18 21:21:19 -07:00
consts.py [serve] Make serve agent not blocking when GCS is down. (#27526) 2022-08-08 16:29:42 -07:00
dashboard.py [Usage Stats] Record usage stats when dashboard disabled (#26042) 2022-07-28 23:01:49 -07:00
datacenter.py [Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818) 2022-07-26 03:10:57 -07:00
head.py [core] Support external ray dashboard URL (#27396) 2022-08-05 19:33:10 -07:00
http_server_agent.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
http_server_head.py [Core][cli][usability] ray stop prints errors during graceful shutdown (#25686) 2022-06-27 08:14:59 -07:00
k8s_utils.py [dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688) 2022-03-01 17:15:59 -08:00
memory_utils.py [State Observability] pre-alpha documentation (#26560) 2022-07-26 05:49:28 -07:00
optional_deps.py Make it so pydantic is required before we launch dashboard api server (#27345) 2022-08-03 14:24:51 -07:00
optional_utils.py [serve] Make serve agent not blocking when GCS is down. (#27526) 2022-08-08 16:29:42 -07:00
state_aggregator.py Convert job_manager to be async (#27123) 2022-08-05 19:33:49 -07:00
utils.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00