hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
Alan Guo	be92dd72d5	[Dashboard] Fix edge cases for log file names in the dashboard log viewer (#27772 )	2022-08-12 09:39:54 -07:00
Alan Guo	c083ca5871	Add GPU info to new dashboard (#27074 ) Support a GPU column for the new dashboard Have first node be default expanded Signed-off-by: Alan Guo aguo@anyscale.com fixes #13889 Addresses comment from #26996	2022-08-02 15:32:55 -07:00
Alan Guo	d25a3ff80a	[Dashboard] Fix node rows not being removed correctly when using filters (#27205 )	2022-07-28 13:53:47 -07:00
Alan Guo	a7dca17973	Make New Dashboard the default dashboard (#26996 ) Add UsageStats alert to new dashboard Update wording of "back to legacy dashboard", "try new dashboard" buttons Signed-off-by: Alan Guo aguo@anyscale.com	2022-07-27 07:04:34 -07:00
Alan Guo	5d6bc5360d	Fix the jobs tab in the beta dashboard and fill it with data from both "submission" jobs and "driver" jobs (#25902 ) ## Why are these changes needed? - Fixes the jobs tab in the new dashboard. Previously it didn't load. - Combines the old job concept, "driver jobs", and the new job submission concept into a single concept called "jobs". The Jobs tab shows information about both kinds of jobs. - Updates all job APIs: they now return both submission jobs and driver jobs. They also contain additional data in the response, including "id", "job_id", "submission_id", and "driver". They also accept either job_id or submission_id as input. - Job ID is the same as the "ray core job id" concept. It is in the form of "0100000" and is the primary id to represent jobs. - Submission ID is an ID that is generated for each ray job submission. It is in the form of "raysubmit_12345...". It is a secondary id that can be used if a client needs to provide a self-generated id, or if the job id doesn't exist (ex: if the submission job doesn't create a ray driver). This PR has 2 deprecations: - The `submit_job` sdk now accepts a new kwarg `submission_id`. `job_id` is deprecated. - The `ray job submit` CLI now accepts `--submission-id`. `--job-id` is deprecated. This PR has 4 backwards incompatible changes: - The `list_jobs` sdk now returns a list instead of a dictionary. - The `ray job list` CLI now prints a list instead of a dictionary. - The `/api/jobs` endpoint returns a list instead of a dictionary. - The `POST api/jobs` endpoint (submit job) now returns a json with a `submission_id` field instead of `job_id`.	2022-07-27 02:39:52 -07:00
Alan Guo	50b20809b8	[Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818 ) Signed-off-by: Alan Guo <aguo@anyscale.com> ## Why are these changes needed? Reduces memory footprint of the dashboard. Also adds some cleanup to the errors data. Also cleans up actor cache by removing dead actors from the cache. Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually. ## Related issue number fixes #23680 fixes #22027 fixes #24272	2022-07-26 03:10:57 -07:00
Archit Kulkarni	084f06f49a	[Doc] [Job submission] [Dashboard] Add tip for long runtime_env installation and improve error (#26911 ) # Why are these changes needed? The dashboard can display the message <actor> cannot be created because the Ray cluster cannot satisfy its resource requirements in the case where the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing. This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files which should have more information. The runtime env cannot stall forever, it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env can still stall forever. # Related issue number Closes #26332	2022-07-25 23:32:27 -07:00
Guyang Song	bf97a6944b	[Dashboard] Actor Table UI Optimize (#26785 ) Co-authored-by: 多牧 <xuzhi.mxz@antfin.com>	2022-07-25 18:49:48 +08:00
Jules S. Damji	55368402ee	added summary why and when to use bulk vs streaming data ingest (#26637 )	2022-07-17 18:46:58 -07:00
Alan Guo	7ad3a247bf	[Dashboard] [Frontend] Add workers to the main node tab in the New Dashboard UI (#26274 ) The old dashboard UI made it much easier to see all the work across all workers because workers were shown alongside nodes in the main nodes page. This change brings the same functionality to the new Dashboard UI. Some changes in this PR: Factor out the NodeRow into its own component and into its own file. Introduce WorkerRow, which shows information about a worker. Update the heading of the table column because the column will show different data depending on whether it's a node row or a worker row. Make sure we're rounding percentages to a single decimal place. The Logs button for a worker row will go to the logs page and filter out just the log files related to that worker. Update the api for fetching nodes into fetching nodes + workers. Fix a bug where object store memory was not showing the total size but instead the remaining size.	2022-07-12 16:28:08 -07:00
Guyang Song	d1d5fe61c2	[Dashboard][Frontend] Worker table enhancement (#25934 )	2022-06-21 14:09:48 +08:00
Guyang Song	e13cc4088a	[Dashboard] Don't sort node list by default (#25884 )	2022-06-20 11:35:12 +08:00
mwtian	f79b826f31	[Dashboard] avoid showing disk info when it is unavailable (#24992 )	2022-05-24 17:13:47 -07:00
SangBin Cho	b9c30529d8	[Core/Observability 1/N] Add a "running" state to task status (#24651 ) This PR adds 2 more states to the TaskStatus enum: WAITING_FOR_EXECUTION = 5 (the task is scheduled properly and waiting for execution; this includes the time to deliver the task to the remote worker plus queueing time on the execution side) and RUNNING = 6 (the task is running).	2022-05-16 05:39:05 -07:00
Jiajun Yao	628f886af4	Don't show usage stats prompt in dashboard if prompt is disabled (#24700 )	2022-05-12 07:55:28 -07:00
Jiajun Yao	1daad65568	[Doc] Add doc for usage stats collection (#24522 )	2022-05-10 17:18:49 -07:00
Jiajun Yao	3fb63847e2	Show usage stats prompt (#23822 ) Show usage stats prompt when it's enabled. The current UX is: * The usage stats enabled or disabled message is shown every time in both terminal and dashboard. * If users don't explicitly enable or disable usage stats, the first time they start a ray cluster interactively, they will be asked to confirm, and collection will be enabled if there is no user action within 10s. If it's non-interactive, collection is enabled by default without confirmation. * ray.init() doesn't collect usage stats. * Usage stats can be disabled via three approaches: 1. RAY_USAGE_STATS_ENABLED env var, 2. ray xxx --disable-usage-stats, 3. ray disable-usage-stats	2022-04-25 16:01:24 -07:00
Amog Kamsetty	1d11963618	[Dashboard] Specify `@types/react` resolution (#23794 ) A new @types/react release has broken the dashboard build. Make sure to specify the older version under package resolutions.	2022-04-07 17:24:19 -07:00
mwtian	51feac9868	Clean up dev docs (#23407 )	2022-03-22 23:22:56 -07:00
Yi Cheng	7d2237bc9f	[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919 )	2022-01-26 22:48:17 -08:00
Yao Yuan	422d20e945	[Dashboard] Fix NPE when there is no GPU on the node (#21650 ) There is an NPE bug that causes a browser crash when there is no GPU on the node. We can add a condition to fix it.	2022-01-18 08:12:49 -08:00
Simon Mo	72ae22e82b	[CI] Fix frontend build issue (#20375 )	2021-11-15 10:12:43 -08:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535 ) * Fix QMix, SAC, and MADDPG too. * Unpin gym and deprecate pendulum v0. Many tests in rllib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so we will potentially have to rerun all of the pendulum v1 benchmarks, or use another environment instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests have been moved to python 3.7. * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20. * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * Migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Adding type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Philipp Moritz	45f1ff0fa9	[Windows] Update react-scripts dependency for dashboard (#19489 )	2021-10-20 17:57:30 -07:00
Chu Xiangyang	505aa89d12	[Dashboard] Add start/end time for job (#18901 )	2021-09-28 20:57:13 -07:00
Chu Xiangyang	2220fe8a78	[Dashboard] Keep Job timestamp as millisecond (#18806 ) * [Dashboard] Keep Job timestamp as millisecond. Currently the `timestamp` is already in milliseconds (13 digits long), so there is no need to multiply it by 1000 in the dashboard UI. * Fix format with prettier * Use Number to convert timestamp	2021-09-24 10:31:54 -07:00
Guyang Song	89ce8a3a02	support 'CustomFields' tooltip in dashboard (#18698 )	2021-09-17 17:48:32 +08:00
Dominic Ming	97f71e15d4	[Dashboard] new dashboard event page for API Server event module (#18330 )	2021-09-09 19:43:48 +08:00
Amog Kamsetty	39d60f62d2	[hotfix] fix material-ui version once more (#16901 )	2021-07-06 13:57:34 -07:00
Simon Mo	b11b35aa45	hotfix material-ui version again (#16897 )	2021-07-06 11:08:57 -07:00
Amog Kamsetty	d5ac5c45ea	[Dashboard] Pin `material-ui/lab` dependency (#16890 )	2021-07-06 10:49:10 -07:00
Simon Mo	677514b3ff	Revert "[Dashboard] Actor Table UI Optimize (#15802 )" (#15981 ) This reverts commit `43be599a9a`.	2021-05-21 10:56:15 -07:00
Dominic Ming	43be599a9a	[Dashboard] Actor Table UI Optimize (#15802 )	2021-05-21 09:23:32 -07:00
Ian Rodney	7b1c5dbe0a	[Hotfix][Lint] Pin other ESlint Deps (#15816 )	2021-05-14 09:18:43 -07:00
Ashwin Hegde	4d8ed6dd5c	#13890 [new-dashboard] add object store memory column (#15697 )	2021-05-11 15:36:16 -05:00
Ian Rodney	c50490ccef	[Lint] Pin Prettier to 2.3.0 (#15721 )	2021-05-10 11:46:29 -07:00
Ian Rodney	11b5c6c702	[HotFix][Lint] Fix Lint because of Prettier update (#15720 )	2021-05-10 09:51:41 -07:00
Dmitri Gekhtman	410f768046	[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. (#14676 )	2021-04-05 17:16:20 -07:00
Eric Liang	9db000ff2c	Auto report object store memory usage; remove some deprecated code (#14260 )	2021-03-01 13:19:44 -08:00
niole	488f63efe3	[Dashboard] Make requests sent by the dashboard reverse proxy compatible (#14012 )	2021-02-24 18:31:59 -08:00
Kathryn Zhou	d6521be7ef	Export GPU metrics, CPU count, and additional Memory metrics to Prometheus (#14170 )	2021-02-22 10:04:18 -08:00
Kathryn Zhou	f6b5e838fe	Add disk and network metrics to Prometheus and fix dashboard (#14144 )	2021-02-17 10:27:14 -08:00
Dominic Ming	4b60c388ef	[Dashboard] fix new dashboard entrance and some table problem (#13790 )	2021-01-30 10:42:16 +08:00
Dominic Ming	752da83bb7	[Dashboard] Add the new dashboard code and prompt users to try it (#11667 )	2021-01-29 15:22:26 +08:00
Simon Mo	321bbe1ffb	[Dashboard] Fix GPU resource rendering issue (#13388 )	2021-01-14 12:23:21 -08:00
Max Fitton	25f7bdc0d8	[Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113 )	2020-12-30 14:19:56 -06:00
Sumanth Ratna	b7404e7955	[dashboard] Resolve npm vulnerabilities (#12620 ) * npm audit fix * npm dedupe	2020-12-08 10:26:49 -08:00
Max Fitton	34b9c7449b	[Dashboard] Fix object store memory display. (#12664 )	2020-12-07 21:40:49 -08:00
Max Fitton	a5c846c83b	[Dashboard][Bugfix] Filter dead nodes from Machine View (fixes duplicate node issue) (#12579 )	2020-12-02 14:08:14 -08:00
Max Fitton	2708b3abbc	[Dashboard][Bug] Fix duplicate node total rows in dashboard (#12410 ) * Fix duplicate node total rows in dashboard by changing the react key of the NodeTotalRow component from the node IP to the node ID (node IP can be duplicated in the case of docker). * simplify a piece of test code and fix a flaky time out * lint	2020-11-30 18:43:09 -08:00
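
The backwards-incompatible change noted in 5d6bc5360d (#25902), where the jobs listing becomes a list instead of a dictionary, can be absorbed on the client side with a small normalizing helper. The following is only a minimal sketch, not code from the Ray repository: `normalize_jobs` is a hypothetical name, and the old dictionary shape (keyed by the id that is now called `submission_id`) is an assumption inferred from the commit message.

```python
from typing import Any, Dict, List, Union

JobRecord = Dict[str, Any]


def normalize_jobs(
    payload: Union[List[JobRecord], Dict[str, JobRecord]]
) -> List[JobRecord]:
    """Return a list of job records from either response shape.

    Hypothetical helper: the new shape is a list of records, the old shape
    is assumed to be a dictionary mapping an id to a record.
    """
    if isinstance(payload, list):
        # New shape: already a list; copy each record so callers can mutate.
        return [dict(job) for job in payload]
    if isinstance(payload, dict):
        jobs: List[JobRecord] = []
        for key, record in payload.items():
            job = dict(record)
            # Assumption: the old dictionary key corresponds to what the
            # commit message now calls "submission_id".
            job.setdefault("submission_id", key)
            jobs.append(job)
        return jobs
    raise TypeError(f"unexpected jobs payload type: {type(payload).__name__}")
```

With a helper like this, calling code can iterate over job records the same way regardless of which side of the change the server is on.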

64 commits