hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
shrekris-anyscale	b51d0aa8b1	[serve] Introduce `context.py` and `client.py` (#24067 ) Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files: * `context.py` * Intended to expose process-wide state to internal Serve code as well as `api.py` * Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables * `client.py` * Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`	2022-04-21 18:35:09 -05:00
jon-chuang	ddcc252b51	[Core] Ray logs API (1/n) (#23435 ) Expose HTTP endpoint to retrieve logs from ray cluster	2022-04-20 23:11:02 -07:00
Chu Xiangyang	6f74040b15	[Job] Fix typo in job sdk docstring (#23940 )	2022-04-20 12:30:32 -05:00
SangBin Cho	082baa2342	[Test] Fix test_log (#24004 ) The test verifies the first line 43~51 bytes are "dashboard" But due to recent code addition to head.py, the line where logs are written became 2 digits -> 3 digits Previously, 2022-04-18 23:23:56,946 INFO head.py:[less than 100] -- Dashboard head grpc address: 127.0.0.1:57208 Now 2022-04-18 23:23:56,946 INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208 So we should increase the bytes range.	2022-04-19 04:59:30 -07:00
SangBin Cho	1c3329fa38	Revert "Revert "[State Observability] Basic functionality for central… (#23933 ) …ized data (#23744)" (#23918)" This reverts commit `fb14e82`.	2022-04-18 21:15:43 -07:00
shrekris-anyscale	6151b75d9d	[serve] Move schema helpers out of `api.py` (#23934 )	2022-04-18 12:25:21 -05:00
mwtian	d5d2ef4249	[Core] Add a utility to check GCS / Ray cluster health (#23382 ) * Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well. * Improve the error message when the address provided to Ray does not contain port.	2022-04-18 09:58:45 -07:00
Amog Kamsetty	fb14e82242	Revert "[State Observability] Basic functionality for centralized data (#23744 )" (#23918 ) This reverts commit `51a4a1a802`. breaking tune multinode tests and kuberay:test_autoscaling_e2e	2022-04-14 14:28:42 -07:00
Tomasz Wrona	46e0162441	GcsPublisher is being constructed with unsupported position argument To avoid this error: (raylet) Traceback (most recent call last): (raylet) File "/home/iamhatesz/.pyenv/versions/alan-brain-py3.9/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module> (raylet) gcs_publisher = GcsPublisher(args.gcs_address) (raylet) TypeError: __init__() takes 1 positional argument but 2 were given	2022-04-14 10:47:57 -07:00
SangBin Cho	51a4a1a802	[State Observability] Basic functionality for centralized data (#23744 ) Support listing actor/pg/job/node/workers Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2 Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.	2022-04-14 07:33:18 -07:00
Tao Wang	6aefe9b36e	[Core]Save task spec in separate table (#22650 ) This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs. As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.	2022-04-12 12:24:26 -07:00
Amog Kamsetty	1d11963618	[Dashboard] Specify `@types/react` resolution (#23794 ) A new @types/react release has broken the dashboard build. Make sure to specify the older version under package resolutions.	2022-04-07 17:24:19 -07:00
shrekris-anyscale	a6bcb6cd1e	[serve] Create `application.py` (#23759 ) The `Application` class is stored in `api.py`. The object is relatively standalone and is used as a dependency in other classes, so this change moves `Application` (and `ImmutableDeploymentDict`) to a new file, `application.py`.	2022-04-07 10:34:24 -05:00
Stephanie Wang	b43426bc33	[core] Add metrics for disk and network I/O (#23546 ) Adds some metrics useful for object-intensive workloads: Per raylet/object manager: Add num bytes pending restore to spill manager Add num requests cumulative to PullManager Num bytes pushed/pulled from other nodes cumulative Histogram for request latencies in PullManager: total life time of request, from start to cancel request satisfaction time, from start to object local pull time, from object activation to object local Per-node disk read/write speed, IOPS	2022-04-01 11:15:34 -07:00
mwtian	1a4c3c07f7	[Dashboard] fix iterating over GPU processes (#23562 ) Current logic looks broken, as reported in #22954 (comment) I fixed the logic as best as I can, and tested it on Anyscale platform with GPU. No process info was reported from gpustat. But the logic works under this case.	2022-03-31 17:16:53 -07:00
Matti Picus	77c4c1e48e	WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449 )	2022-03-29 00:56:42 -07:00
mwtian	c2404cce62	avoid adding gpu utilization when unavailable (#23468 ) From #22954, GPU utilization can be unavailable for consumer hardware. So dashboard should not assume the value cannot be None. There might be a better way to represent "not reported". But currently utilizations are summed up which makes using non-zero to represent "not reported" hard to do.	2022-03-24 22:48:17 -07:00
Max Pumperla	60054995e6	[docs] fix doctests and activate CI (#23418 )	2022-03-24 17:04:02 -07:00
mwtian	51feac9868	Clean up dev docs (#23407 )	2022-03-22 23:22:56 -07:00
shrekris-anyscale	b00977b1b1	[serve] Remove dashboard's dependency on Serve (#23389 )	2022-03-21 22:14:41 -07:00
Guyang Song	69af9764b2	[runtime env] URI reference refactor (#22828 ) - Move the URI reference logic from raylet to agent. - Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible` - More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528 Future works - We don't remove the `RuntimeEnvUris` from `RuntimeEnv` protobuf in current PR because gcs also uses those URIs to do GC by runtime_env_manager. We should also clear this. - Ray client server shouldn't interact with agent directly. Or Ray client server should also decrease the reference count. - Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job. So we should make sure this function is idempotent. Can we change this logic and make this function be called only once?	2022-03-21 11:21:15 -05:00
shrekris-anyscale	75b7465ba4	[serve] Reject Ray client addresses when submitting via Dashboard (#23339 ) Some commands in the Serve CLI use Ray client and some commands ping the Ray dashboard; however, all commands read `RAY_ADDRESS` to get the address. This change raises a nice exception if the user accidentally passes a Ray client address as the Ray Dashboard address.	2022-03-21 11:17:51 -05:00
shrekris-anyscale	c668039020	[serve] Restore "Get new handle to controller if killed" (#23283 ) (#23338 ) #23336 reverted #23283. #23283 did pass CI before merging. However, when it merged, it began to fail because it used commands that were outdated on the Master branch in `test_cli.py` (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its tests commands.	2022-03-18 18:40:08 -05:00
shrekris-anyscale	87e77bebb4	Revert "[serve] Get new handle to controller if killed (#23283 )" (#23336 ) This reverts commit `9f6d96a2fd`.	2022-03-18 13:47:57 -05:00
Jialing He	4a83bc3dc2	[runtime env] Support set timeout for runtime env setup (#23082 ) Interface example: ```python @ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10)) def f(): pass @ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}}) def f(): pass ``` Support set timeout second for timeout of runtime environment creation. Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>	2022-03-18 12:52:59 -05:00
Archit Kulkarni	76bb5396c7	[Doc] [jobs] Add links to Job Submission and improve doc (#23209 ) - Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested. - Adds docstrings for the Jobs SDK, which automatically show up in the API reference - Improve the Job Submission main page - Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-18 12:52:13 -05:00
shrekris-anyscale	9f6d96a2fd	[serve] Get new handle to controller if killed (#23283 ) `serve shutdown` is not idempotent with the new Serve CLI. When serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, which fail. This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.	2022-03-18 11:47:18 -05:00
shrekris-anyscale	aaf47b2493	[serve] Implement `serve.run()` and `Application` (#23157 ) These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-18 11:12:09 -05:00
Jiajun Yao	62a5404369	Collect more usage stats data (#23167 )	2022-03-17 19:33:27 -07:00
Archit Kulkarni	77090144a2	[jobs] Add `entrypoint` field to JobInfo (#23253 )	2022-03-16 22:02:22 -05:00
mwtian	391901f86b	[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233 ) Continue to clean up Redis and other related Redis references, for - gcs_utils.py - log_monitor.py - `publish_error_to_driver()`	2022-03-16 19:34:57 -07:00
SangBin Cho	b350fe9ee8	[Nightly test] Fix additional k8s issues + add new tests (#23231 ) Fix bug from the previous fixes. Add more tests Stop using m5.xlarge (not supported now) There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered as a high priority to be fixed soon.	2022-03-16 16:37:29 -07:00
Archit Kulkarni	8707eb6288	[runtime env] Support `.whl` files in `py_modules` (#22368 ) The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap in this is if the local Python module is in the form of a wheel (`.whl` file.) This PR adds the missing support for uploading and installing the `.whl` file.	2022-03-16 16:37:10 -05:00
shrekris-anyscale	34ebb3409e	[serve] Make Dashboard start Serve in the "serve" namespace (#23198 ) The Ray Dashboard starts Serve in the `"_ray_internal_dashboard"` namespace. However, Serve by default starts in the `"serve"` namespace. This causes surprising behavior when working with the Serve CLI and REST API. This change make the Ray Dashboard start Serve in the `"serve"` namespace, allowing the REST API to work intuitively with the Python API.	2022-03-16 12:03:44 -05:00
mwtian	72ef9f91aa	[Remove Redis Pubsub 1/n] Remove `enable_gcs_pubsub()` (#23189 ) GCS pubsub has been the default for awhile. There is little chance that we would need to revert back to Redis pubsub in future. This is the step in removing Redis pubsub, by first removing the `enable_gcs_pubsub()` feature guard.	2022-03-15 23:56:15 -07:00
Tomas Babej	7a1d10a3d0	[Job submission] Set headers when establishing websocket (#23111 )	2022-03-15 16:20:44 -05:00
Guyang Song	f65971756d	[dashboard agent] Catch agent port conflict (#23024 )	2022-03-15 16:09:15 +08:00
Archit Kulkarni	e8496374e2	[Jobs] Test job submit with no specified ray address (#23119 )	2022-03-14 13:44:06 -05:00
Jialing He	39a6c054d3	[runtime env][feature] introduce pip_check_enable and pip_version (#22826 )	2022-03-14 23:41:19 +08:00
Yi Cheng	4f86b5b523	[gcs] Remove `use_gcs_for_bootstrap` in core (python) and autoscaler (#23050 ) This is part of cleanup PR for Redisless Ray. This PR remove use_gcs_for_bootstrap in core and autoscaler.	2022-03-11 14:36:16 -08:00
Archit Kulkarni	52a722ffe7	[jobs] Make local pip/conda requirements files work with jobs (#22849 )	2022-03-10 15:15:16 -06:00
Yi Cheng	bb5fa6b851	Remove redis in setup.py (#22979 )	2022-03-10 11:05:03 -08:00
Archit Kulkarni	c78bd809ce	[job submission] Support local py_modules in jobs (#22843 )	2022-03-10 11:42:25 -06:00
shrekris-anyscale	1100c98222	[serve] Implement Serve Application object (#22917 ) The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.	2022-03-10 10:28:29 -06:00
shrekris-anyscale	bc82e2d5c4	[serve] Restore "[serve] Support working_dir in serve run (#22760 )" (#22971 )	2022-03-09 21:31:23 -08:00
Kai Fricke	15601ed79b	Revert "[serve] Support `working_dir` in `serve run` (#22760 )" (#22956 ) This reverts commit `ab2741d64b`. The PR breaks ray job submission for anyscale:// URLs	2022-03-09 17:04:46 +00:00
shrekris-anyscale	ab2741d64b	[serve] Support `working_dir` in `serve run` (#22760 ) #22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.	2022-03-08 13:18:41 -06:00
shrekris-anyscale	521298e093	[serve] Make route prefix the deployment name by default (#22840 ) The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which make `route_prefix` the deployment's name when omitted. This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access. This change also fixes a bug in Serve where when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.	2022-03-06 20:03:31 -06:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.	2022-03-04 12:32:17 -08:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00

1 2 3 4 5 ...

459 commits