hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
SangBin Cho	b350fe9ee8	[Nightly test] Fix additional k8s issues + add new tests (#23231 ) Fix bug from the previous fixes. Add more tests. Stop using m5.xlarge (not supported now). There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered high priority to be fixed soon.	2022-03-16 16:37:29 -07:00
Archit Kulkarni	8707eb6288	[runtime env] Support `.whl` files in `py_modules` (#22368 ) The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap in this is if the local Python module is in the form of a wheel (`.whl` file). This PR adds the missing support for uploading and installing the `.whl` file.	2022-03-16 16:37:10 -05:00
shrekris-anyscale	34ebb3409e	[serve] Make Dashboard start Serve in the "serve" namespace (#23198 ) The Ray Dashboard starts Serve in the `"_ray_internal_dashboard"` namespace. However, Serve by default starts in the `"serve"` namespace. This causes surprising behavior when working with the Serve CLI and REST API. This change makes the Ray Dashboard start Serve in the `"serve"` namespace, allowing the REST API to work intuitively with the Python API.	2022-03-16 12:03:44 -05:00
mwtian	72ef9f91aa	[Remove Redis Pubsub 1/n] Remove `enable_gcs_pubsub()` (#23189 ) GCS pubsub has been the default for a while. There is little chance that we would need to revert to Redis pubsub in the future. This is the first step in removing Redis pubsub: removing the `enable_gcs_pubsub()` feature guard.	2022-03-15 23:56:15 -07:00
Tomas Babej	7a1d10a3d0	[Job submission] Set headers when establishing websocket (#23111 )	2022-03-15 16:20:44 -05:00
Guyang Song	f65971756d	[dashboard agent] Catch agent port conflict (#23024 )	2022-03-15 16:09:15 +08:00
Archit Kulkarni	e8496374e2	[Jobs] Test job submit with no specified ray address (#23119 )	2022-03-14 13:44:06 -05:00
Jialing He	39a6c054d3	[runtime env][feature] introduce pip_check_enable and pip_version (#22826 )	2022-03-14 23:41:19 +08:00
Yi Cheng	4f86b5b523	[gcs] Remove `use_gcs_for_bootstrap` in core (python) and autoscaler (#23050 ) This is part of the cleanup PRs for Redisless Ray. This PR removes use_gcs_for_bootstrap in core and autoscaler.	2022-03-11 14:36:16 -08:00
Archit Kulkarni	52a722ffe7	[jobs] Make local pip/conda requirements files work with jobs (#22849 )	2022-03-10 15:15:16 -06:00
Yi Cheng	bb5fa6b851	Remove redis in setup.py (#22979 )	2022-03-10 11:05:03 -08:00
Archit Kulkarni	c78bd809ce	[job submission] Support local py_modules in jobs (#22843 )	2022-03-10 11:42:25 -06:00
shrekris-anyscale	1100c98222	[serve] Implement Serve Application object (#22917 ) The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.	2022-03-10 10:28:29 -06:00
shrekris-anyscale	bc82e2d5c4	[serve] Restore "[serve] Support working_dir in serve run (#22760 )" (#22971 )	2022-03-09 21:31:23 -08:00
Kai Fricke	15601ed79b	Revert "[serve] Support `working_dir` in `serve run` (#22760 )" (#22956 ) This reverts commit `ab2741d64b`. The PR breaks ray job submission for anyscale:// URLs	2022-03-09 17:04:46 +00:00
shrekris-anyscale	ab2741d64b	[serve] Support `working_dir` in `serve run` (#22760 ) #22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.	2022-03-08 13:18:41 -06:00
shrekris-anyscale	521298e093	[serve] Make route prefix the deployment name by default (#22840 ) The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which makes `route_prefix` the deployment's name when omitted. This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access. This change also fixes a bug in Serve where when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.	2022-03-06 20:03:31 -06:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As we are making redisless ray the default, the dashboard doesn't need to talk to redis anymore. Instead it should talk to gcs, and gcs can talk to redis.	2022-03-04 12:32:17 -08:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Dmitri Gekhtman	4acbf36453	[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688 ) Closes #21353 and fixes an issue that causes dashboard to read K8s CPU requests rather than resources when determining CPUs available.	2022-03-01 17:15:59 -08:00
Edward Oakes	2a09561edf	[serve] Enable REST API tests with main clause (#22706 ) Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>	2022-03-01 11:21:22 -06:00
shrekris-anyscale	49ee443231	[serve] Add Serve CLI commands for REST API (#22648 )	2022-02-28 20:45:46 -06:00
Archit Kulkarni	85657b1377	[Doc] [Jobs] add CLI and SDK reference to docs (#22680 )	2022-02-28 17:57:46 -06:00
Jialing He	aa1885ae2a	[runtime env] Make plugin setup process that has not been refactored run in threads. (#22588 ) I recently realized that during a runtime_env creation process, a plugin/manager that is very slow to set up may block the creation of other runtime_envs, so I made plugin/manager setup run in threads. [The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.	2022-02-28 17:33:13 +08:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
shrekris-anyscale	8548affdc2	Increase `test_failed_job_status` timeout in `test_job_submission` (#22643 ) `test_job_submission` has become [flakey](https://flakey-tests.ray.io/) due to timeout. This change increases the timeout in `test_failed_job_status` from 10 to 25 seconds.	2022-02-25 10:08:55 -08:00
shrekris-anyscale	e85540a1a2	[serve] Expose deployment statuses in REST API (#22611 )	2022-02-25 08:41:07 -06:00
shrekris-anyscale	a9ede4e499	[serve] Add REST API (#22578 ) This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.	2022-02-24 10:00:26 -06:00
Stephanie Wang	abf2a70a29	[core] Add task and object reconstruction status to ray memory (#22317 ) Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be: ``` // The task is waiting for its dependencies to be created. WAITING_FOR_DEPENDENCIES = 1; // All dependencies have been created and the task is scheduled to execute. SCHEDULED = 2; // The task finished successfully. FINISHED = 3; ``` In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output: ``` IP Address \| PID \| Type \| Call Site \| Status \| Size \| Reference Type \| Object Ref 192.168.4.22 \| 279475 \| Driver \| (task call) ... \| Attempt #2: FINISHED \| 10000254.0 B \| LOCAL_REFERENCE \| c2668a65bda616c1ffffffffffffffffffffffff0100000001000000 ```	2022-02-22 21:26:21 -08:00
shrekris-anyscale	40fa56f40c	[serve] Add JSON schemas for REST API (#22547 )	2022-02-22 21:36:42 -06:00
SangBin Cho	36a31cb6fd	[Usage Stats] Implement usage stats report "Turned off by default". (#22249 ) This is the second PR to implement usage stats on Ray. Please refer to the file usage_lib.py for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. This adds a dashboard module to enable usage stats. Usage stats report is turned off by default after this PR. We can control the report (enablement, report period, and URL) using env variables. Note that the URL is strictly for testing. NOTE: This requires us to add `requests` to the default library. `requests` should be okay to include because 1. it is extremely lightweight. It is implemented only with built-in libs. 2. It is really stable. The project basically claims they are "deprecated", meaning no new features will be added there. cc @edoakes @richardliaw for the approval. For the HTTP request, I also considered httpx, but it was not as lightweight as `requests`. So I decided to implement async requests using the thread pool.	2022-02-22 15:32:02 -08:00
Edward Oakes	58e5f0140d	[jobs] Rename JobData -> JobInfo (#22499 ) `JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.	2022-02-22 16:18:16 -06:00
Guyang Song	902243fb03	[runtime env] support raylet sharing fate with agent (#22382 ) - Remove the agent restart feature. - Raylet shares fate with agent to make the failover logic easier. Refer to issue https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528	2022-02-21 18:16:21 +08:00
Guyang Song	57a94aae12	[runtime env][bugfix] Fix runtime env retry (#22495 ) - Bug: `error_message` is not cleared when the retry succeeds. This bug led to runtime env creation failing. - Add test case for this.	2022-02-18 17:09:06 -08:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	63a5eb492d	Revert "[serve] Add basic REST API to dashboard (#22257 )" (#22414 ) This reverts commit `f37f35c5da`.	2022-02-15 21:47:50 -06:00
Edward Oakes	f37f35c5da	[serve] Add basic REST API to dashboard (#22257 )	2022-02-15 15:36:58 -06:00
Jialing He	192f9de421	[runtime env] Introduce async Manager.create (#22311 )	2022-02-14 16:26:47 -06:00
Liu Bao	824453dd17	[runtime env] Create virtualenv for pip runtime env. (#21801 )	2022-02-10 12:25:18 -06:00
Edward Oakes	5df2a0a6c6	[jobs] Add test condition that job runs w/o CPUs available on head node (#22260 )	2022-02-10 10:23:02 -06:00
Archit Kulkarni	50e2bef9d0	[Jobs] Hide `dashboard` from Job Submission import path (#22223 ) For public SDK APIs, change the import path from ```python from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo from ray.dashboard.modules.job.sdk import JobSubmissionClient ``` to ```python from ray.job_submission import JobStatus, JobSubmissionClient ``` `JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.	2022-02-09 13:55:32 -06:00
SangBin Cho	20ab9188c6	[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170 ) This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files. The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. After this PR, we will add code to enable usage report "off by default".	2022-02-08 22:12:36 -08:00
Nikita Vemuri	d19aaf0fd3	[jobs] Add unit test for `parse_cluster_info` (#22205 ) Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.	2022-02-08 11:22:28 -06:00
Edward Oakes	8806b2d5c4	[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180 )	2022-02-07 15:25:25 -06:00
Jiao	a692e7d05e	[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938 ) As titled. We have a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an http address but restarted the local ray cluster. In this case we will try to do job submission with an http prefixed address. Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2022-02-04 17:51:43 -06:00
SangBin Cho	ea4079465d	[Runtime Env] Support runtime env error message for actors (#22109 )	2022-02-04 15:32:02 -06:00
Nikita Vemuri	d9dc388082	[jobs] Support ray client format of connection string address for external module (#22116 ) Ray client currently supports connection strings for external modules of the format `"other_module://"`, however `ray job` commands don't support this format because trailing `/` is removed. Update so `ray job` commands also support this format.	2022-02-04 13:35:10 -06:00
SangBin Cho	d7fc7d2e9d	[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032 ) This is the PR to write better runtime env exceptions. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers. The first PR only handles task exceptions. TODO - [x] Task (this PR) - [ ] Actor - [ ] Turn off runtime env logs & improve error msgs	2022-02-03 16:47:04 -08:00
Archit Kulkarni	78f882dbbc	[runtime env] Local uri caching for working_dir, py_modules and conda (#20273 ) Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them. In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache. When the cache exceeds its size limit (default 10 GB) it will delete unused URIs until the cache is back under the size limit or there are no more unused URIs. Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit - Adds unit tests for caching and integration tests for working_dir caching	2022-02-02 14:53:03 -06:00
Edward Oakes	e85bbfb338	[jobs] Enable default port in `http://` addresses (#22014 ) Closes https://github.com/ray-project/ray/issues/22012	2022-02-02 14:34:34 -06:00
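
Note on 78f882dbbc (#20273): the caching policy described in that commit message (mark a URI as unused when its reference count drops to zero, then delete unused URIs only once the cache exceeds its size limit) can be sketched roughly as below. This is an illustrative sketch, not the code from the PR: the class name `URICacheSketch`, its method names, the `delete_fn` callback, and the oldest-unused-first eviction order are all assumptions made for the example. Only the 10 GB default and the "evict unused entries until under the limit or none remain" rule come from the commit message.

```python
from collections import OrderedDict


class URICacheSketch:
    """Hypothetical sketch of the policy in #20273; not Ray's actual class."""

    def __init__(self, max_total_size_bytes=10 * 1024**3, delete_fn=lambda uri: None):
        # 10 GB default, as stated in the commit message.
        self.max_total_size_bytes = max_total_size_bytes
        # Callback that removes the local files for a URI (assumed interface).
        self._delete_fn = delete_fn
        self._used = {}                # uri -> size; still referenced
        self._unused = OrderedDict()   # uri -> size; oldest unused entry first

    def total_size(self):
        return sum(self._used.values()) + sum(self._unused.values())

    def add(self, uri, size_bytes):
        """Record a URI as in use, whether new or previously unused."""
        self._unused.pop(uri, None)
        self._used[uri] = size_bytes
        self._evict_if_needed()

    def mark_unused(self, uri):
        """Called when the reference count for a URI drops to zero."""
        self._unused[uri] = self._used.pop(uri)
        self._evict_if_needed()

    def _evict_if_needed(self):
        # Delete unused URIs until the cache is back under the limit
        # or there are no more unused URIs. In-use URIs are never deleted.
        while self.total_size() > self.max_total_size_bytes and self._unused:
            uri, _ = self._unused.popitem(last=False)
            self._delete_fn(uri)
```

With a small limit, for instance `URICacheSketch(max_total_size_bytes=10, delete_fn=print)`, adding two in-use URIs of size 6 deletes nothing even though the cache is over the limit. The first one marked unused is then deleted immediately, because the cache is still over the limit at that point.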

378 commits