hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
xiaofeng	9f8b596aaa	[serve][xlang]Support deploying Python deployment from Java. (#26877 ) In the previously merged pr(https://github.com/ray-project/ray/pull/22726/commits), java serve's support for python deployment was not implemented. This PR is used to implement this feature. Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>	2022-08-06 14:35:49 +08:00
Alex Wu	50e278f58b	[scheduler][autoscaler] Report placement resources for actor creation tasks (#26813 ) This change makes us report placement resources for actor creation tasks. Essentially, the resource model here is that a placement resource/actor creation task is a task that runs very quickly. Closes #26806 Co-authored-by: Alex <alex@anyscale.com>	2022-08-05 22:02:44 -07:00
Clark Zinzow	f017fcd826	[AIR - Datasets] Fix column assignment in Concatenator for Pandas 1.2. (#27531 ) Heterogeneous tensor column assignment with a list-of-tensors fails for Pandas 1.2, but succeeds with a manually constructed Pandas Series.	2022-08-05 19:44:12 -07:00
Clark Zinzow	dcc1da4ce3	[Core] Fix reference counting bug on objects borrowed for a cancelled actor creation. (#27298 ) This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled. Before this PR, when actor creation was cancelled before the creation task had been scheduled, the GCS-based actor manager would destroy the actor without replying to the task submission RPC from the actor-creating worker, so the reference counts on that worker never got cleaned up. This caused us to leak borrowed objects when such cancellation-before-scheduling happened for actors. This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.	2022-08-05 19:42:34 -07:00
Alan Guo	326b5bd1ac	Convert job_manager to be async (#27123 ) Updates the jobs API, the snapshot API, and the state API. Increases the jobs API version to 2. Signed-off-by: Alan Guo aguo@anyscale.com Why are these changes needed? Follow-up for #25902 (comment)	2022-08-05 19:33:49 -07:00
Nikita Vemuri	a82af8602c	[core] Support external ray dashboard URL (#27396 ) Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com Why are these changes needed? Support printing a Ray dashboard URL that the user specifies through environment variable. This can be helpful if the Ray dashboard is hosted externally.	2022-08-05 19:33:10 -07:00
Alex Wu	a6b9019d38	[log_monitor] Seek when reopening a file due to inode change (#27508 ) When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test). Closes (but not really until it's cherry-picked) #27507 Co-authored-by: Alex <alex@anyscale.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-08-05 18:27:43 -07:00
Clark Zinzow	293452dcba	[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449 ) This reverts commit `cf7305a`, and unreverts #25896. This was reverted due to a failing Windows test: #26287 We can merge once the failing Windows test (and all other relevant tests) pass.	2022-08-05 16:07:13 -07:00
Simon Mo	f6d19ac7c0	[Serve] Gate the deprecation warnings behind envvar (#27479 )	2022-08-05 13:38:44 -07:00
Clark Zinzow	313d553cfc	[Datasets] Avoid unnecessary reads when truncating a dataset with `ds.limit()` (#27343 ) Datasets currently eagerly kicks off all read tasks when truncating a dataset immediately after a read via ray.data.read_*().limit(); this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data. This PR avoids these unnecessary reads by truncating the blocklist to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.	2022-08-05 13:35:40 -07:00
Clark Zinzow	bfc38de009	[Datasets] [Docs] Improve `.limit()` and `.take()` docstrings (#27367 ) Improve docstrings for .limit() and .take(), making the distinction more clear. Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>	2022-08-05 12:17:24 -07:00
Richard Liaw	4629a3a649	[air/docs] Update Trainer documentation (#27481 ) Co-authored-by: xwjiang2010 <xwjiang2010@gmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-08-05 11:21:19 -07:00
zcin	22db41c21a	[Serve][doc] Modify and Combine Tensorflow, Pytorch, Sklearn Tutorials (#26817 )	2022-08-05 11:55:31 -05:00
zcin	8a9d994dd0	[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403 ) Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.	2022-08-05 11:31:00 -05:00
zcin	b5927caaae	[serve] Update version if import_path or runtime_env in config is changed (#27498 ) Previous PR that adds in lightweight config updates: https://github.com/ray-project/ray/pull/27000. It only tracks the config options for `deployments` (bumps version if certain deployment options are changed, but otherwise keeps versions the same). However we should bump the versions of all deployments if `import_path` or `runtime_env` is changed.	2022-08-05 11:30:22 -05:00
Jialing He	ccf411604e	Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308 )	2022-08-05 16:32:48 +08:00
Archit Kulkarni	1714d0266b	[Doc] [Serve] Refresh code for "monitoring" for 2.0 (#27400 )	2022-08-04 20:10:12 -07:00
Dmitri Gekhtman	b1d838446c	[autoscaler] Fix Prometheus metric autoscaler hang bug (#27532 ) Failed node launch can lead to an extra unexpected error in the node launcher due to the definition of a mock prometheus metric method. This failure leads to a permanently hanging autoscaler with "launching nodes" never cleared out and the autoscaler unable to proceed to launch nodes. This PR fixes the method signature leading to the unexpected failure.	2022-08-04 19:48:31 -07:00
Alex Wu	eb9c5d8fa7	[autoscaler][aws] Bump max keys per account (#27506 ) Signed-off-by: Alex Wu <alex@anyscale.io> This is a minor QoL improvement to bump the hardcoded limit for number of aws keys per account. The limit is arbitrary and has been bumped before. AFAICT the fundamental aws limit is a 5000 key per region limit which we are not close to.	2022-08-04 15:12:55 -07:00
Richard Liaw	b2cd34cc5c	[air] Remove checkpoint user guide and update key concepts and docstring (#27455 )	2022-08-04 08:55:26 -07:00
xwjiang2010	8d5c07b781	[air/train/docs] Add trainer user guide and update trainer docs (#27389 ) This PR adds a user guide to AIR for using Ray Train. It provides a high level overview of the trainers and removes redundant sections. The main file to review is here: doc/source/ray-air/trainer.rst. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-08-04 13:59:50 +01:00
Kai Fricke	b6765bb4f3	[air/tune/train] Update/fix API annotations (#27428 ) This bumps annotations to beta or demotes to DeveloperAPI Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-04 09:05:04 +01:00
Ricky Xu	8498a56fe2	[Core][fix] Increasing timeout on non-windows for test_metrics (#27379 ) The test was timing out. A normal pass was ~17secs.	2022-08-03 15:22:00 -07:00
Alan Guo	2cf9ecf48e	Make it so pydantic is required before we launch dashboard api server (#27345 ) Signed-off-by: Alan Guo <aguo@anyscale.com>	2022-08-03 14:24:51 -07:00
Balaji Veeramani	fd381927c1	[AIR] Add optional `mode` parameter and make `size` parameter optional (#27295 ) 1. If a user reads a folder with grayscale and color images, ImageFolderDatasource errors. 2. There's no way to retain image shapes. Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>	2022-08-03 13:20:46 -07:00
zcin	286343601a	[Serve] Enable lightweight config update (#27000 )	2022-08-03 11:49:41 -05:00
xwjiang2010	ff2b728e9a	[air] add tuner user guide (#26837 ) Co-authored-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-08-03 09:43:42 -07:00
Kai Fricke	20119c7022	[tune] Fix test_actor_reuse.py::ActorReuseMultiTest test (#27427 ) Increase time to allow for scheduling latency Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-03 13:54:11 +01:00
Kai Fricke	46ed3557ba	[tune] Fix test_resource_exhausted_info test (#27426 ) #27213 broke this test Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-03 13:53:46 +01:00
Simon Mo	4e07019b88	[Serve] Fix Graph Repeated Invocation (#27417 )	2022-08-03 01:40:19 -07:00
shrekris-anyscale	adc7c4dc87	[Serve] Make `serve.run()` and `deployment.bind()` beta APIs (#27401 )	2022-08-02 23:11:23 -07:00
Simon Mo	6084eb6a9f	Revert "Revert "[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685 )" (#27283 )" (#27348 )	2022-08-02 20:04:03 -07:00
Richard Liaw	6dc3dbdd37	[air] Update to beta (#27393 ) Update API references to beta. Needed as we are going to beta in 2.0. I left out RL/Scikit-Learn/HuggingFace.	2022-08-02 17:10:41 -07:00
Eric Liang	91a03026ef	[air] Fix BatchPredictor.predict_pipelined not working with GPU stage (#27232 )	2022-08-02 15:36:40 -07:00
Clark Zinzow	291a294208	[AIR - Serve] [Hotfix] Check for tensor extension via dtype rather than a NumPy conversion (#26891 ) Converting a Pandas DataFrame column to an ndarray (e.g. via df[col].values) can often result in a full copy of the column in order to construct the ndarray due to Pandas' 2D block management. This PR ports tensor extension type checking to checking the dtype, which is always an O(1) check. Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>	2022-08-02 14:52:46 -07:00
Ricky Xu	122eda2757	[Core] Move test_state_api test back to large test groups (#27377 ) Why are these changes needed? python/tests/test_state_api.py runs for 5min in normal run	2022-08-02 14:21:34 -07:00
Simon Mo	a9d94f740c	[Serve] Remove the warning for async handles in 2.0 (#27346 ) Signed-off-by: simon-mo <simon.mo@hey.com>	2022-08-02 15:07:41 -05:00
Jiajun Yao	cd2e590567	Support placement_group=None in PlacementGroupSchedulingStrategy (#27370 ) We decided to allow escaping the parent pg via `PlacementGroupSchedulingStrategy(placement_group=None)` instead of using "DEFAULT". Our doc is updated with that but in the code it's still not allowed.	2022-08-02 12:49:41 -07:00
Eric Liang	a1cb735035	Raise the (runtime_env max size) gRPC max message size to 500MiB Signed-off-by: Eric Liang <ekhliang@gmail.com>	2022-08-02 12:41:34 -07:00
Ricky Xu	82a24f9319	[Doc][Core][State Observability] Adding Python SDK doc and docstring (#26997 ) 1. Add doc for python SDK and docstrings on public SDK 2. Rename list -> ray_list and get -> ray_get for better naming 3. Fix some typos 4. Auto translate address to api server url. Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2022-08-02 11:24:59 -05:00
Kai Fricke	149c031c4b	[tune/release] Do not use spot instances in k8s tests (#27250 ) Spot instances are not being booted up, so let's go without them. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-02 11:30:41 +01:00
Yi Cheng	a9697722cf	[workflow] Change `step` to `task` in workflow. (#27330 ) * change step to task * fix comments Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>	2022-08-01 22:27:41 -07:00
Yi Cheng	00d22b6c7c	[core] Fix the test_failure_3.py in win (#27332 ) Win tests were broken because when the child is killed, the parent is also killed. Change the signal sent and make it work.	2022-08-01 18:55:07 -07:00
shrekris-anyscale	324d8e4bca	[Serve] Serialize `user_config` with JSON instead of Pickle (#26235 )	2022-08-01 17:53:43 -07:00
Eric Liang	f7ae8923f6	[docs] Reorganize the tensor data support docs; general editing (#26952 ) Why are these changes needed? Editing pass over the tensor support docs for clarity: Make heavy use of tabbed guides to condense the content Rewrite examples to be more organized around creating vs reading tensors Use doc_code for testing	2022-08-01 17:31:41 -07:00
shrekris-anyscale	cc84953da3	[Serve] [Docs] Update "Getting Started" documentation (#26745 )	2022-08-01 16:31:48 -07:00
jonathan-conder-sm	1d5fef2004	Fix dashboard with prometheus-client 0.14 (#23766 ) Why are these changes needed? The dashboard wasn't working (blank screen). See the linked issue for details. The cause is this exception in /tmp/ray/session_latest/logs/dashboard_agent.log: Traceback (most recent call last): File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module> loop.run_until_complete(agent.run()) File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete return future.result() File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run modules = self._load_modules() File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules c = cls(self) File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__ self._metrics_agent = MetricsAgent( File "/usr/local/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__ prometheus_exporter.new_stats_exporter( File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter exporter = PrometheusStatsExporter( File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__ self.serve_http() File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http start_http_server( File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server TmpServer.address_family, addr = _get_best_family(addr, port) File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family infos = socket.getaddrinfo(address, port) File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags): socket.gaierror: [Errno -2] Name or service not known There was a recent change in prometheus-client which passes the address given to start_http_server to socket.getaddrinfo. This prevents passing in an empty string, but we can get the same effect by passing None. Related issue number Closes #23765	2022-08-01 10:25:38 -07:00
Sihan Wang	410fe1b5ec	[Serve] Support Multiple DAG Entrypoints in DAGDriver (#26573 )	2022-08-01 09:16:36 -07:00
matthewdeng	83fb2fb21d	[tune] pin pymoo (#27311 ) Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-07-31 01:06:27 -07:00
SangBin Cho	ec69fec1e0	Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302 )" (#27242 ) This reverts commit `14dee5f6a3`.	2022-07-30 00:08:23 -07:00
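
The `ds.limit()` commit (#27343) describes truncating the block list to the minimum number of blocks needed to meet the row limit before any splitting happens. A minimal sketch of that idea in plain Python follows. It is not the Ray implementation: the function name `blocks_needed_for_limit` is hypothetical, and it assumes per-block row counts are already known from metadata.

```python
from typing import List, Tuple


def blocks_needed_for_limit(
    block_row_counts: List[int], limit: int
) -> Tuple[List[int], int]:
    """Pick the shortest prefix of blocks that covers `limit` rows.

    Returns the indices of the selected blocks and the number of rows
    to keep from the last selected block (hypothetical helper, for
    illustration only).
    """
    selected: List[int] = []
    rows_from_last = 0
    remaining = limit
    for index, num_rows in enumerate(block_row_counts):
        if remaining <= 0:
            # The limit is already satisfied; later blocks are never read.
            break
        selected.append(index)
        rows_from_last = min(num_rows, remaining)
        remaining -= rows_from_last
    return selected, rows_from_last
```

With four blocks of 100 rows and a limit of 150, only the first two blocks are selected and the second is cut to 50 rows, so the read tasks behind the other two blocks never need to run.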

7565 commits
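
The `[log_monitor]` commit (#27508) says that a file reopened because of an inode change must be positioned back at the last read offset. The sketch below illustrates that pattern with the standard library only. The class `TailedFile` and its method names are hypothetical and are not taken from Ray's log monitor; it assumes a POSIX filesystem where replacing a file produces a new inode and the replacement keeps the earlier content as a prefix.

```python
import os
from typing import List


class TailedFile:
    """Tail a file by path, surviving replacement of the underlying inode."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.position = 0  # byte offset just past the last complete line read
        self.inode = None
        self.handle = None

    def _ensure_open(self) -> None:
        current_inode = os.stat(self.path).st_ino
        if self.handle is None or current_inode != self.inode:
            if self.handle is not None:
                self.handle.close()
            self.handle = open(self.path, "rb")
            # The step the commit describes: seek back to the saved offset
            # instead of starting again from the top of the reopened file.
            self.handle.seek(self.position)
            self.inode = current_inode

    def read_new_lines(self) -> List[str]:
        self._ensure_open()
        lines = []
        while True:
            line = self.handle.readline()
            if not line.endswith(b"\n"):
                # End of file or a partially written line: rewind and stop.
                self.handle.seek(self.position)
                break
            self.position = self.handle.tell()
            lines.append(line[:-1].decode())
        return lines
```

Without the `seek` call, every line already reported would be emitted again after the file is replaced.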