In the previously merged pr(https://github.com/ray-project/ray/pull/22726/commits), java serve's support for python deployment was not implemented. This PR is used to implement this feature.
Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>
This change makes us report placement resources for actor creation tasks. Essentially, the resource model here is that a placement resource/actor creation task is a task that runs very quickly.
Closes#26806
Co-authored-by: Alex <alex@anyscale.com>
This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled.
Before this PR, when actor creation is cancelled before the creation task has been scheduled, the GCS-based actor manager would would destroy the actor without replying to the task submission RPC from the actor creating worker, resulting in the reference counts on that worker to never get cleaned up. This caused us to leak borrowed objects when such cancellation-before-scheduling happened for actors.
This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.
Updates jobs api
Updates snapshot api
Updates state api
Increases jobs api version to 2
Signed-off-by: Alan Guo aguo@anyscale.com
Why are these changes needed?
follow-up for #25902 (comment)
Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com
Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through environment variable. This can be helpful if the Ray dashboard is hosted externally.
When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test).
Closes (but not really until it's cherry-picked) #27507
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This reverts commit cf7305a, and unreverts #25896.
This was reverted due to a failing Windows test: #26287
We can merge once the failing Windows test (and all other relevant tests) pass.
Datasets currently eagerly kicks off all read tasks when truncating a dataset immediately after a read via ray.data.read_*().limit(); this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data.
This PR avoids these unnecessary reads by truncating the blocklist to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
Previous PR that adds in lightweight config updates: https://github.com/ray-project/ray/pull/27000. It only tracks the config options for `deployments` (bumps version if certain deployment options are changed, but otherwise keeps versions the same). However we should bump the versions of all deployments if `import_path` or `runtime_env` is changed.
Failed node launch can lead to an extra unexpected error in the node launcher due to the definition of a mock prometheus metric method.
This failure leads to a permanently hanging autoscaler with "launching nodes" never cleared out and the autoscaler unable to proceed to launch nodes.
This PR fixes the method signature leading to the unexpected failure.
Signed-off-by: Alex Wu <alex@anyscale.io>
This is a minor QoL improvement to bump the hardcoded limit for number of aws keys per account. The limit is arbitrary and has been bumped before. AFAICT the fundamental aws limit is a 5000 key per region limit which we are not close to.
This PR adds a user guide to AIR for using Ray Train. It provides a high level overview of the trainers and removes redundant sections.
The main file to review is here: doc/source/ray-air/trainer.rst.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
1. If a user reads a folder with grayscale and color images, ImageFolderDatasource errors.
2. There's no way to retain image shapes.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Converting a Pandas DataFrame column to an ndarray (e.g. via df[col].values) can often result in a full copy of the column in order to construct the ndarray due to Pandas' 2D block management. This PR ports tensor extension type checking to checking the dtype, which is always an O(1) check.
Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
We decided to allow escaping the parent pg via `PlacementGroupSchedulingStrategy(placement_group=None)` instead of using "DEFAULT". Our doc is updated with that but in the code it's still not allowed.
1. Add doc for python SDK and docstrings on public SDK
2. Rename list -> ray_list and get -> ray_get for better naming
3. Fix some typos
4. Auto translate address to api server url.
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Why are these changes needed?
Editing pass over the tensor support docs for clarity:
Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
Why are these changes needed?
The dashboard wasn't working (blank screen). See the linked issue for details. The cause is this exception in /tmp/ray/session_latest/logs/dashboard_agent.log:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
loop.run_until_complete(agent.run())
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
modules = self._load_modules()
File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
c = cls(self)
File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
self._metrics_agent = MetricsAgent(
File "/usr/local/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
prometheus_exporter.new_stats_exporter(
File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
exporter = PrometheusStatsExporter(
File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
self.serve_http()
File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
start_http_server(
File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
There was a recent change in prometheus-client which passes the address given to start_http_server to socket.getaddrinfo. This prevents passing in an empty string, but we can get the same effect by passing None.
Related issue number
Closes#23765