7559 commits

Author SHA1 Message Date
Alex Wu
a6b9019d38
[log_monitor] Seek when reopening a file due to inode change (#27508)
When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test).
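The idea can be sketched as follows. This is an illustrative sketch only, not the actual log_monitor code: `TailedFile`, its attributes, and `read_new_lines` are hypothetical names, and it assumes the file that replaces the old one still contains the bytes that were already read.

```python
import os


class TailedFile:
    """Hypothetical helper: follows one log file by path across inode changes."""

    def __init__(self, path):
        self.path = path
        self.handle = None
        self.inode = None
        self.position = 0  # byte offset of the next unread byte

    def _reopen_if_needed(self):
        current_inode = os.stat(self.path).st_ino
        if self.handle is not None and current_inode == self.inode:
            return  # same file as before, keep using the open handle
        if self.handle is not None:
            self.handle.close()
        self.handle = open(self.path, "rb")
        self.inode = current_inode
        # The step described above: resume at the saved offset
        # instead of re-reading the file from byte 0.
        self.handle.seek(self.position)

    def read_new_lines(self):
        self._reopen_if_needed()
        lines = []
        while True:
            raw = self.handle.readline()
            if not raw:  # end of file reached
                break
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
        self.position = self.handle.tell()
        return lines
```

Without the `seek` call, the first read after an inode change would return every line of the file again rather than only the new ones.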

Closes (but not really until it's cherry-picked) #27507

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 18:27:43 -07:00
Clark Zinzow
293452dcba
[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449)
This reverts commit cf7305a, and unreverts #25896.

This was reverted due to a failing Windows test: #26287

We can merge once the failing Windows test (and all other relevant tests) pass.
2022-08-05 16:07:13 -07:00
Simon Mo
f6d19ac7c0
[Serve] Gate the deprecation warnings behind envvar (#27479) 2022-08-05 13:38:44 -07:00
Clark Zinzow
313d553cfc
[Datasets] Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)
Datasets currently eagerly kicks off all read tasks when truncating a dataset immediately after a read via ray.data.read_*().limit(); this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data.

This PR avoids these unnecessary reads by truncating the blocklist to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.
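A rough sketch of the truncation step, assuming per-block row counts are known up front. The function name and the plain list of counts are hypothetical and do not reflect the Datasets internals.

```python
from itertools import accumulate


def blocks_needed_for_limit(block_row_counts, limit):
    """Return how many leading blocks are needed to cover `limit` rows.

    Hypothetical helper: `block_row_counts` lists the row count of each
    block in order. Blocks past the returned index never need to be read.
    """
    if limit <= 0:
        return 0
    for num_blocks, total_rows in enumerate(accumulate(block_row_counts), start=1):
        if total_rows >= limit:
            return num_blocks
    # The dataset has fewer rows than the limit, so every block is needed.
    return len(block_row_counts)
```

For four blocks of 100 rows each and a limit of 150, only the first two blocks are kept; the second one is then split to hit the exact row count.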
2022-08-05 13:35:40 -07:00
Clark Zinzow
bfc38de009
[Datasets] [Docs] Improve .limit() and .take() docstrings (#27367)
Improve docstrings for .limit() and .take(), making the distinction clearer.

Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-08-05 12:17:24 -07:00
Richard Liaw
4629a3a649
[air/docs] Update Trainer documentation (#27481)
Co-authored-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 11:21:19 -07:00
zcin
22db41c21a
[Serve][doc] Modify and Combine Tensorflow, Pytorch, Sklearn Tutorials (#26817) 2022-08-05 11:55:31 -05:00
zcin
8a9d994dd0
[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
2022-08-05 11:31:00 -05:00
zcin
b5927caaae
[serve] Update version if import_path or runtime_env in config is changed (#27498)
The previous PR that added lightweight config updates (https://github.com/ray-project/ray/pull/27000) only tracks the config options for `deployments`: it bumps the version if certain deployment options are changed and otherwise keeps versions the same. However, we should bump the versions of all deployments if `import_path` or `runtime_env` is changed.
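The rule can be illustrated with a small sketch. The config layout (plain dicts with a `deployments` list of named entries) and the function name are assumptions made for illustration, not the Serve implementation. The source does not say which deployment options count as version-relevant, so this sketch treats any changed option as relevant.

```python
def deployments_needing_new_version(old_config, new_config):
    """Return the names of deployments whose version should be bumped.

    Hypothetical sketch: configs are dicts with optional "import_path",
    "runtime_env", and a "deployments" list of {"name": ..., ...} dicts.
    """
    app_level_keys = ("import_path", "runtime_env")
    old_deployments = {d["name"]: d for d in old_config.get("deployments", [])}
    new_deployments = {d["name"]: d for d in new_config.get("deployments", [])}

    # A change to import_path or runtime_env affects every deployment.
    if any(old_config.get(key) != new_config.get(key) for key in app_level_keys):
        return set(new_deployments)

    # Otherwise only deployments whose own options changed are bumped.
    return {
        name
        for name, options in new_deployments.items()
        if old_deployments.get(name) != options
    }
```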
2022-08-05 11:30:22 -05:00
Jialing He
ccf411604e
Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308) 2022-08-05 16:32:48 +08:00
Archit Kulkarni
1714d0266b
[Doc] [Serve] Refresh code for "monitoring" for 2.0 (#27400) 2022-08-04 20:10:12 -07:00
Dmitri Gekhtman
b1d838446c
[autoscaler] Fix Prometheus metric autoscaler hang bug (#27532)
A failed node launch can trigger an additional, unexpected error in the node launcher because of how a mock Prometheus metric method is defined.
That error leaves the autoscaler hanging permanently: "launching nodes" is never cleared out, so the autoscaler cannot proceed to launch nodes.

This PR fixes the method signature leading to the unexpected failure.
2022-08-04 19:48:31 -07:00
Alex Wu
eb9c5d8fa7
[autoscaler][aws] Bump max keys per account (#27506)
Signed-off-by: Alex Wu <alex@anyscale.io>

This is a minor QoL improvement to bump the hardcoded limit for the number of AWS keys per account. The limit is arbitrary and has been bumped before. AFAICT the fundamental AWS limit is 5000 keys per region, which we are not close to.
2022-08-04 15:12:55 -07:00
Richard Liaw
b2cd34cc5c
[air] Remove checkpoint user guide and update key concepts and docstring (#27455) 2022-08-04 08:55:26 -07:00
xwjiang2010
8d5c07b781
[air/train/docs] Add trainer user guide and update trainer docs (#27389)
This PR adds a user guide to AIR for using Ray Train. It provides a high level overview of the trainers and removes redundant sections.

The main file to review is here: doc/source/ray-air/trainer.rst.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-08-04 13:59:50 +01:00
Kai Fricke
b6765bb4f3
[air/tune/train] Update/fix API annotations (#27428)
This bumps annotations to beta or demotes them to DeveloperAPI.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-04 09:05:04 +01:00
Ricky Xu
8498a56fe2
[Core][fix] Increasing timeout on non-windows for test_metrics (#27379)
The test was timing out.

A normal pass was ~17secs.
2022-08-03 15:22:00 -07:00
Alan Guo
2cf9ecf48e
Make it so pydantic is required before we launch dashboard api server (#27345)
* Make it so pydantic is required before we launch dashboard api server

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-08-03 14:24:51 -07:00
Balaji Veeramani
fd381927c1
[AIR] Add optional mode parameter and make size parameter optional (#27295)
1. If a user reads a folder with grayscale and color images, ImageFolderDatasource errors.
2. There's no way to retain image shapes.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-08-03 13:20:46 -07:00
zcin
286343601a
[Serve] Enable lightweight config update (#27000) 2022-08-03 11:49:41 -05:00
xwjiang2010
ff2b728e9a
[air] add tuner user guide (#26837)
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-03 09:43:42 -07:00
Kai Fricke
20119c7022
[tune] Fix test_actor_reuse.py::ActorReuseMultiTest test (#27427)
Increase time to allow for scheduling latency

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-03 13:54:11 +01:00
Kai Fricke
46ed3557ba
[tune] Fix test_resource_exhausted_info test (#27426)
#27213 broke this test

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-03 13:53:46 +01:00
Simon Mo
4e07019b88
[Serve] Fix Graph Repeated Invocation (#27417) 2022-08-03 01:40:19 -07:00
shrekris-anyscale
adc7c4dc87
[Serve] Make serve.run() and deployment.bind() beta APIs (#27401) 2022-08-02 23:11:23 -07:00
Simon Mo
6084eb6a9f
Revert "Revert "[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685)" (#27283)" (#27348) 2022-08-02 20:04:03 -07:00
Richard Liaw
6dc3dbdd37
[air] Update to beta (#27393)
Update API references to beta. Needed as we are going to beta in 2.0.

I left out RL/Scikit-Learn/HuggingFace.
2022-08-02 17:10:41 -07:00
Eric Liang
91a03026ef
[air] Fix BatchPredictor.predict_pipelined not working with GPU stage (#27232) 2022-08-02 15:36:40 -07:00
Clark Zinzow
291a294208
[AIR - Serve] [Hotfix] Check for tensor extension via dtype rather than a NumPy conversion (#26891)
Converting a Pandas DataFrame column to an ndarray (e.g. via df[col].values) can often result in a full copy of the column in order to construct the ndarray due to Pandas' 2D block management. This PR ports tensor extension type checking to checking the dtype, which is always an O(1) check.

Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-08-02 14:52:46 -07:00
Ricky Xu
122eda2757
[Core] Move test_state_api test back to large test groups (#27377)
Why are these changes needed?
python/tests/test_state_api.py takes about 5 minutes in a normal run.
2022-08-02 14:21:34 -07:00
Simon Mo
a9d94f740c
[Serve] Remove the warning for async handles in 2.0 (#27346)
Signed-off-by: simon-mo <simon.mo@hey.com>
2022-08-02 15:07:41 -05:00
Jiajun Yao
cd2e590567
Support placement_group=None in PlacementGroupSchedulingStrategy (#27370)
We decided to allow escaping the parent pg via `PlacementGroupSchedulingStrategy(placement_group=None)` instead of using "DEFAULT". The docs already reflect this, but the code still does not allow it.
2022-08-02 12:49:41 -07:00
Eric Liang
a1cb735035
Raise the (runtime_env max size) gRPC max message size to 500MiB
Signed-off-by: Eric Liang <ekhliang@gmail.com>
2022-08-02 12:41:34 -07:00
Ricky Xu
82a24f9319
[Doc][Core][State Observability] Adding Python SDK doc and docstring (#26997)
1. Add docs for the Python SDK and docstrings on the public SDK
2. Rename list -> ray_list and get -> ray_get for better naming
3. Fix some typos
4. Auto-translate the address to the API server URL.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-08-02 11:24:59 -05:00
Kai Fricke
149c031c4b
[tune/release] Do not use spot instances in k8s tests (#27250)
Spot instances are not being booted up, so let's go without them.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-02 11:30:41 +01:00
Yi Cheng
a9697722cf
[workflow] Change step to task in workflow. (#27330)
* change step to task

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

* fix comments

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

* fix comments

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

* fix comments

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-08-01 22:27:41 -07:00
Yi Cheng
00d22b6c7c
[core] Fix the test_failure_3.py in win (#27332)
Windows tests were broken because killing the child also killed the parent. This changes the signal that is sent so the test works.
2022-08-01 18:55:07 -07:00
shrekris-anyscale
324d8e4bca
[Serve] Serialize user_config with JSON instead of Pickle (#26235) 2022-08-01 17:53:43 -07:00
Eric Liang
f7ae8923f6
[docs] Reorganize the tensor data support docs; general editing (#26952)
Why are these changes needed?
Editing pass over the tensor support docs for clarity:

* Make heavy use of tabbed guides to condense the content
* Rewrite examples to be more organized around creating vs reading tensors
* Use doc_code for testing
2022-08-01 17:31:41 -07:00
shrekris-anyscale
cc84953da3
[Serve] [Docs] Update "Getting Started" documentation (#26745) 2022-08-01 16:31:48 -07:00
jonathan-conder-sm
1d5fef2004
Fix dashboard with prometheus-client 0.14 (#23766)
Why are these changes needed?

The dashboard wasn't working (blank screen). See the linked issue for details. The cause is this exception in /tmp/ray/session_latest/logs/dashboard_agent.log:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
There was a recent change in prometheus-client which passes the address given to start_http_server to socket.getaddrinfo. This prevents passing in an empty string, but we can get the same effect by passing None.
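A small sketch of the workaround using only the standard library. The helper names are hypothetical, and the sketch only mirrors the `socket.getaddrinfo` call shown in the traceback rather than calling prometheus-client itself.

```python
import socket


def normalize_bind_address(addr):
    """Map an empty address string to None, which getaddrinfo accepts."""
    return addr or None


def resolve_bind_address(addr, port):
    """Resolve (addr, port) the way the failing code path does.

    On the system in the traceback, passing "" straight to
    socket.getaddrinfo raised socket.gaierror; None is accepted.
    """
    host = normalize_bind_address(addr)
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    family, _, _, _, sockaddr = infos[0]
    return family, sockaddr
```

Calling `resolve_bind_address("", 8000)` goes through the `None` path and returns an address family together with a socket address for port 8000.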

Related issue number
Closes #23765
2022-08-01 10:25:38 -07:00
Sihan Wang
410fe1b5ec
[Serve] Support Multiple DAG Entrypoints in DAGDriver (#26573) 2022-08-01 09:16:36 -07:00
matthewdeng
83fb2fb21d
[tune] pin pymoo (#27311)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-07-31 01:06:27 -07:00
SangBin Cho
ec69fec1e0
Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302)" (#27242)
This reverts commit 14dee5f6a3.
2022-07-30 00:08:23 -07:00
Yi Cheng
7da700e337
[core] Suppress the logging error when python exits and actor not deleted. (#27300)
In Python 3.9, the destruction order at interpreter exit seems to have changed, so by the time __del__ runs some components may already have been torn down and an error is thrown.
2022-07-29 23:12:41 -07:00
Eric Liang
e3d7930bb3
[data] When no pipeline stats are available, return a helpful message instead of empty #
Signed-off-by: Eric Liang <ekhliang@gmail.com>
2022-07-29 15:54:20 -07:00
Chris K. W
18109ec9f5
Deflake test_client_library_integration (#27105)
Using ray_start_regular_shared in test_tune_library_integration seems to make test_serve_handle flaky. Use a separate ray_start_regular instead.

No flake: [screenshot: Screen Shot 2022-07-27 at 1 10 59 PM]
2022-07-29 15:50:54 -07:00
Simon Mo
1a10b53a61
Revert "[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685)" (#27283)
This reverts commit 545c51609f.
2022-07-29 14:24:15 -07:00
Eric Liang
467430dd55
Fix logger initialization (#27238) 2022-07-29 10:55:15 -07:00
Simon Mo
545c51609f
[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685) 2022-07-29 09:50:17 -07:00