Commit graph

7556 commits

Author SHA1 Message Date
Amog Kamsetty
acc4903db1
[AIR/Serve] Auto-enable GPU Prediction (#26549)
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-08-29 13:47:56 -07:00
ZhuSenlin
c7a3bcc232
[Core] fix resource leak when cancel actor in phase of creating (#27742)
t is quite easy to cause resource/process leak when cancel an actor which constructor is time-consuming.
2022-08-29 09:38:16 -07:00
Jiajun Yao
e6b0d5f95d
Revert "Don't include script directory in sys.path if it's started via python -m (#28043)" (#28139)
This reverts commit b41ee37c3a.
2022-08-26 21:37:21 -07:00
Kilian Lieret
67a7481972
[docs/tune] Fix loguniform range in tune tutorial (#28131) 2022-08-26 17:08:00 -07:00
Amog Kamsetty
00f6273775
[Docs] [Tune] ResultGrid Docs and API reference (#28068)
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-26 16:50:35 -07:00
Kai Fricke
f59dcdc049
[tune] Re-enable progress metric detection (#28130)
The API cleanup in #27060 introduced a regression when merging latest master - changes from #26967 were effectively disabled, retaining cluttered output in rllib with verbose=2.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-26 16:44:20 -07:00
Kai Fricke
d0678b80ed
[rfc] [air/tune/train] Improve trial/training failure error printing (#27946)
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.

The main problems here are:

1. Tracebacks are printed multiple times: In the remote worker and on the driver
2. Tracebacks include many internal wrappers

The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).

The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.

### Deduplicating traceback printing

The solution here is to use `logger.error` instead of `logger.exception` in the `function_trainable.py` to avoid printing a traceback in the trainable. 

Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).

To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs per default.

### Removing internal wrappers from tracebacks

The solution here is to introcude a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stacktrace. A utility `shorten_tb` looks for the last occurence of `_ray_start_tb` in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned - this means that if the error does not come up in user code, the full traceback is returned, giving visibility in possible internal bugs. Additionally there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-26 15:02:38 -07:00
Antoni Baum
ea483ecf7a
[AIR][Docs] Clarify how LGBM/XGB trainers work (#28122) 2022-08-26 14:51:22 -07:00
Kai Fricke
3b3aa80ba3
[tune/ci] Fix link to SigOpt experiment API (#28127)
Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-26 14:10:53 -07:00
Jiajun Yao
b41ee37c3a
Don't include script directory in sys.path if it's started via python -m (#28043)
According to https://peps.python.org/pep-0338/
> The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory contain the main module.

We should follow this and don't add the driver script directory to worker's sys.path. I couldn't find a way to detect that the driver is run via `python -m` but instead we don't add the script directory to worker's sys.path if it doesn't exist in driver's sys.path.
2022-08-26 13:27:08 -07:00
Jiajie Li
6c69ee9a97
Add actor_id in RayActorError (#27802)
For people who want to have better control over the node failures, and handle the error such as RayActorError by themselves. I think it's necessary to make things like actor_id as an attributed of the error.

Signed-off-by: Jiajie Li <ljjsalt@gmail.com>
2022-08-26 10:46:08 -07:00
Kai Fricke
cf94a31e7a
[CI] Pin moto to < 4.0.0. (#28098) 2022-08-25 07:55:25 -07:00
Kai Fricke
e0725d1f1d
[docs/ci] Fix (some) broken linkchecks (#28087)
Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-25 04:41:35 -07:00
Steven Morad
ad2bf69548
[AIR; RLlib] Log histograms in wandb. (#28081) 2022-08-24 08:21:14 -07:00
Cheng Su
debe0cc91f
[Datasets] Re-enable Parquet sampling and add progress bar (#28021) 2022-08-22 16:59:26 -07:00
Dmitri Gekhtman
227aef381a
Update Kuberay version in CI. (#27967)
Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0).
Also removes a couple of old files.

Will open a corresponding cherry pick in the Ray 2.0.0 branch.
The key thing to verify is that the CI autoscaling test passes here and in the PR and in the PR against the 2.0.0 branch.
2022-08-20 14:50:52 -07:00
Alex Wu
f886d9737c
[autoscaler][observability] Provide more detailed events when autoscaler fails to launch a node. (#27891)
This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things:

Less verbose logging for node provider exceptions (printed to logs only, not driver)
Don't print to driver "adding 1 node(s) of type ..." when nodes don't launch (still print it if the node launch is successful).
Print to driver "Failed to launch ..."
Don't log a full exception to the driver.
The full driver event looks like this

```
Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c.
```

Co-authored-by: Alex <alex@anyscale.com>
2022-08-19 16:27:02 -07:00
xiaofeng
af488e1cc2
Revert "Revert "[serve][xlang]Support deploying Python deployment from Java. …" (#27945) 2022-08-18 17:57:37 -07:00
Dmitri Gekhtman
98c90b8488
[clusters][docs] Provide urls to content, fix typos (#27936) 2022-08-18 11:33:04 -07:00
Jian Xiao
440ae620eb
Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964)
There is a risk of using too much of memory in StatsActor, because its lifetime is the same as cluster lifetime.
This puts a cap on how many stats to keep, and purge the stats in FIFO order if this cap is exceeded.
2022-08-18 10:25:31 -07:00
Cheng Su
45e5e8c6ea
[Datasets] Customized serializer for Arrow JSON ParseOptions in read_json (#27911)
This PR is to add customized serializer of Arrow JSON ParseOptions for read_json. We found user wanted to read JSON file with ParseOptions, but it's currently not working due to pickle issue (detail of post). So here we add a customized serializer for ParseOptions as a workaround for now, similar to #25821.

Signed-off-by: Cheng Su <scnju13@gmail.com>
2022-08-18 10:00:56 -07:00
Chen Shen
6be4bf8be3
[hotfix] Fix pytest dependency in test_utils (#27956)
import pytest in test_utils breaks a bunch of test.
2022-08-17 12:16:08 -07:00
Jian Xiao
2878119ece
Optimize groupby/mapgroups performance (#27805)
For the following script, it took 75-90 mins to finish the groupby().map_groups() before, and with this PR it finishes in less than 10 seconds.

The slowness came from the  `get_boundaries` routine which linearly loop over each row in the Pandas DataFrame (note: there's just one block in the script below, which had multiple million rows). We make it 1) operate on numpy arrow,  2) use binary search and 3) use native impl of bsearch from numpy.

```
import argparse
import time
import ray
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow as pa
import ray
 
def transform_batch(df: pd.DataFrame):
   # Drop nulls.
   df['pickup_at'] =  pd.to_datetime(df['pickup_at'], format='%Y-%m-%d %H:%M:%S')
   df['dropoff_at'] =  pd.to_datetime(df['dropoff_at'], format='%Y-%m-%d %H:%M:%S')
   df['trip_duration'] = (df['dropoff_at'] - df['pickup_at']).dt.seconds
   df['pickup_location_id'].fillna(-1, inplace = True)
   df['dropoff_location_id'].fillna(-1, inplace = True)
   return df
 
def train_test(rows):
 # if the group is too small, it cannot be split for train/test
 if len(rows.index) < 4:
   print(f"Dataframe for LocID: {rows.index} is empty")
 else:
   train, test = train_test_split(rows)
   train_X = train[["dropoff_location_id"]]
   train_y = train[['trip_duration']]
   test_X = test[["dropoff_location_id"]]
   test_y = test[['trip_duration']]
   reg = LinearRegression().fit(train_X, train_y)
   reg.score(train_X, train_y)
   pred_y = reg.predict(test_X)
   reg.score(test_X, test_y)
   error = np.mean(pred_y-test_y)
   # format output in dataframe (the same format as input)
   data = [[reg.coef_, reg.intercept_, error]]
   return pd.DataFrame(data, columns=["coef", "intercept", "error"])
 
start = time.time()
rds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/01/", columns=['pickup_at', 'dropoff_at', "pickup_location_id", "dropoff_location_id"])
rds = rds.map_batches(transform_batch, batch_format="pandas")
grouped_ds = rds.groupby("pickup_location_id")
results = grouped_ds.map_groups(train_test)
taken = time.time() - start
```
2022-08-17 11:08:18 -07:00
Scott Graham
5567a38a70
Adding unique id to azure template to enable multiple clusters per resource group. Using unique id to set subnet random seed, change msi and vnet names, logging unique id, and adding it to filter vms in cluster. Example template files updated with comments. (#26392)
Why are these changes needed?
Adding support for deploying multiple clusters into the same azure resource group

Changes:

Adding unique_id field to provider section of yaml, if not provided one will be created based on hashing the resource group and cluster name. This will be appended to the name of all resources deployed to azure so they can co-exist in the same resource group (provided the cluster name is changed)
Pulled in changes from [autoscaler] Enable creating multiple clusters in one resource group … #22997 to use cluster name when filtering vms to retrieve nodes only in the current cluster
Added option to explicitly specify the subnet mask, otherwise use the resource group and cluster name as a seed and randomly choose a subnet to use for the vnet (to avoid collisions with other vnets)
Updated yaml example files with new provider values and explanations
Pulling resource_ids from initial azure-config-template deployment to pass into vm deployment to avoid matching hard-coded resource names across templates
Related issue number
Closes #22996
Supersedes #22997

Signed-off-by: Scott Graham <scgraham@microsoft.com>

Signed-off-by: Scott Graham <scgraham@microsoft.com>
Co-authored-by: Scott Graham <scgraham@microsoft.com>
2022-08-17 09:24:26 -07:00
Antoni Baum
d449f8db27
[CI] Update upstream requirements for XGB/LGBM-Ray (#27908)
To include these in the latest docker images (and get rid of deprecation warnings), bump in requirements_upstream.txt.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-17 10:55:02 +02:00
Nikita Vemuri
4692e8d802
[core] Don't override external dashboard URL in internal KV store (#27901)
Fix 2.0.0 release blocker bug where Ray State API and Jobs not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides in values printed or returned to the user.
images.githubusercontent.com/6900234/184809934-8d150874-90fe-4b45-a13d-bce1807047de.png">
2022-08-16 22:48:05 -07:00
Yi Cheng
87ce8480ff
[core] Add stats for the gcs backend for telemetry. (#27876)
## Why are these changes needed?

To get better understanding of how GCS FT is used, adding this metrics.

Test:
```
cat /tmp/ray/session_latest/usage_stats.json
{"usage_stats": {"ray_version": "3.0.0.dev0", "python_version": "3.9.12", "schema_version": "0.1", "source": "OSS", "session_id": "70d3ecd3-5b16-40c3-9301-fd05404ea92a", "git_commit": "{{RAY_COMMIT_SHA}}", "os": "linux", "collect_timestamp_ms": 1660587366806, "session_start_timestamp_ms": 1660587351586, "cloud_provider": null, "min_workers": null, "max_workers": null, "head_node_instance_type": null, "worker_node_instance_types": null, "total_num_cpus": 16, "total_num_gpus": null, "total_memory_gb": 16.10752945020795, "total_object_store_memory_gb": 8.053764724172652, "library_usages": ["serve"], "total_success": 0, "total_failed": 13, "seq_number": 13, "extra_usage_tags": {"serve_api_version": "v1", "gcs_storage": "redis", "serve_num_deployments": "1"}, "total_num_nodes": 2, "total_num_running_jobs": 2}}
```
2022-08-16 17:02:04 -07:00
Antoni Baum
7ff914b06e
[AIR][Docs] Set logging_strategy="epoch" for HF (#27917) 2022-08-16 16:45:46 -07:00
SangBin Cho
75051278d7
Fix the undocumented ray log error (#27887)
Looks like hidden=True commands cannot be documented on sphinx. I removed the add_alias and use the standard click API to rename the API from the name of the method
2022-08-16 14:28:09 -07:00
Richard Liaw
759fbd9502
[air][minor] Use drop_columns in docs (#27852) 2022-08-16 14:01:25 -07:00
Antoni Baum
5757909cd2
[AIR] load_best_model_at_end validation for HF (#27875)
Adds validation for TrainingArguments.load_best_model_at_end (will throw an error down the line if set to True), fixes validation for *_steps, adds test.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-16 10:52:05 -07:00
Peyton Murray
4d19c0222b
[AIR] Add rich notebook repr for DataParallelTrainer (#26335) 2022-08-16 08:51:14 -07:00
Dmitri Gekhtman
bceef503b2
[Kubernetes][docs] Restore legacy Ray operator migration discussion (#27841)
This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator.

To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code.

These notes are linked from the new docs.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2022-08-16 08:46:31 -07:00
xwjiang2010
91f506304d
[air] [checkpoint manager] handle nested metrics properly as scoring attribute. (#27715)
handle nested metrics properly as scoring attribute

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-16 17:43:58 +02:00
Alex Wu
c2abfdb2f7
[autoscaler][observability] Observability into when/why nodes fail to launch (#27697)
This change adds launch failures to the recent failures section of ray status when a node provider provides structured error information. For node providers which don't provide this optional information, there is now change in behavior.

For reference, when trying to launch a node type with a quota issue, it looks like the following. InsufficientInstanceCapacity is the standard term for this issue..

```
======== Autoscaler status: 2022-08-11 22:22:10.735647 ========
Node status
---------------------------------------------------------------
Healthy:
 1 cpu_4_ondemand
Pending:
 quota, 1 launching
Recent failures:
 quota: InsufficientInstanceCapacity (last_attempt: 22:22:00)

Resources
---------------------------------------------------------------
Usage:
 0.0/4.0 CPU
 0.00/9.079 GiB memory
 0.00/4.539 GiB object_store_memory

Demands:
 (no resource demands)
```

```
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m4.xlarge
            ImageId: latest_dlami
        resources: {}
        min_workers: 0
        max_workers: 0
    quota:
        node_config:
            InstanceType: p4d.24xlarge
            ImageId: latest_dlami
        resources: {}
        min_workers: 1
        max_workers: 1
```
Co-authored-by: Alex <alex@anyscale.com>
2022-08-15 18:14:29 -07:00
Jiajun Yao
06ef4ab94e
Fix broken links in the code (#27873)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-15 13:11:42 -07:00
Archit Kulkarni
058c239cf1
[runtime env] Test common failure scenarios (#25977)
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)

Improves the following error message cases:
- Return RuntimeEnvSetupError on failure to upload working_dir or py_modules
- Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup

Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)

The approach is to use environment variables to induce failures in various places.  The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be.  I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
2022-08-15 11:35:56 -05:00
SangBin Cho
d654636bfc
[Test] Fix broken test_base_trainer (#27855)
The test was written incorrectly. This root cause was that the trainer & worker both requires 1 CPU, meaning pg requires {CPU: 1} * 2 resources.

And when the max fraction is 0.001, we only allow up to 1 CPU for pg, so we cannot schedule the requested pgs in any case.
2022-08-15 07:50:18 -07:00
SangBin Cho
9ece110d27
[State Observability] Promote the API to alpha (#27788)
# Why are these changes needed?

- Promote APIs to PublicAPI(alpha)
- Change pre-alpha -> alpha
- Fix a bug ray_logs is displayed to ray --help

Release test result: #26610
Some APIs are subject to change at the beta stage (e.g., ray list jobs or ray logs).
2022-08-13 23:43:01 -07:00
SangBin Cho
999715ebec
[Core][Placement Group] Handling edge cases of max_cpu_fraction argument (#27035)
Why are these changes needed?
This PR fixes the edge cases when the max_cpu_fraction argument is used by the placement group. There was specifically an edge case where the placement group cannot be scheduled when a task or actor is scheduled and occupies the resource.

The original logic to decide if the bundle scheduling exceed CPU fraction was as follow.

Calculate max_reservable_cpus of the node.
Calculate currently_used_cpus + bundle_cpu_request (per bundle) == total_allocation of the node.
Don't schedule if total_allocation > max_reservable_cpus for the node.
However, the following scenarios caused issues because currently_used_cpus can include resources that are not allocated by placement groups (e.g., actors). As a result, when the actor was already occupying the resource, the total_allocation was incorrect. For example,

4 CPUs
0.999 max fraction (so it can reserve up to 3 CPUs)
1 Actor already created (1 CPU)
PG with CPU: 3
Now pg cannot be scheduled because total_allocation == 1 actor (1 CPU) + 3 bundles (3 CPUs) == 4 CPUs > 3 CPUs (max frac CPUs)
However, this should work because the pg can use up to 3 CPUs, and we have enough resources.
The root cause is that when we calculate the max_fraction, we should only take into account of resources allocated by bundles. To fix this, I change the logic as follow.

Calculate max_reservable_cpus of the node.
Calculate **currently_used_cpus_by_pg_bundles** + **bundle_cpu_request (sum of all bundles)** == total_allocation_from_pgs_and_bundles of the node.
Don't schedule if total_allocation_from_pgs_and_bundles > max_reservable_cpus for the node.
2022-08-12 17:40:11 -07:00
Sihan Wang
7e7c93f6ba
[Serve] Fix memory leak issue in serve inference (#27815) 2022-08-12 17:11:37 -07:00
zcin
8cb09a9fc5
Revert "Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications"" (#27662) 2022-08-12 15:12:20 -07:00
Clark Zinzow
f0404e00cd
[Core] [Hotfix] Change "task failed with unretryable exception" log statement to debug-level. (#27714)
Serve relies on being able to do quiet application-level retries, and this info-level logging is resulting in log spam hitting users. This PR demotes this log statement to debug-level to prevent this log spam.

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-08-12 11:28:49 -07:00
Cheng Su
7c7828f818
[Datasets] Improve size estimation of image folder data source (#27219)
This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in https://github.com/ray-project/ray/pull/27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```
2022-08-12 11:26:03 -07:00
matthewdeng
58495fe594
[data][docs] fix broken links (#27818) 2022-08-12 11:17:34 -07:00
Archit Kulkarni
6c45625d6d
[runtime env] [CI] Skip flaky test_runtime_env_working_dir_2 tests on mac (#27799) 2022-08-12 09:39:19 -07:00
Archit Kulkarni
518c74020c
[Serve] [Doc] Serve add API ref for Deployment.bind() and serve.build (#27811) 2022-08-12 09:38:58 -07:00
Simon Mo
bf9f0621b9
[Serve] Minor fix to replica shutdown (#27778) 2022-08-12 09:33:08 -07:00
Simon Mo
0badbb8b1e
[Serve][docs] Refresh http-guide (#27779)
- Moved most code snippet to doc_code
- Added section about DAGDriver
- Added section discussing when should you use each abstraction layer.
2022-08-12 11:06:36 -05:00
shrekris-anyscale
e15960ed7e
[Serve] [Docs] Update the "Monitoring Ray Serve" Page (#27777)
The "Monitoring Ray Serve" page explains how to inspect your Ray Serve applications. This change updates the page to remove outdated metrics that Serve no longer exposes and to upgrade code samples to use 2.0 APIs. It also improves the content's readability and organization.

Link to updated "Monitoring Ray Serve" page: https://ray--27777.org.readthedocs.build/en/27777/serve/monitoring.html
2022-08-12 11:05:31 -05:00