Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0).
Also removes a couple of old files.
Will open a corresponding cherry pick in the Ray 2.0.0 branch.
The key thing to verify is that the CI autoscaling test passes here and in the corresponding PR against the 2.0.0 branch.
This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things:
- Less verbose logging for node provider exceptions (printed to logs only, not to the driver).
- Don't print "adding 1 node(s) of type ..." to the driver when nodes fail to launch (still print it if the node launch is successful).
- Print "Failed to launch ..." to the driver.
- Don't log a full exception to the driver.
The full driver event looks like this:
```
Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c.
```
Co-authored-by: Alex <alex@anyscale.com>
Take the CLI reference out of the core API subsection. It follows the same CLI reference pattern as the other libraries (e.g., Serve has the Serve CLI under the Serve API section).
There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster's lifetime.
This PR puts a cap on how many stats to keep, and purges stats in FIFO order when the cap is exceeded.
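A minimal sketch of the capping idea (the class name and cap value here are illustrative, not the actual StatsActor implementation):

```python
from collections import OrderedDict

# Illustrative cap; the real limit used by StatsActor may differ.
MAX_STATS_ENTRIES = 1000


class CappedStats:
    """Keep at most `max_entries` stats, purging the oldest (FIFO) when full."""

    def __init__(self, max_entries: int = MAX_STATS_ENTRIES):
        self._max_entries = max_entries
        self._stats = OrderedDict()

    def record(self, dataset_id: str, stats: dict) -> None:
        self._stats[dataset_id] = stats
        # Purge in FIFO (insertion) order once the cap is exceeded.
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)

    def get(self, dataset_id: str):
        return self._stats.get(dataset_id)
```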
This PR adds a customized serializer for Arrow JSON ParseOptions in read_json. We found that users want to read JSON files with ParseOptions, but it currently doesn't work due to a pickle issue (see the linked post for details). So we add a customized serializer for ParseOptions as a workaround for now, similar to #25821.
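For illustration, a hedged sketch of the pattern using Ray's public custom-serializer hook; the exact fields handled in the PR may differ:

```python
import pyarrow.json as pajson
import ray
from ray.util import register_serializer

# Ray must be initialized before registering a custom serializer.
ray.init()


def _serialize_parse_options(opts: pajson.ParseOptions) -> dict:
    # Capture the plain-Python fields; the options object itself doesn't pickle cleanly.
    return {
        "explicit_schema": opts.explicit_schema,
        "newlines_in_values": opts.newlines_in_values,
        "unexpected_field_behavior": opts.unexpected_field_behavior,
    }


def _deserialize_parse_options(fields: dict) -> pajson.ParseOptions:
    return pajson.ParseOptions(**fields)


register_serializer(
    pajson.ParseOptions,
    serializer=_serialize_parse_options,
    deserializer=_deserialize_parse_options,
)
```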
Signed-off-by: Cheng Su <scnju13@gmail.com>
Move the code to doc_code.
Fix the code example so that batching is faster than the serial run.
Related issue number
#27048
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
An attempt at making the docs shorter and sweeter, including various small cleanup items:
- Reorder the TOC on the sidebar for the user guides to be more linear based on a user's journey.
- Put the batching content under the performance guide.
- Remove the AIR guide (AIR users already have a serving guide).
- Combine the `ServeHandle` and model composition pages into a single guide. We may want to revisit this in the future but for now better to have it in a single place instead of duplicated (with links going to both).
- Fix the index page for the user guides to match the TOC sidebar.
- Rename a few pages for clarity & consistency.
- Remove some now-redundant content (old ML models user guide).
For the following script, the groupby().map_groups() took 75-90 minutes to finish before; with this PR it finishes in less than 10 seconds.
The slowness came from the `get_boundaries` routine, which looped linearly over each row of the pandas DataFrame (note: there is just one block in the script below, with several million rows). We make it 1) operate on numpy arrays, 2) use binary search, and 3) use numpy's native binary search implementation (a sketch of the boundary computation follows the script).
```
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import ray


def transform_batch(df: pd.DataFrame):
    # Parse timestamps, derive the trip duration, and fill null location IDs.
    df['pickup_at'] = pd.to_datetime(df['pickup_at'], format='%Y-%m-%d %H:%M:%S')
    df['dropoff_at'] = pd.to_datetime(df['dropoff_at'], format='%Y-%m-%d %H:%M:%S')
    df['trip_duration'] = (df['dropoff_at'] - df['pickup_at']).dt.seconds
    df['pickup_location_id'].fillna(-1, inplace=True)
    df['dropoff_location_id'].fillna(-1, inplace=True)
    return df


def train_test(rows):
    # If the group is too small, it cannot be split into train/test sets.
    if len(rows.index) < 4:
        print(f"Dataframe for LocID: {rows.index} is too small to split")
    else:
        train, test = train_test_split(rows)
        train_X = train[["dropoff_location_id"]]
        train_y = train[['trip_duration']]
        test_X = test[["dropoff_location_id"]]
        test_y = test[['trip_duration']]
        reg = LinearRegression().fit(train_X, train_y)
        reg.score(train_X, train_y)
        pred_y = reg.predict(test_X)
        reg.score(test_X, test_y)
        error = np.mean(pred_y - test_y)
        # Format the output as a dataframe (the same format as the input).
        data = [[reg.coef_, reg.intercept_, error]]
        return pd.DataFrame(data, columns=["coef", "intercept", "error"])


start = time.time()
rds = ray.data.read_parquet(
    "s3://ursa-labs-taxi-data/2019/01/",
    columns=['pickup_at', 'dropoff_at', "pickup_location_id", "dropoff_location_id"],
)
rds = rds.map_batches(transform_batch, batch_format="pandas")
grouped_ds = rds.groupby("pickup_location_id")
results = grouped_ds.map_groups(train_test)
taken = time.time() - start
```
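A minimal sketch of the boundary-computation change, assuming a key column already sorted by the groupby (the function name here is illustrative, not the actual `get_boundaries` code): instead of walking the rows one by one, the key column is converted to a numpy array and the group boundaries are found with numpy's built-in binary search.

```
import numpy as np
import pandas as pd


def get_group_boundaries(df: pd.DataFrame, key: str) -> np.ndarray:
    """Return the start index of each group in a DataFrame sorted by `key`."""
    keys = df[key].to_numpy()      # operate on a numpy array, not row by row
    unique_keys = np.unique(keys)  # unique_keys comes back sorted
    # np.searchsorted performs a vectorized binary search for every unique key at once.
    return np.searchsorted(keys, unique_keys, side="left")


df = pd.DataFrame({"pickup_location_id": [1, 1, 1, 2, 2, 5, 5, 5, 5]})
print(get_group_boundaries(df, "pickup_location_id"))  # [0 3 5]
```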
Why are these changes needed?
Adding support for deploying multiple clusters into the same Azure resource group.
Changes:
- Added a unique_id field to the provider section of the yaml. If one is not provided, it is created by hashing the resource group and cluster name (see the sketch after this list). The unique_id is appended to the names of all resources deployed to Azure so that multiple clusters can co-exist in the same resource group (provided the cluster names differ).
- Pulled in changes from "[autoscaler] Enable creating multiple clusters in one resource group …" #22997 to use the cluster name when filtering VMs, so that only nodes in the current cluster are retrieved.
- Added an option to explicitly specify the subnet mask; otherwise the resource group and cluster name are used as a seed to randomly choose a subnet for the vnet (to avoid collisions with other vnets).
- Updated the yaml example files with the new provider values and explanations.
- Pulled resource_ids from the initial azure-config-template deployment and passed them into the VM deployment, to avoid matching hard-coded resource names across templates.
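A hedged sketch of how a deterministic unique_id could be derived when the user does not supply one (illustrative only; the PR's exact hashing and truncation may differ):

```
import hashlib


def default_unique_id(resource_group: str, cluster_name: str, length: int = 8) -> str:
    """Derive a short, deterministic suffix from the resource group and cluster name."""
    digest = hashlib.sha256(f"{resource_group}/{cluster_name}".encode()).hexdigest()
    return digest[:length]


# Appended to resource names so two clusters can share one resource group, e.g.
# ray-vnet-<unique_id>, ray-nsg-<unique_id>, ...
print(default_unique_id("my-rg", "ray-cluster-a"))
```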
Related issue number
Closes #22996
Supersedes #22997
Signed-off-by: Scott Graham <scgraham@microsoft.com>
Co-authored-by: Scott Graham <scgraham@microsoft.com>
To include these in the latest Docker images (and get rid of deprecation warnings), bump them in requirements_upstream.txt.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
# Why are these changes needed?
(map pid=516, ip=172.31.64.223) E0526 12:32:19.203322360 675 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". See [this](https://github.com/ray-project/ray/issues/25367#issuecomment-1189421372) for more details.
We currently see this in many of the large nightly tests.
# Root Cause
The root cause (with pretty high confidence) is a misconfiguration between the gRPC server and its clients: essentially, a client is pinging the server too frequently with keep-alive heartbeats.
# Mitigation
This PR is merely a mitigation step. I will keep looking into the exact client/server pair later, but I probably don't have the bandwidth for now, largely because each test iteration takes quite a while and verbose gRPC and Ray backend logging has not revealed much useful information. The error only kicks in at the end of a long-running map phase, and verbose logging doesn't tell me which client is sending the pings.
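For context, the keep-alive ping behavior that triggers `too_many_pings` is governed by gRPC channel options like the ones below; this is a generic illustration of the knobs involved, not the exact values changed in this PR:

```
import grpc

# Client side: ping less aggressively and only while calls are in flight.
channel = grpc.insecure_channel(
    "server:50051",
    options=[
        ("grpc.keepalive_time_ms", 60000),           # send a keepalive ping at most once a minute
        ("grpc.keepalive_permit_without_calls", 0),  # no pings when there are no active calls
    ],
)
```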
Fix a 2.0.0 release blocker bug where the Ray State API and Jobs are not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides it in values printed or returned to the user.
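A hypothetical sketch of the approach (the names below are illustrative, not Ray's actual internals): the canonical localhost URL stays in the internal KV store, and the user-supplied override is substituted only when the URL is printed or returned.

```
from typing import Optional

# Stored in the internal KV store and used by internal consumers (State API, Jobs).
INTERNAL_DASHBOARD_URL = "http://127.0.0.1:8265"


def display_dashboard_url(override_url: Optional[str] = None) -> str:
    """URL shown to the user; internal callers keep using the localhost URL."""
    return override_url or INTERNAL_DASHBOARD_URL
```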
- Adds KubeRay information to the production guide.
- Consolidates the two user guides we had related to production deployment.
- Adds information about the experimental GCS HA feature.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Why are these changes needed?
This PR updates the workflow doc to reflect the recent changes.
It focuses mainly on repositioning content, plus other smaller changes.
It looks like hidden=True commands cannot be documented with Sphinx. I removed the add_alias call and used the standard Click API to rename the command, instead of deriving the name from the method.
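A minimal sketch of the Click pattern, using a hypothetical `status` command: the public command name is set through Click's standard `name` argument (which Sphinx can document) rather than being derived from the Python function name and aliased.

```
import click


@click.group()
def cli():
    pass


# The public name is passed to Click directly, so documentation tooling sees "status"
# instead of the underlying function name.
@cli.command(name="status")
def job_status_command():
    """Show the status of a job."""
    click.echo("RUNNING")
```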
Different metrics are collected in Ray Serve depending on whether deployments are called from HTTP or from Python. This needs to be mentioned in the documentation, and each metric marked accordingly.
Enables better usage with GCP.
The default behavior is that the head node runs with the ray-autoscaler-sa-v1 service account, but workers do not. Workers can run with this service account by copying & uncommenting L114->L117 from example-full.
Signed-off-by: Ian <ian.rodney@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>