Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.
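One way such a timeout could look is sketched below. This is only an illustration, not the code in this PR: `run_with_timeout` is a hypothetical helper name, and running the call in a daemon thread is an assumption about how a hanging sync could be abandoned.

```python
import threading
import time


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread (hypothetical helper).

    Raises TimeoutError if the call does not finish within timeout_s seconds.
    The worker thread is a daemon, so a hung call cannot block interpreter exit.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # forward any failure to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise TimeoutError(f"Operation did not finish within {timeout_s} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note that the hung call keeps running in the background in this sketch; it is only abandoned, not cancelled.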
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Adds back more Ray Train APIs to Ray Train docs.
Also makes updates to the user guide for better references.
Speeds up HuggingFaceTrainer/Predictor tests in CI by around 20% by switching to a different GPT model. This is the same model the Hugging Face team uses for their own CI.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
CI is red because of a dependency issue around dataclass_transform.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
These changes are part of a series intended to improve integration with notebooks. This PR modifies the Tune progress status shown to the user if tuning is run from a notebook.
Previously, only part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template.
Signed-off-by: pdmurray <peynmurray@gmail.com>
PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
This starts breaking the Mac Java build with new errors; I think it is the same issue that made us revert this PR before.
…ment from Java. …" (#27945)"
This reverts commit af488e1.
If this passes, it should be preferred over #28098.
Adjust moto setup to use new API.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
The API cleanup in #27060 introduced a regression when merging latest master - changes from #26967 were effectively disabled, retaining cluttered output in rllib with verbose=2.
Signed-off-by: Kai Fricke <kai@anyscale.com>
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.
The main problems here are:
1. Tracebacks are printed multiple times: In the remote worker and on the driver
2. Tracebacks include many internal wrappers
The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).
The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.
### Deduplicating traceback printing
The solution here is to use `logger.error` instead of `logger.exception` in the `function_trainable.py` to avoid printing a traceback in the trainable.
Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).
To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default.
### Removing internal wrappers from tracebacks
The solution here is to introduce a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stacktrace. A utility `shorten_tb` looks for the last occurrence of `_ray_start_tb` in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned. This means that if the error does not come up in user code, the full traceback is returned, giving visibility into possible internal bugs. Additionally, there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening.
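The marker lookup described above can be sketched roughly as follows. This is a minimal illustration and not the actual Ray implementation: the function body, the choice to start at the marker frame itself, and the demo functions (`user_code`, `internal_wrapper`, `outer`, `demo_frames`) are assumptions made for the example.

```python
import sys
import traceback

MARKER = "_ray_start_tb"


def shorten_tb(tb):
    """Return the traceback starting at the last frame that defines MARKER.

    Walks the traceback chain once (linear time). If no frame has the
    marker as a local variable, the full traceback is returned unchanged.
    """
    start = None
    current = tb
    while current is not None:
        if MARKER in current.tb_frame.f_locals:
            start = current
        current = current.tb_next
    return start if start is not None else tb


def user_code():
    raise ValueError("boom")


def internal_wrapper():
    _ray_start_tb = True  # magic local variable marking the cut-off point
    user_code()


def outer():
    internal_wrapper()


def demo_frames():
    """Return the function names left in the shortened traceback."""
    try:
        outer()
    except ValueError:
        short = shorten_tb(sys.exc_info()[2])
        return [frame.name for frame in traceback.extract_tb(short)]
```

In this sketch, `demo_frames()` drops the `demo_frames` and `outer` frames and keeps everything from `internal_wrapper` onwards.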
Signed-off-by: Kai Fricke <kai@anyscale.com>
According to https://peps.python.org/pep-0338/
> The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory contain the main module.
We should follow this and not add the driver script directory to the worker's sys.path. I couldn't find a way to detect that the driver is run via `python -m`, so instead we don't add the script directory to the worker's sys.path if it doesn't exist in the driver's sys.path.
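The check described above could look roughly like this. It is a sketch only: `worker_sys_path_additions` is a hypothetical helper name and does not correspond to a function in Ray.

```python
import os
from typing import List


def worker_sys_path_additions(script_path: str, driver_sys_path: List[str]) -> List[str]:
    """Return the directories to add to a worker's sys.path (hypothetical helper).

    The script directory is only propagated if the driver itself has it on
    sys.path. With `python -m`, the current directory is on sys.path instead,
    so nothing is added.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    driver_dirs = {os.path.abspath(entry) for entry in driver_sys_path}
    return [script_dir] if script_dir in driver_dirs else []
```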
Some users want better control over node failures and want to handle errors such as RayActorError themselves. For them, I think it's necessary to expose things like actor_id as attributes of the error.
Signed-off-by: Jiajie Li <ljjsalt@gmail.com>
Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0).
Also removes a couple of old files.
Will open a corresponding cherry pick in the Ray 2.0.0 branch.
The key thing to verify is that the CI autoscaling test passes here and in the PR against the 2.0.0 branch.
This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things:
1. Less verbose logging for node provider exceptions (printed to logs only, not driver)
2. Don't print to driver "adding 1 node(s) of type ..." when nodes don't launch (still print it if the node launch is successful).
3. Print to driver "Failed to launch ..."
4. Don't log a full exception to the driver.
The full driver event looks like this:
```
Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c.
```
Co-authored-by: Alex <alex@anyscale.com>
There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster lifetime.
This puts a cap on how many stats to keep, and purges the stats in FIFO order if this cap is exceeded.
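A FIFO cap of this kind can be sketched as below. This is an illustration only: `BoundedStatsStore` and its methods are hypothetical names and not the StatsActor code.

```python
from collections import OrderedDict


class BoundedStatsStore:
    """Keeps at most max_entries stats and evicts the oldest first (hypothetical)."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._stats = OrderedDict()

    def record(self, key, value):
        # Updating an existing key keeps its original insertion position,
        # so eviction order stays first-in, first-out.
        self._stats[key] = value
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        return self._stats.get(key, default)

    def keys(self):
        return list(self._stats)

    def __len__(self):
        return len(self._stats)
```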
This PR adds a customized serializer for Arrow JSON ParseOptions for read_json. We found that users wanted to read JSON files with ParseOptions, but this currently does not work due to a pickle issue (detail of post). So here we add a customized serializer for ParseOptions as a workaround for now, similar to #25821.
Signed-off-by: Cheng Su <scnju13@gmail.com>
For the following script, it took 75-90 mins to finish the groupby().map_groups() before, and with this PR it finishes in less than 10 seconds.
The slowness came from the `get_boundaries` routine, which linearly loops over each row in the Pandas DataFrame (note: there's just one block in the script below, which had multiple million rows). We make it 1) operate on numpy arrays, 2) use binary search and 3) use the native binary search implementation from numpy.
```
import argparse
import time
import ray
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow as pa


def transform_batch(df: pd.DataFrame):
    # Parse timestamps, derive the trip duration, and fill null location IDs with -1.
    df['pickup_at'] = pd.to_datetime(df['pickup_at'], format='%Y-%m-%d %H:%M:%S')
    df['dropoff_at'] = pd.to_datetime(df['dropoff_at'], format='%Y-%m-%d %H:%M:%S')
    df['trip_duration'] = (df['dropoff_at'] - df['pickup_at']).dt.seconds
    df['pickup_location_id'].fillna(-1, inplace = True)
    df['dropoff_location_id'].fillna(-1, inplace = True)
    return df


def train_test(rows):
    # if the group is too small, it cannot be split for train/test
    if len(rows.index) < 4:
        print(f"Dataframe for LocID: {rows.index} is empty")
    else:
        train, test = train_test_split(rows)
        train_X = train[["dropoff_location_id"]]
        train_y = train[['trip_duration']]
        test_X = test[["dropoff_location_id"]]
        test_y = test[['trip_duration']]
        reg = LinearRegression().fit(train_X, train_y)
        reg.score(train_X, train_y)
        pred_y = reg.predict(test_X)
        reg.score(test_X, test_y)
        error = np.mean(pred_y-test_y)
        # format output in dataframe (the same format as input)
        data = [[reg.coef_, reg.intercept_, error]]
        return pd.DataFrame(data, columns=["coef", "intercept", "error"])


start = time.time()
rds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/01/", columns=['pickup_at', 'dropoff_at', "pickup_location_id", "dropoff_location_id"])
rds = rds.map_batches(transform_batch, batch_format="pandas")
grouped_ds = rds.groupby("pickup_location_id")
results = grouped_ds.map_groups(train_test)
taken = time.time() - start
```
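The binary-search idea behind the speedup can be sketched as follows. This is a simplified illustration rather than the actual `get_boundaries` code: `find_group_boundaries` is a hypothetical name, and the sketch assumes the key column is already sorted and available as a numpy array.

```python
import numpy as np


def find_group_boundaries(sorted_keys: np.ndarray) -> list:
    """Return the end offset of each run of equal keys in a sorted array.

    Instead of visiting every row, each group is skipped with one binary
    search (np.searchsorted), so the cost is roughly
    O(number_of_groups * log(number_of_rows)).
    """
    boundaries = []
    start = 0
    n = len(sorted_keys)
    while start < n:
        # Index just past the last occurrence of the current key.
        end = int(np.searchsorted(sorted_keys, sorted_keys[start], side="right"))
        boundaries.append(end)
        start = end
    return boundaries
```

For example, for the sorted keys `[1, 1, 2, 2, 2, 5]` the groups end at offsets 2, 5 and 6.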
Why are these changes needed?
Adding support for deploying multiple clusters into the same azure resource group
Changes:
Adding unique_id field to provider section of yaml, if not provided one will be created based on hashing the resource group and cluster name. This will be appended to the name of all resources deployed to azure so they can co-exist in the same resource group (provided the cluster name is changed)
Pulled in changes from [autoscaler] Enable creating multiple clusters in one resource group … #22997 to use cluster name when filtering vms to retrieve nodes only in the current cluster
Added option to explicitly specify the subnet mask, otherwise use the resource group and cluster name as a seed and randomly choose a subnet to use for the vnet (to avoid collisions with other vnets)
Updated yaml example files with new provider values and explanations
Pulling resource_ids from initial azure-config-template deployment to pass into vm deployment to avoid matching hard-coded resource names across templates
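The derivation of the default unique_id and the seeded subnet choice could look roughly like the sketch below. All names, the hash function, the ID length, and the 10.x.0.0/16 address range are assumptions made for illustration and may differ from what the Azure node provider actually does.

```python
import hashlib
import random


def default_unique_id(resource_group: str, cluster_name: str, length: int = 4) -> str:
    """Derive a short, stable ID from the resource group and cluster name (hypothetical)."""
    digest = hashlib.sha256(f"{resource_group}-{cluster_name}".encode()).hexdigest()
    return digest[:length]


def pick_subnet(resource_group: str, cluster_name: str) -> str:
    """Pick a pseudo-random subnet, seeded so the same cluster always gets the same one."""
    rng = random.Random(f"{resource_group}-{cluster_name}")
    # Assumed address range for the example; collisions are less likely, not impossible.
    return f"10.{rng.randint(1, 254)}.0.0/16"
```

Because both values depend only on the resource group and cluster name, redeploying the same cluster yields the same resource names and subnet, while a cluster with a different name most likely gets different ones.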
Related issue number
Closes #22996
Supersedes #22997
Signed-off-by: Scott Graham <scgraham@microsoft.com>
Co-authored-by: Scott Graham <scgraham@microsoft.com>
To include these in the latest docker images (and get rid of deprecation warnings), bump in requirements_upstream.txt.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Fixes a 2.0.0 release blocker bug where the Ray State API and Jobs are not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only applies the override to values printed or returned to the user.
It looks like commands with hidden=True cannot be documented in Sphinx. I removed add_alias and now use the standard click API to give the command a name that differs from the method name.