Commit graph

6729 commits

Sriram Sankar
608aa771b9
Clean up interaction between Autoscaler and Kuberay (#23428)
This PR updates the KuberayNodeProvider for a more robust interaction with the KubeRay operator.
2022-04-12 14:31:27 -07:00
Siyuan (Ryans) Zhuang
ef7180365d
[serialization] Enable debugging into pickle backend (#23854)
* enable debugging cloudpickle
2022-04-12 13:48:35 -07:00
Kai Fricke
4cb6205726
[tune] Fix empty CSV headers on trial restart (#23860)
What: Only open (create) CSV files when actually reporting results.
Why: When trials crash before they report first (e.g. on init), they will have created an empty CSV file. When results are subsequently written, the CSV header is then missing.
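A minimal sketch of the lazy-open pattern (class and method names here are illustrative, not the actual Tune logger API):

```
import csv

class LazyCSVLogger:
    def __init__(self, path):
        self._path = path
        self._file = None
        self._writer = None

    def on_result(self, result: dict):
        # Only create the file once the first result arrives, so trials
        # that crash before reporting never leave an empty, header-less CSV.
        if self._file is None:
            self._file = open(self._path, "w", newline="")
            self._writer = csv.DictWriter(self._file, fieldnames=list(result))
            self._writer.writeheader()
        self._writer.writerow(result)
        self._file.flush()
```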
2022-04-12 21:05:29 +01:00
Antoni Baum
ff60ebd4b3
[tune] Fix memory resources for head bundle (#23861)
Fixes memory and object_store_memory actor options not being set properly for the Tune trainable.
2022-04-12 20:56:05 +01:00
Kai Fricke
c30491d6ef
[tune] Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
What: Skips left-over checkpoint_tmp* directories when loading experiment analysis. Also loads the iteration number from the metadata file rather than parsing the checkpoint directory name.

Why: Sometimes temporary checkpoint directories are not deleted correctly when restoring (e.g. when interrupted). In these cases, they shouldn't be included in experiment analysis. Parsing their iteration number also failed, and should generally be done by reading the metadata file, not by inferring it from the directory name.
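Roughly, the loading logic now looks like this (the directory layout and metadata filename/format are assumptions for illustration):

```
import json
import os

def list_checkpoints(trial_dir):
    checkpoints = []
    for name in os.listdir(trial_dir):
        # Skip left-over temporary checkpoint directories.
        if not name.startswith("checkpoint_") or name.startswith("checkpoint_tmp"):
            continue
        # Read the iteration from a metadata file instead of parsing the
        # directory name (a JSON file is assumed here).
        with open(os.path.join(trial_dir, name, "metadata.json")) as f:
            iteration = json.load(f)["iteration"]
        checkpoints.append((iteration, os.path.join(trial_dir, name)))
    return sorted(checkpoints)
```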
2022-04-12 17:09:03 +01:00
Kai Fricke
416cfb8753
[tune] Fix syncing between nodes in placement groups (#23864)
Break out of placement groups to make syncing work in tune/train trials.
2022-04-12 17:06:19 +01:00
Kai Fricke
7eb3543e93
[tune] Chunk file transfers in cross-node checkpoint syncing (#23804)
What: This introduces a general utility to synchronize directories between two nodes, derived from the RemoteTaskClient. This implementation uses chunked transfers for more efficient communication.

Why: Transferring files over 2GB in size leads to superlinear time complexity in some setups (e.g. local macbooks). This could be due to memory limits, swapping, or gRPC limits, and is explored in a different thread. To overcome this limitation, we use chunked data transfers which show quasi-linear scalability for larger files.
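A minimal sketch of the chunking idea, assuming a hypothetical reader actor pinned to the source node (chunk size and all names are illustrative):

```
import ray

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB keeps each transfer well-bounded

@ray.remote
class ChunkedFileReader:
    """Runs on the source node and hands out one bounded chunk per call."""

    def __init__(self, path):
        self._file = open(path, "rb")

    def next_chunk(self) -> bytes:
        return self._file.read(CHUNK_SIZE)

def fetch_file(reader, out_path):
    # Pull chunks one at a time; each transfer stays under CHUNK_SIZE,
    # avoiding the superlinear blowup seen with single >2GB payloads.
    with open(out_path, "wb") as out:
        while chunk := ray.get(reader.next_chunk.remote()):
            out.write(chunk)
```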
2022-04-12 13:45:07 +01:00
Clark Zinzow
7d262f886d
[data] Preserve block order when batch mapping using the actor compute model. (#23837)
This PR preserves block order when transforming under the actor compute model. Before this PR, we were submitting block transformations in reverse order and creating the output block list in completion order.
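Conceptually, the fix is collecting results in submission order rather than completion order — a minimal illustration:

```
import ray

@ray.remote
def transform(block):
    return [x * 2 for x in block]

blocks = [[1, 2], [3, 4], [5, 6]]
refs = [transform.remote(b) for b in blocks]

# Collecting in completion order (e.g. via ray.wait) can permute blocks;
# collecting the refs in submission order preserves block order.
out = ray.get(refs)
```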
2022-04-12 08:42:27 +01:00
Siyuan (Ryans) Zhuang
e0a68f5076
[workflow] skip flaky tests (#23848)
* skip flaky tests
2022-04-11 23:08:56 -07:00
Siyuan (Ryans) Zhuang
d7ef546352
[core] Simplify options handling [Part 1] (#23127)
* handle options

* update doc

* fix serve
2022-04-11 20:49:58 -07:00
Antoni Baum
40646eecd4
[AIR] SklearnTrainer & Predictor interfaces (#23803)
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-04-11 15:11:42 -07:00
shrekris-anyscale
87d1f97e2e
[runtime_env] Add print statements to TestGC tests (#23716)
The tests in `TestGC` are flaky due to timeout ([example 1](https://buildkite.com/ray-project/ray-builders-branch/builds/6868#5540f19e-3669-46eb-a4ee-c71a1252f9ae), [example 2](https://buildkite.com/ray-project/ray-builders-branch/builds/6872#8912eb47-eb63-40c9-949f-a020a5f8f42d)):

![Screen Shot 2022-04-05 at 11 30 04 AM](https://user-images.githubusercontent.com/92341594/161825080-c2fe3887-f87c-4175-924f-80ae9b371157.png)

This change adds print statements to the `TestGC` tests to detect where they're hanging.
2022-04-11 16:09:41 -05:00
Amog Kamsetty
d33483de3d
[Tune] Don't include nan metrics for best checkpoint (#23820)
NaN values do not have a well-defined ordering. When sorting metrics to determine the best checkpoint, we should always filter out checkpoints that are associated with NaN values.
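A minimal sketch of that filtering step (the checkpoint/metric shapes here are assumptions):

```
import math

def best_checkpoint(checkpoints, metric, mode="max"):
    """checkpoints: list of (path, metrics_dict) pairs (shape assumed)."""
    # NaN compares false against everything, so drop it before ranking.
    valid = [
        (path, m) for path, m in checkpoints
        if metric in m and not math.isnan(m[metric])
    ]
    if not valid:
        return None
    pick = max if mode == "max" else min
    return pick(valid, key=lambda item: item[1][metric])
```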

Closes #23812
2022-04-11 12:51:00 -07:00
Balaji Veeramani
394e5ec1c2
[Train] Raise helpful error when required backend isn't installed (#23583)
Closes #22347
2022-04-11 10:46:32 -07:00
Antoni Baum
5dc958037e
[air] Refactor most_frequent SimpleImputer (#23706)
Takes care of the TODO left for SimpleImputer with most_frequent strategy by refactoring and optimising the logic for computing the most frequent value.
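For reference, a vectorized way to get the per-column mode with pandas — a sketch, not necessarily the PR's exact implementation:

```
import pandas as pd

def most_frequent_values(df: pd.DataFrame) -> pd.Series:
    # value_counts() sorts by frequency, so the first index entry is the
    # most frequent value in each column; no Python-level loop over rows.
    return df.apply(lambda col: col.value_counts().index[0])
```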

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-04-11 18:20:44 +01:00
Eric Liang
1ff874e8e8
[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817) 2022-04-10 16:12:53 -07:00
Qing Wang
c1dee15613
[xlang] Hotfix the import error for Python calling Java. (#23734)
The PR https://github.com/ray-project/ray/pull/22820 introduced an API breakage for xlang usage: `ray.java_actor_class` has been unavailable ever since.

This PR fixes it. We should remove these top-level APIs in 2.0 rather than in minor versions.
2022-04-10 15:36:12 +08:00
Eric Liang
858d607b19
[data] Fix small doc issues (#23813) 2022-04-09 12:09:08 -07:00
Yi Cheng
9655851f32
[ray] Remove RAY_USER_SETUP_FUNCTION (#23780)
`RAY_USER_SETUP_FUNCTION` is not a public API and is also not used by ray internally. This PR removes this feature.
2022-04-08 22:43:57 -07:00
Kai Fricke
8c2e471265
[AIR] Add RLTrainer interface, implementation, and examples (#23465)
This PR adds an RLTrainer to Ray AIR. It works for both offline and online use cases. In offline training, it leverages the datasets key of the Trainer API to specify a dataset reader input, used e.g. in Behavioral Cloning (BC). In online training, it is a wrapper around the RLlib trainables, making use of the parameter layering enabled by the Trainer API.
2022-04-08 17:16:42 -07:00
Siyuan (Ryans) Zhuang
6dc74f5808
[workflow] Deprecate "workflow.step" [Part 3 - events] (#23796)
* update workflow events
2022-04-08 16:09:55 -07:00
Jiajun Yao
e910f0abcf
[Test] Don't send usage data to server for unit tests (#23800)
Two tests were accidentally sending usage data to the server. This PR fixes that.
2022-04-08 16:02:30 -07:00
Amog Kamsetty
029517a037
[Train] Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
Using the local rank as the device id only works if there is exactly 1 GPU per worker. Instead we should be using ray.get_gpu_ids() to determine which GPU device to use for the worker.
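A simplified sketch of the fixed lookup (the real implementation handles more edge cases):

```
import ray
import torch

def get_device() -> torch.device:
    # ray.get_gpu_ids() returns the GPU IDs Ray assigned to this worker,
    # which stays correct with fractional GPUs or several workers per node.
    gpu_ids = ray.get_gpu_ids()
    if not gpu_ids:
        return torch.device("cpu")
    # Ray sets CUDA_VISIBLE_DEVICES to the assigned IDs, so the worker's
    # first assigned GPU is visible to torch as device index 0.
    return torch.device("cuda:0")
```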
2022-04-08 14:35:06 -07:00
xwjiang2010
615bb7a503
[tuner] add kwargs to be compatible with tune.run offerings. (#23791) 2022-04-08 14:30:40 -07:00
Qing Wang
42b4cc4e72
[ray collective] Use Ray internal kv for gloo group. (#23633)
Ray uses the GCS in-memory store by default instead of Redis, which means gloo groups don't work out of the box.
In this PR, we back the gloo group store with Ray's internal KV instead of the default RedisStore, so that gloo groups work by default.

This PR depends on another PR in pygloo https://github.com/ray-project/pygloo/pull/10
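A minimal sketch of a gloo rendezvous store backed by Ray's internal KV (the store interface expected by pygloo is assumed):

```
import time

from ray.experimental import internal_kv

class RayInternalKVStore:
    """Rendezvous store for a gloo group, keyed under a group prefix."""

    def __init__(self, group_name):
        self._prefix = f"gloo:{group_name}:"

    def set(self, key, data: bytes):
        internal_kv._internal_kv_put(self._prefix + key, data, overwrite=True)

    def get(self, key):
        return internal_kv._internal_kv_get(self._prefix + key)

    def wait(self, keys, timeout_s=30.0):
        # Poll until every key is present, as gloo's rendezvous expects.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if all(self.get(k) is not None for k in keys):
                return
            time.sleep(0.1)
        raise TimeoutError(f"Timed out waiting for keys: {keys}")
```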
2022-04-08 19:39:58 +08:00
Keshi Dai
c143391b34
Expose A100 in accelerators module (#23751)
NVIDIA_TESLA_A100 was added here, but it is not exposed in the accelerators module's __init__ file.
2022-04-07 11:29:27 -07:00
xwjiang2010
7c67a4f1d0
[tuner] update tuner doc (#23753) 2022-04-07 11:10:17 -07:00
Antoni Baum
434d457ad1
[tune] Improve missing search dependency info (#23691)
Replaces FLAML searchers with a dummy class that throws an informative error on init if FLAML is not installed, removes ConfigSpace import in BOHB example code, adds a note to examples using external dependencies.
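The dummy-class pattern, sketched (assuming FLAML's top-level searcher exports):

```
try:
    from flaml import CFO, BlendSearch  # noqa: F401
except ImportError:
    class _FLAMLImportError:
        """Placeholder that fails loudly on use instead of at import time."""

        def __init__(self, *args, **kwargs):
            raise ImportError(
                "FLAML must be installed to use this searcher. "
                "Run `pip install flaml` and try again."
            )

    CFO = BlendSearch = _FLAMLImportError
```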
2022-04-07 08:53:27 -07:00
shrekris-anyscale
a6bcb6cd1e
[serve] Create application.py (#23759)
The `Application` class is stored in `api.py`. The object is relatively standalone and is used as a dependency in other classes, so this change moves `Application` (and `ImmutableDeploymentDict`) to a new file, `application.py`.
2022-04-07 10:34:24 -05:00
shrekris-anyscale
0902ec537d
[serve] Include full traceback in deployment update error message (#23752)
When deployments fail to update, [Serve sets their status to UNHEALTHY and logs the error message](46465abd6d/python/ray/serve/deployment_state.py (L1507-L1511)). However, the message lacks a traceback, making it impossible to find what caused it. [For example](https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_SfGPJq8WWJUhAvmHHsDgJWUe?command-history-section=command_history):

```
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment:
'>' not supported between instances of 'NoneType' and 'int'.
```

It's not clear where `'>' not supported between instances of 'NoneType' and 'int'.` is being triggered.

The change includes the full traceback for this type of update failure. The new status message is easier to debug:

```
File "/Users/shrekris/Desktop/ray/python/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment A is UNHEALTHY: Failed to update deployment:
Traceback (most recent call last):
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1503, in update
    running_replicas_changed |= self._check_and_update_replicas()
  File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1396, in _check_and_update_replicas
    a = 1/0
ZeroDivisionError: division by zero
```

(I forced a divide-by-zero error to get this traceback).
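Mechanically, the change boils down to formatting the caught exception with `traceback.format_exc()` rather than `str(e)` — a minimal sketch:

```
import traceback

def update_deployment():
    # Hypothetical stand-in for the failing update step.
    return 1 / 0

try:
    update_deployment()
except Exception:
    # format_exc() captures the whole stack, not just the exception text.
    status_message = f"Failed to update deployment:\n{traceback.format_exc()}"
    print(status_message)
```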
2022-04-07 10:34:00 -05:00
shrekris-anyscale
64d98fb385
[serve] Add unit tests and better error messages to _store_package_in_gcs() (#23576)
This change adds new unit tests and a better error message to _store_package_in_gcs(). In particular, it tests the function's behavior when it fails to connect to the GCS.
2022-04-06 17:34:10 -07:00
Kai Fricke
d27e73f851
[ci] Pin prometheus_client to fix current test outages (#23749)
What: Pins prometheus_client to < 0.14.0, hopefully fixing today's CI outages
Why: New version of the python client (https://github.com/prometheus/client_python/releases) breaks our CI
2022-04-06 14:22:22 -07:00
Amog Kamsetty
8becbfa927
[Train] MLflow start run under correct experiment (#23662)
Start the MLflow run under the correct MLflow experiment.
2022-04-06 11:50:32 -07:00
Siyuan (Ryans) Zhuang
46465abd6d
[workflow] Deprecate "workflow.step" [Part 2 - most nested workflows] (#23728)
* remove workflow.step

* convert examples
2022-04-06 00:47:43 -07:00
Kai Fricke
c0e38e335c
Revert "Revert "[air] Better exception handling"" (#23733)
This reverts commit 5609f438dc.
2022-04-05 21:45:24 -07:00
Kai Fricke
5609f438dc
Revert "[air] Better exception handling (#23695)" (#23732)
This reverts commit fb50e0a70b.
2022-04-05 20:20:40 -07:00
xwjiang2010
99f64821b1
[tune] add tuner test (#23726)
Adds test for TorchTrainer+Tuner
2022-04-05 19:42:51 -07:00
Kai Fricke
fb50e0a70b
[air] Better exception handling (#23695)
What: Raise meaningful exceptions when invalid parameters are passed.
Why: We want to catch invalid parameters and guide users to use the API in the correct way.
2022-04-05 19:11:55 -07:00
Antoni Baum
252596af58
[AIR] Add config to Result, extend ResultGrid.get_best_config (#23698)
Adds a dynamic property to easily obtain the `config` dict from a `Result`, and extends the `ResultGrid.get_best_config` method for parity with `ExperimentAnalysis.get_best_trial` (allows using a mode and metric different from the ones set in the Tuner).
2022-04-05 16:08:05 -07:00
Stephanie Wang
9813f2cce4
[datasets] Unify Datasets primitives on a common shuffle op (#23614)
Currently Datasets primitives repartition, groupby, sort, and random_shuffle all use different internal shuffle implementations. This PR unifies them on a single internal ShuffleOp class. This class exposes static methods for map and reduce which must be implemented by the specific higher-level primitive. Then the ShuffleOp.execute method implements a simple pull-based shuffle by submitting one map task per input block and one reduce task per output block.
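A pull-based shuffle of this shape can be sketched as follows (names and signatures are illustrative; the real ShuffleOp also tracks block metadata and stats):

```
import ray

class ShuffleOp:
    """Each primitive (sort, groupby, ...) supplies its own map/reduce."""

    @staticmethod
    def map(block, num_reducers):
        # Partition the input block into one shard per reducer.
        raise NotImplementedError

    @staticmethod
    def reduce(*shards):
        # Combine the shards destined for one output block.
        raise NotImplementedError

    @classmethod
    def execute(cls, blocks, num_reducers):
        map_task = ray.remote(num_returns=num_reducers)(cls.map)
        reduce_task = ray.remote(cls.reduce)
        # One map task per input block, each returning one shard per reducer.
        shard_lists = [map_task.remote(b, num_reducers) for b in blocks]
        # One reduce task per output block, pulling its shard from every map.
        return [
            reduce_task.remote(*[shards[i] for shards in shard_lists])
            for i in range(num_reducers)
        ]
```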

Closes #23593.
2022-04-05 15:53:28 -07:00
Kai Fricke
dc994dbb02
[tune] Add RemoteTask based sync client (#23605)
If rsync/ssh is not available (as in Kubernetes setups), Tune previously had no fallback mechanism to synchronize trial directories to the driver. This PR introduces a `RemoteTaskSyncer` trial syncer that uses Ray remote tasks to ship file contents between nodes. The implementation uses tarfile to compress files for transfer, and it only transfers files that differ between the source and target directories to minimize network bandwidth usage.

The trial syncer works as follows:

1. It collects information about existing files in the target directory. This directory could be remote (when syncing up) or local (when syncing down).
2. It then schedules a `pack` task on the source node. This is always a remote task so that the future can be passed to the unpack task. The pack task packs only files that are missing from or differ in the target directory into a tarfile, which is returned as a bytes object.
3. An `unpack` task is scheduled on the target node. This is always a remote task so that the future can be awaited in a call to `wait()`.

A test is added to ensure that only modified files are transferred on subsequent sync ups/downs.

Finally, minor changes are made to the `Syncer`/`NodeSyncer` classes to allow passing `(ip, path)` tuples rather than rsync-style remote paths.
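A sketch of the pack/unpack flow described above (the manifest diff check and the node-IP resource pinning are illustrative, not the PR's exact code):

```
import io
import os
import tarfile

import ray

@ray.remote
def pack(source_dir, target_manifest):
    """Tar (gzipped) only files that are missing or differ in the target."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for root, _, files in os.walk(source_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, source_dir)
                # Hypothetical diff check: manifest maps relpath -> size.
                if target_manifest.get(rel) != os.path.getsize(path):
                    tar.add(path, arcname=rel)
    return buffer.getvalue()

@ray.remote
def unpack(tarball, target_dir):
    """Extract the packed files on the target node."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
        tar.extractall(target_dir)

# Pin tasks to nodes via the built-in node-IP custom resource; the pack
# future is passed straight into unpack, so bytes never touch the driver.
source_ip, target_ip = "10.0.0.1", "10.0.0.2"  # example node IPs
tar_ref = pack.options(resources={f"node:{source_ip}": 0.01}).remote(
    "/tmp/trial_dir", {}
)
ray.get(unpack.options(resources={f"node:{target_ip}": 0.01}).remote(
    tar_ref, "/tmp/trial_dir"
))
```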
2022-04-05 21:35:25 +01:00
Chris K. W
9b79048963
Update error message for @ray.method (#23471)
Updates @ray.method error message to match the one for @ray.remote. Since the client mode version of ray.method is identical to the regular ray.method, deletes the client mode version and drops the client_mode_hook decorator (guessing that the client copy was added before client_mode_hook was introduced).

Also fixes what appears to be a bug that prevented num_returns and concurrency_group from being specified at the same time (`assert len(kwargs) == 1`).
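For context, combining both options on one actor method looks roughly like this (the concurrency group itself is declared on the actor class):

```
import ray

ray.init()

@ray.remote(concurrency_groups={"io": 2})
class Downloader:
    @ray.method(num_returns=2, concurrency_group="io")
    async def fetch(self):
        return "header", "body"

downloader = Downloader.remote()
header_ref, body_ref = downloader.fetch.remote()
print(ray.get([header_ref, body_ref]))
```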

Closes #23271
2022-04-05 11:12:55 -07:00
Stephanie Wang
1c972d5d2d
[core] Spill at least the object fusion size instead of at most (#22750)
Copied from #22571:

Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, currently we spill at most the object fusion size when instead we should be spilling at least the fusion size. Then we use the max number of fused objects as a cap.

This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects, and we are under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.
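The selection logic lives in Ray's C++ object manager; paraphrased in Python, the changed behavior is roughly (names and shapes assumed):

```
def next_spill_batch(spillable, min_fusion_bytes, max_fused_objects,
                     other_spills_pending):
    """Sketch: pick objects to fuse into one spill operation."""
    batch, total_bytes = [], 0
    for obj in spillable:
        # Spill at *least* min_fusion_bytes, capped by max_fused_objects.
        if total_bytes >= min_fusion_bytes or len(batch) >= max_fused_objects:
            break
        batch.append(obj)
        total_bytes += obj["size"]
    if total_bytes < min_fusion_bytes and other_spills_pending:
        # Under the threshold with spills in flight: let those finish and
        # re-evaluate, rather than issuing a tiny spill now.
        return []
    return batch
```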

Increases some test timeouts to allow tests to pass.
2022-04-05 10:57:42 -07:00
Antoni Baum
ca6dfc8bb7
[AIR] Interface for HuggingFaceTorchTrainer (#23615)
Initial draft of the interface for HuggingFaceTorchTrainer.

One alternative for limiting the number of datasets in the `datasets` dict would be to have the user pass `train_dataset` and `validation_dataset` as separate arguments, though that would be inconsistent with TorchTrainer.
2022-04-05 10:32:13 -07:00
liuyang-my
bdd3b9a0ab
[Serve] Unified Controller API for Cross Language Client (#23004) 2022-04-05 09:20:02 -07:00
jon-chuang
9c950e8979
[Core] Placement Group: Fix Flaky Test placement_group_test_5 and Typo (#23350)
placement_group_test_5 is flaky. The reason is that it requests a PG with exactly as much object store memory as the node has; if the object store already holds any objects, PG scheduling fails.

Also fixes a bug (a typo).
2022-04-05 05:33:43 -07:00
Gagandeep Singh
11baa22c1e
Split test_advanced_n.py and enabled cluster tests (#23524) 2022-04-05 01:34:57 -07:00
Gagandeep Singh
8c87117bc3
Uniformly distributed tasks among actors to utilize full concurrency (#23416)
* Uniformly distributed tasks among actors to utilize full concurrency

* Added test to ensure all tasks are launched at the same time

* Applied linting format
2022-04-05 01:05:41 -07:00
Matti Picus
96948a4a30
WINDOWS: skip flaky test (#23557)
Continuation of #23462 to try to get test_ray_init to pass consistently in CI. The skipped test passes locally, so only skip it on CI.
2022-04-05 00:56:43 -07:00
Jiajun Yao
5f37231842
Remove yapf dependency (#23656)
Yapf has been replaced by black.
2022-04-04 21:50:04 -07:00