Commit graph

6997 commits

Author SHA1 Message Date
Kai Fricke
62414525f9
[tune] Optuna should ignore additional results after trial termination (#23495)
In rare cases (#19274), and possibly with old versions of Ray, buffered results can cause on_trial_complete to be called multiple times with the same trial ID. Optuna should handle this gracefully and discard the additional results.
2022-03-28 20:07:41 +01:00
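The idea in the commit above (ignore results that arrive after a trial has already completed) can be sketched roughly as follows. This is a minimal illustration, not the actual Tune or Optuna code: `CompletionTracker` and its attributes are hypothetical names introduced here.

```python
class CompletionTracker:
    """Hypothetical helper: remembers which trial IDs have already completed."""

    def __init__(self):
        self._completed = set()
        self.recorded = {}

    def on_trial_complete(self, trial_id, result=None):
        # Discard any result that arrives after the trial has already finished.
        if trial_id in self._completed:
            return False
        self._completed.add(trial_id)
        self.recorded[trial_id] = result
        return True


tracker = CompletionTracker()
first = tracker.on_trial_complete("trial_1", {"loss": 0.3})
second = tracker.on_trial_complete("trial_1", {"loss": 0.1})  # buffered duplicate
```

Only the first call records a result; the duplicate is dropped without raising.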
shrekris-anyscale
aae144d7f9
[serve] Make Serve CLI and serve.build() non-public (#23504)
This change makes `serve.build()` non-public and hides the following Serve CLI commands:
* `deploy`
* `config`
* `delete`
* `build`
2022-03-28 10:40:57 -07:00
Kai Fricke
1465eaa306
[tune] Use new Checkpoint interface internally (#22801)
Follow up from #22741, also use the new checkpoint interface internally. This PR is low friction and just replaces some internal bookkeeping methods.

With the new Checkpoint interface, there is no need to revamp the save/restore APIs completely. Instead, we will focus on the bookkeeping part, which takes place in Ray Tune's and Ray Train's checkpoint managers. These will be consolidated in a future PR.
2022-03-28 18:33:40 +01:00
mwtian
d1ef498638
[Python Worker] load actor dependency without importer thread (#23383)
Import actor dependency when not found, so actor dependencies can be imported without the importer thread.

The remaining blockers to removing the importer thread are supporting running a function on all workers (`run_function_on_all_workers()`) and raising a warning when the same function or class is exported too many times.
2022-03-27 15:09:08 -07:00
shrekris-anyscale
65d72dbd91
[serve] Make serve.shutdown() shut down remote Serve applications (#23476) 2022-03-25 18:27:34 -05:00
Amog Kamsetty
7fd7efc8d9
[AIR] Do not deepcopy RunConfig (#23499)
RunConfig is not a tunable hyperparameter, so we do not need to deep copy it when merging parameters with Ray Tune's param_space.
2022-03-25 13:12:17 -07:00
Edward Oakes
cf7b4e65c2
[serve] Implement serve.build (#23232)
The Serve REST API relies on YAML config files to specify and deploy deployments. This change introduces `serve.build()` and `serve build`, which translate Pipelines to YAML files.

Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-03-25 13:36:59 -05:00
shrekris-anyscale
be216a0e8c
[serve] Raise error in test_local_store_recovery (#23444) 2022-03-25 13:36:51 -05:00
dependabot[bot]
e69f7f33ee
[tune](deps): Bump optuna in /python/requirements/ml (#19669)
Bumps [optuna](https://github.com/optuna/optuna) from 2.9.1 to 2.10.0.
- [Release notes](https://github.com/optuna/optuna/releases)
- [Commits](https://github.com/optuna/optuna/compare/v2.9.1...v2.10.0)

---
updated-dependencies:
- dependency-name: optuna
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-03-25 17:58:22 +00:00
Jan Weßling
f78404da4a
[serve] Add ensemble model example to docs (#22771)
Adds ensemble model examples to the documentation. A user requested this, and no existing guide outlined how to create higher-level ensemble models.

Co-authored-by: Jiao Dong <sophchess@gmail.com>
2022-03-25 11:17:54 -05:00
ddelange
e109c13b83
[ci] Clean up ray-ml requirements (#23325)
In https://github.com/ray-project/ray/blob/ray-1.11.0/docker/ray-ml/Dockerfile, the order of pip install commands currently matters (potentially a lot). It would be good to run one big pip install command to avoid ending up with a broken env.

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-03-25 15:59:54 +00:00
Maxim Egorushkin
3e7ef04203
Don't rsync checkpoint_tmp directories. (#18434)
checkpoint_tmpxxxxxx directories must not be synced from the worker nodes to the head node.

Co-authored-by: Maxim Egorushkin <maxim.egorushkin@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-03-25 15:50:38 +00:00
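As a rough sketch of the exclusion described in the commit above, temporary checkpoint directories can be filtered by name before syncing. The pattern `checkpoint_tmp*` is assumed from the directory name in the message, and `dirs_to_sync` and `rsync_command` are hypothetical helpers, not Ray functions.

```python
from fnmatch import fnmatch

# Assumption: temporary checkpoint directories all start with this prefix.
EXCLUDE_PATTERN = "checkpoint_tmp*"


def dirs_to_sync(dir_names):
    """Return only the directory names that should be synced to the head node."""
    return [name for name in dir_names if not fnmatch(name, EXCLUDE_PATTERN)]


def rsync_command(source, target):
    """Build an rsync invocation that skips temporary checkpoint directories."""
    return ["rsync", "-a", f"--exclude={EXCLUDE_PATTERN}", source, target]


kept = dirs_to_sync(["checkpoint_000010", "checkpoint_tmp1a2b3c", "params.json"])
command = rsync_command("worker:/trial_dir/", "/head/trial_dir/")
```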
Antoni Baum
ebb592b2ca
[Tune/Train] Make MLflowLoggerUtil copyable (#23333)
Makes sure mlflow module is not saved as an attribute in MLflowLoggerUtil, which was causing an exception when running deepcopy on the callback.
2022-03-24 17:48:02 -07:00
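The failure mode in the commit above can be reproduced in miniature: an object that stores a module as an attribute cannot be deep-copied, while one that looks the module up on demand can. This is a sketch under assumptions. `json` stands in for `mlflow`, the classes `EagerUtil` and `LazyUtil` are hypothetical, and the real fix may be structured differently.

```python
import copy
import json  # stand-in for the mlflow module in this sketch


class EagerUtil:
    """Stores the module on the instance, which makes deepcopy fail."""

    def __init__(self):
        self._module = json


class LazyUtil:
    """Looks the module up on demand instead of storing it."""

    def __init__(self, experiment_name):
        self.experiment_name = experiment_name

    @property
    def _module(self):
        import json as module  # resolved lazily, so there is nothing to copy

        return module


def can_deepcopy(obj):
    """Return True if copy.deepcopy(obj) succeeds, False if it raises TypeError."""
    try:
        copy.deepcopy(obj)
        return True
    except TypeError:
        return False


eager_ok = can_deepcopy(EagerUtil())
lazy_ok = can_deepcopy(LazyUtil("my_experiment"))
```

Module objects cannot be pickled or copied, so `eager_ok` is `False` while `lazy_ok` is `True`.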
Max Pumperla
60054995e6
[docs] fix doctests and activate CI (#23418) 2022-03-24 17:04:02 -07:00
Siyuan (Ryans) Zhuang
d39cef5725
[workflow] Deprecate "workflow.step" [Part 1 - most common cases] (#23456)
* replacing workflow decorator
2022-03-24 13:11:46 -07:00
Stephanie Wang
d67a4f5c88
[datasets] Fix missing arg in RandomIntRowDatasource (#23255)
Previously failed with
```
E                       ray.exceptions.RayTaskError(TypeError): ray::_prepare_read() (pid=166631, ip=10.103.212.102)
E                         File "/home/swang/ray/python/ray/data/read_api.py", line 902, in _prepare_read
E                           return ds.prepare_read(parallelism, **kwargs)
E                         File "/home/swang/ray/python/ray/data/datasource/datasource.py", line 331, in prepare_read
E                           input_files=None,
E                       TypeError: __init__() missing 1 required keyword-only argument: 'exec_stats'
```

This PR adds the missing arg.
2022-03-24 13:05:01 -07:00
xwjiang2010
4f34b53e83
[AIR] Add tuner test. (#23364)
Add tuner tests.
   
These tests mainly cover non-Ray-client mode, including successful runs, failures on both the driver and trainer sides, and resuming.
   
One issue surfaced while writing the tests (which probably means the API is not quite right): should RunConfig be supplied in Tuner.init or in Tuner.fit()? At least for some fields in RunConfig (e.g. callbacks), we want to be able to change them across runs. In addition, with the current implementation it is not possible to checkpoint "stateful" callbacks, which could confuse our users. cc @ericl for API input. See "test_tuner_with_xgboost_trainer_driver_fail_and_resume" (search for hack).

The PR also cleans up some API docs.

Fixes some bugs in loading trials from a checkpoint. Namely, get_default_resource was called with an empty config, because self.config is only loaded through __setstate__, which happens later than get_default_resource. The call is probably unnecessary anyway, given that self.placement_group_factory is already set. This PR removes the call to get_default_resource when loading trials from a checkpoint.
2022-03-24 14:54:21 +00:00
mwtian
26f1a7ef7d
[Core] Account for spilled objects when reporting object store memory usage (#23425) 2022-03-23 22:25:22 -07:00
Linsong Chu
63d6884509
[workflow] Align the behavior of workflow's max_retries with remote function's max_retries (#22903)
To address the issue https://github.com/ray-project/ray/issues/22824

Basically, the current behavior of `max_retries` in workflow differs from that in remote functions in the following ways:
1. workflow's max_retries is not the number of retries, but the number of total tries.
2. workflow's max_retries does not allow "-1" (infinite retries) while remote function's max_retries does.

This PR changes the behavior of `max_retries` in workflow to be consistent with `max_retries` in remote functions:
1. make max_retries truly the maximum number of retries (i.e. total tries = original try + max retries)
 - [x] implementation
 - [x] update logging
 - [x] update tests
2. make max_retries accept infinite retries (i.e. `max_retries=-1`)
2022-03-23 22:11:44 -07:00
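The retry semantics described in the commit above (total tries = original try + max retries, with `-1` meaning unlimited retries) can be illustrated with a small stand-alone sketch. `run_with_retries` and `make_flaky` are hypothetical helpers written for this illustration; they are not part of the Ray workflow API.

```python
def run_with_retries(fn, max_retries=3):
    """Call fn until it succeeds, allowing up to max_retries retries after the first try.

    max_retries=-1 means retry indefinitely. Returns (result, attempts).
    """
    if max_retries < -1:
        raise ValueError("max_retries must be >= -1")
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except Exception:
            retries_used = attempts - 1
            # Give up once the retry budget is exhausted (never for -1).
            if max_retries != -1 and retries_used >= max_retries:
                raise


def make_flaky(failures):
    """Build a function that fails `failures` times and then returns "ok"."""
    state = {"left": failures}

    def fn():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")
        return "ok"

    return fn


result, attempts = run_with_retries(make_flaky(2), max_retries=2)
```

With two failures and `max_retries=2`, the call succeeds on the third attempt: one original try plus two retries.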
Eric Liang
38925f60d2
Add a get_if_exists option for simpler creation of named actors (#23344)
Getting or creating a named actor is a common pattern, but how to achieve it is somewhat esoteric. Add a utility function and test that it doesn't cause any scary error messages.

Actor.options(name="my_singleton", get_if_exists=True).remote(args)
2022-03-23 22:02:58 -07:00
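The get-or-create behavior in the commit above can be modeled without a cluster. The sketch below is only an in-process analogy: `NamedRegistry` and `Counter` are hypothetical stand-ins for Ray's named-actor machinery, and the sketch assumes that an existing name is returned as-is and that the new constructor arguments are ignored in that case.

```python
class NamedRegistry:
    """Hypothetical in-process registry standing in for a cluster's named actors."""

    def __init__(self):
        self._instances = {}

    def create(self, cls, name, *args, get_if_exists=False):
        if name in self._instances:
            if get_if_exists:
                # Return the existing instance; the new args are not used.
                return self._instances[name]
            raise ValueError(f"name {name!r} is already taken")
        instance = cls(*args)
        self._instances[name] = instance
        return instance


class Counter:
    def __init__(self, start):
        self.value = start


registry = NamedRegistry()
a = registry.create(Counter, "my_singleton", 1, get_if_exists=True)
b = registry.create(Counter, "my_singleton", 99, get_if_exists=True)
```

Both calls return the same object, so callers do not need a separate "look up, then create on failure" step.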
Dmitri Gekhtman
bc98afcdf8
Test of KubeRay autoscaler integration (#23365)
This PR adds a test of KubeRay autoscaler integration to the Ray CI.

- Tests scaling with autoscaler.sdk.request_resources
- Tests autoscaler response to RayCluster CR change
2022-03-23 18:18:48 -07:00
Simon Mo
5c2ea1d5f4
[Serve][Tests] Deflake by disable test_runtime_env on OSX (#23436)
https://github.com/ray-project/ray/pull/23380 made the test flakier 
2022-03-23 15:29:26 -07:00
Dmitri Gekhtman
f91a134dc6
[core/autoscaler] Restore use_gcs_for_bootstrap (#23413)
Certain external integrations rely on ray._private.use_gcs_for_bootstrap to determine if Ray is using the gcs to bootstrap. The current version of Ray always uses the gcs to bootstrap, so this should just return True.
2022-03-23 10:39:23 +00:00
dependabot[bot]
05bfcdbaf8
[tune](deps): Bump ax-platform in /python/requirements/ml (#23098)
Bumps [ax-platform](https://github.com/facebook/Ax) from 0.2.1 to 0.2.4.
- [Release notes](https://github.com/facebook/Ax/releases)
- [Changelog](https://github.com/facebook/Ax/blob/main/CHANGELOG.md)
- [Commits](https://github.com/facebook/Ax/compare/0.2.1...0.2.4)

---
updated-dependencies:
- dependency-name: ax-platform
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-23 10:13:00 +00:00
Amog Kamsetty
6d776976c1
[Train] Fix multi node horovod bug (#22564)
Closes #20956
2022-03-22 16:22:53 -07:00
Simon Mo
67a4450d69
[Serve] Test that EveryNode configuration allow scale down (#23408) 2022-03-22 16:18:59 -07:00
Archit Kulkarni
04ff0a9398
Deflake serve:test_runtime_env by splitting into two files (#23380)
- Also removes the unnecessary "post-wheel-build" tag, which is only used for conda tests.
2022-03-22 15:47:38 -05:00
Siyuan (Ryans) Zhuang
aea93e4a1f
[tune] simpler get_next_trial (#23396)
* simpler next_trial

* update
2022-03-22 12:27:53 -07:00
Chen Shen
3efe437298
[Dataset] optionally pin pipeline actors on driver node. (#23397)
Pinning pipeline actors on the driver node doesn't work when Ray Datasets is used with Ray Client.
2022-03-22 10:31:23 -07:00
Matti Picus
b190d7c214
[WINDOWS] skip flaky test that often fails (#23204) 2022-03-22 09:34:15 -07:00
xwjiang2010
587e46611f
[tuner] return new checkpoint in result grid. (#23280)
Closes #23295
Closes #23303
2022-03-22 15:21:53 +00:00
Eric Liang
f31e4099ed
Add storage-based spilling backend (#23341)
This adds a ray-storage based spilling backend, which can be enabled by setting the spill config to {"type": "ray_storage", "buffer_size": N}. This will cause Ray to spill to the configured storage (pyarrow FS).

In a future PR, I'll add documentation and deprecate the existing smart_open backend.
2022-03-21 19:17:42 -07:00
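The spill config from the commit above could be built as a JSON string along these lines. This is a sketch under assumptions: the keys are copied from the commit message, `make_spill_config` is a hypothetical helper, and the way the string is handed to Ray (shown only in a comment) is assumed to match how existing spilling backends are configured.

```python
import json


def make_spill_config(buffer_size):
    """Serialize a spill config using the shape given in the commit message."""
    return json.dumps({"type": "ray_storage", "buffer_size": buffer_size})


spill_config = make_spill_config(1024 * 1024)

# Assumption: the string is passed the same way as other spilling backends, e.g.
#   ray.init(_system_config={"object_spilling_config": spill_config})
```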
Richard Liaw
1fe110f8f4
[ml] Add a starter page for docstrings (#23312) 2022-03-21 17:20:45 -07:00
Kai Fricke
9d8ce1db59
[tune] Fix docstrings according to code style guide (#23375)
According to https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#code-style, typed arguments should not repeat the types in the docstrings. This PR removes these annotations and adds further annotations in some places where they were missing before.
2022-03-21 17:57:50 +00:00
Amog Kamsetty
54d1f6c704
[CI] Better error message for fake multi-node cluster (#22647)
Differentiate between a "resources not available" error vs. other types of errors.

This happened to me when I was trying out the fake cluster: I was using Ray client incorrectly, but because we were catching a generic `except Exception`, the failure was raised as "Timed out waiting for resources".
2022-03-21 17:35:56 +00:00
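The distinction drawn in the commit above can be sketched as a polling loop that retries only on a dedicated "resources not available" error and lets every other exception surface unchanged. `ResourcesNotReadyError` and `wait_for_resources` are hypothetical names for this illustration, not the identifiers used in Ray.

```python
class ResourcesNotReadyError(Exception):
    """Hypothetical error: the cluster does not yet have the requested resources."""


def wait_for_resources(check, attempts=3):
    """Poll `check` up to `attempts` times, retrying only when resources are missing."""
    for _ in range(attempts):
        try:
            check()
            return "ready"
        except ResourcesNotReadyError:
            # Expected while the cluster is still scaling up; keep polling.
            continue
        # Any other exception is not caught here and propagates to the caller.
    raise TimeoutError("Timed out waiting for resources")


def misconfigured_client():
    raise ValueError("client used incorrectly")


try:
    wait_for_resources(misconfigured_client)
    outcome = "ready"
except ValueError as exc:
    outcome = f"real error: {exc}"
```

A misuse such as the one described in the message is then reported as its own error instead of being masked as a timeout.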
Edward Oakes
f22a34bd4f
Restore "[Serve] Implement Default DAGDriver (#23301)" (#23373) 2022-03-21 10:35:00 -07:00
Kai Fricke
b64452bc63
[tune] Add multinode sync test (#23229)
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
2022-03-21 17:02:17 +00:00
Guyang Song
69af9764b2
[runtime env] URI reference refactor (#22828)
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528

Future work
- We don't remove the `RuntimeEnvUris` from the `RuntimeEnv` protobuf in this PR because GCS also uses those URIs to do GC via runtime_env_manager. We should clean this up as well.
- Ray client server shouldn't interact with the agent directly. Alternatively, the Ray client server should also decrease the reference count.
- Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job, so we should make sure this function is idempotent. Can we change this logic so that the function is called only once?
2022-03-21 11:21:15 -05:00
Stephanie Wang
e507aa5758
Revert "[Serve] Implement Default DAGDriver (#23301)" (#23358)
This reverts commit 91a1c3411f.
2022-03-21 10:54:52 -05:00
Larry
81dcf9ff35
[Placement Group] Make PlacementGroupID generate from JobID (#23175) 2022-03-21 17:09:16 +08:00
Avnish Narayan
e008a48ef2
[release tests] Pin gym everywhere (#23349) 2022-03-19 02:52:54 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
Simon Mo
91a1c3411f
[Serve] Implement Default DAGDriver (#23301) 2022-03-18 18:07:39 -07:00
Siyuan (Ryans) Zhuang
65cc877ad8
[workflow] Ensure that DAGs are dereferenced like ObjectRefs in Ray tasks (#23320) 2022-03-18 17:02:15 -07:00
Jiao
9b38b6de47
[Serve] [Pipeline] Default all DeploymentNode route_prefix to None, and "/" for the root driver (#23289) 2022-03-18 16:56:49 -07:00
shrekris-anyscale
c668039020
[serve] Restore "Get new handle to controller if killed" (#23283) (#23338)
#23336 reverted #23283. #23283 did pass CI before merging. However, when it merged, it began to fail because it used commands that were outdated on the master branch in `test_cli.py` (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its test commands.
2022-03-18 18:40:08 -05:00
Jiao
49e0ab2f58
[Serve] [Pipeline] Use ServeSchema for deployment prevent config got overridden (#23324) 2022-03-18 15:25:32 -07:00
mwtian
909cdea3cd
[Python Worker] add feature flag to support forking from workers (#23260)
Make sure Python dependencies can be imported on demand, without the background importer thread. Use cases are:

- If the pubsub notification for a new export is lost, importing can still be done.
- Allow not running the background importer thread, without affecting Ray's functionalities.

Add a feature flag to support forking from Python workers, by

- Enable fork support in gRPC.
- Disable importer thread and only leave the main thread in the Python worker. The importer thread will not run after forking anyway.
2022-03-18 14:47:18 -07:00
Junwen Yao
8fff665455
[Train] Add torch data prefetch benchmark example (#22974)
Add a benchmark example for the auto pipeline functionality for host to device data transfer.
2022-03-18 13:27:26 -07:00
Eric Liang
c4b52d34ca
Initial PR for internal storage API (#22889) 2022-03-18 12:32:40 -07:00