Commit graph

10425 commits

Author SHA1 Message Date
Simon Mo
32a4f48aa2
[CI] Don't test tune dashboard (#20452) 2021-11-16 15:07:56 -08:00
Richard Liaw
cf357f6bce
[docs] Add a talks section for ray.data (#20444) 2021-11-16 14:30:08 -08:00
Kai Fricke
05d21497db
[rllib/tune] Fix durable trainable in trainer template, add release test (#20422) 2021-11-16 20:52:42 +00:00
Edward Oakes
48b87d5830
[serve] Fix actor resources error in failure test (#20400) 2021-11-16 12:24:54 -08:00
Eric Liang
12a4489e30
Revert "[core] Nested task support via task depth + backpressure" (#20438)
Reverts ray-project/ray#17887

This is causing several tests to be flaky (test_multinode_failures, test_virtual_actor, test_component_failures_2).
2021-11-16 11:14:45 -08:00
gjoliver
6e787f70e0
[Rllib/release] Disable throughput check (#20387)
Throughput check was enabled by d8a61f801f prematurely.
E.g., see state before the commit:
a931076f59/rllib/utils/test_utils.py (L740-L741)
2021-11-16 11:05:51 -08:00
Chen Shen
33c1ee0e86
[Core][actor out-of-order execution 5/n] implement out-of-order scheduling queue #20176
This PR belongs to the stack that enables out of order execution. Previous PR: #20160, Next PR: #20177

In this PR specifically, we implemented a simple out_of_order_scheduling queue which queues the task for execution as soon as the dependency is ready.
2021-11-16 10:53:51 -08:00
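The behaviour described in the commit above (a task is queued for execution as soon as its dependencies are ready, regardless of submission order) can be illustrated with a small toy sketch. This is not Ray's implementation, which lives in the C++ core; the class `OutOfOrderQueue` and its methods are hypothetical names used only for illustration.

```python
from collections import deque


class OutOfOrderQueue:
    """Toy queue (hypothetical): a task becomes runnable once all deps are ready."""

    def __init__(self):
        self._waiting = {}        # task name -> set of unresolved dependency ids
        self._runnable = deque()  # tasks whose dependencies are all ready

    def add(self, task, deps):
        pending = set(deps)
        if pending:
            self._waiting[task] = pending
        else:
            # No dependencies: runnable immediately, whatever came before it.
            self._runnable.append(task)

    def dependency_ready(self, dep):
        # Move every task that no longer waits on anything to the runnable queue.
        for task in list(self._waiting):
            pending = self._waiting[task]
            pending.discard(dep)
            if not pending:
                del self._waiting[task]
                self._runnable.append(task)

    def pop_runnable(self):
        return self._runnable.popleft() if self._runnable else None


queue = OutOfOrderQueue()
queue.add("a", ["x"])  # submitted first, but blocked on x
queue.add("b", [])     # submitted second, no dependencies
queue.add("c", ["y"])  # submitted third, blocked on y
queue.dependency_ready("y")
order = [queue.pop_runnable() for _ in range(2)]
```

In this sketch `b` and `c` run before `a`, even though `a` was submitted first, because `a` is still waiting on `x`.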
Chen Shen
f02b53a810
[Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue (#20150)
Why are these changes needed?
This is the third PR in the stack that supports out-of-order execution for threaded/async actors. Previous PR: #20149. Next PR: #20160.
At a high level, threaded and async actors already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for detailed discussion.

In this PR, we implemented the out-of-order queue that supports out-of-order execution. Conceptually it's very simple: it sends each request as soon as its dependencies are resolved.
2021-11-16 10:48:49 -08:00
Simon Mo
5f2b035bba
Pin Redis version to < 4.0.0 (#20430)
## Why are these changes needed?

This pin is needed to fix `test_output` on master, which broke when 4.0.0 was released. 

It may also fix the windows build (unsure). 

2021-11-16 10:48:36 -08:00
Alex Wu
8f21cdbddb
Revert "[dependencies] Use redis[hiredis] in setup.py" (#20435)
Reverts ray-project/ray#20423

`hiredis` will break our M1 support right now.
2021-11-16 10:46:22 -08:00
Kai Fricke
6ec256122c
[dependencies] Use redis[hiredis] in setup.py (#20423)
This is recommended by `redis-py` and as a side effect gets rid of a current error in `test_output` for the minimal dependency test (e.g. https://buildkite.com/ray-project/ray-builders-branch/builds/4746#7444b5d0-87c3-4998-b722-1cbc2d9fe7e3)
2021-11-16 10:25:36 -08:00
Amog Kamsetty
7e597814aa
[Release] Fix app config for horovod_tests (#20393)
Fixes `horovod_test` weekly test

Closes https://github.com/ray-project/ray/issues/20382
2021-11-16 09:06:42 -08:00
Antoni Baum
3f9ded55f7
[tune] Merge Analysis into ExperimentAnalysis (#20197)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-11-16 16:47:12 +00:00
Antoni Baum
c097f64c79
[tune] Drop 0 value keys from PGF (#20279)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-11-16 16:44:09 +00:00
Amog Kamsetty
4f88796d5a
[Train] Move to beta (#20378) 2021-11-16 08:19:30 -08:00
Simon Mo
ca90c63483
[Serve] Add serve failure test to CI (#20392) 2021-11-16 08:12:08 -08:00
Kai Fricke
693063d6f8
[ci/release] fix exit code (use value, not object) (#20427) 2021-11-16 15:15:39 +00:00
Kai Fricke
8a6c936aa8
[tune] Fix syncer=None not disabling trial-to-driver syncing (#20418) 2021-11-16 14:36:23 +00:00
Sven Mika
f82880eda1
Revert "Revert [RLlib] POC: Deprecate build_policy (policy template) for torch only; PPOTorchPolicy (#20061) (#20399)" (#20417)
This reverts commit 90dc5460d4.
2021-11-16 14:49:41 +01:00
Qing Wang
6504ad6bb2
[xlang] Add named actor xlang tests. (#20368)
We add named actor xlang tests, covering both getting a Java named actor in Python and getting a Python named actor in Java.

Related issue number
#19794
2021-11-16 21:42:05 +08:00
SangBin Cho
5ec63ccc5f
[Regression test] Placement group long running test (#20251)
Why are these changes needed?
In the past, there was a regression in which placement group creation time got slower over time. I believe the issue is fixed on master, but this PR verifies that it is actually fixed.

This PR adds a long-running test for the placement group. The test has 2 purposes:

* Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations, then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round.
* Make sure placement group removal / creation works consistently for a long time without issues.
Q: Should we make it a real long running test? (that runs for a day?)
2021-11-16 04:21:18 -08:00
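The P50 comparison described in the commit above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `creation_time_regressed`, the choice of the last 20 samples as the comparison window, and the `tolerance` factor of 2.0 are all hypothetical and are not taken from the actual test.

```python
import statistics


def creation_time_regressed(durations, warmup=20, tolerance=2.0):
    """Return True if the late P50 exceeds `tolerance` times the initial P50.

    `durations` holds per-iteration creation times in seconds. The window
    size and tolerance are assumptions made for this sketch.
    """
    if len(durations) < 2 * warmup:
        raise ValueError("need at least 2 * warmup samples")
    initial_p50 = statistics.median(durations[:warmup])
    final_p50 = statistics.median(durations[-warmup:])
    return final_p50 > tolerance * initial_p50


# Synthetic timings: one run stays flat, the other slows down halfway through.
steady = [0.010] * 100
slowing = [0.010] * 50 + [0.050] * 50
steady_regressed = creation_time_regressed(steady)
slowing_regressed = creation_time_regressed(slowing)
```

Using the median rather than the mean keeps a single slow iteration from failing the check.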
SangBin Cho
137aec04c0
[Core] Better logs job message failure (#20363)
## Why are these changes needed?

There's one user who has an issue where one of the raylets cannot schedule tasks anymore because `num_worker_not_started_by_job_config_not_exist` > 0.

This PR adds better log messages to help determine whether the root cause is that the job information is not properly propagated from GCS to the raylet through Redis pubsub.

2021-11-16 04:20:49 -08:00
Stefan Schneider
2b3d0c691f
[RLlib] Document and extend action mask example. (#20390)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-16 13:20:41 +01:00
Kai Fricke
3e6ba5d6d2
Revert "Revert [RLlib] POC: PGTrainer class that works by sub-classing, not trainer_template.py." (#20285)
* Revert "Revert "[RLlib] POC: `PGTrainer` class that works by sub-classing, not `trainer_template.py`. (#20055)" (#20284)"
This reverts commit 246787cdd9.
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-16 12:26:47 +01:00
Eric Liang
a1d78088e6
[Hotfix] Fix flaky test_basic_workflows_2
This test seems very flaky after the nested tasks PR merged. I think it's because the test is broken, and we made one of the branches more likely.
2021-11-16 01:11:36 -08:00
Yi Cheng
a4e187c0e7
[gcs] Update function table to use internal kv (#20152)
## Why are these changes needed?
This is part of the Redis removal. This PR removes the Redis kv from the function table.
The rpush-related code is not updated in this PR.

2021-11-15 23:34:41 -08:00
Eric Liang
460cf86858
Split blocks automatically into 500MB chunks on file read and transformation (#20235)
This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task may consume too much intermediate Python heap memory, or too much object store shared memory for one block.
2021-11-15 22:25:11 -08:00
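The idea of keeping block size bounded can be illustrated with a minimal sketch. It is not the Ray Datasets implementation: the helper `split_into_blocks`, the constant `TARGET_MAX_BLOCK_BYTES`, and the use of `sys.getsizeof` as a size estimate are assumptions made for illustration only.

```python
import sys

TARGET_MAX_BLOCK_BYTES = 500 * 1024 * 1024  # ~500 MiB, per the commit message


def split_into_blocks(rows, max_bytes=TARGET_MAX_BLOCK_BYTES, size_of=sys.getsizeof):
    """Yield lists of rows whose estimated total size stays within max_bytes.

    A single row larger than max_bytes still gets a block of its own.
    """
    block, block_bytes = [], 0
    for row in rows:
        row_bytes = size_of(row)
        # Start a new block if adding this row would exceed the limit.
        if block and block_bytes + row_bytes > max_bytes:
            yield block
            block, block_bytes = [], 0
        block.append(row)
        block_bytes += row_bytes
    if block:
        yield block


# Small-scale usage: ten 40-byte rows with a 100-byte limit per block.
blocks = list(split_into_blocks([b"x" * 40] * 10, max_bytes=100, size_of=len))
block_lengths = [len(b) for b in blocks]
```

Because the generator yields each block as soon as it is full, at most one block's worth of rows is buffered at a time, which is the property that helps avoid the OOM situations mentioned above.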
Siyuan (Ryans) Zhuang
3e9cd4248e
[workflow] Refactoring workflow to make it easier to follow the logic (#20349)
* update

* cleanup
2021-11-15 21:02:33 -08:00
Yiran Wang
f4e8319eaa
Remove .boto files that are no longer needed during docker build (#20407)
## Why are these changes needed?

The .boto files are already added to the base image and ACL'ed to root; adding them again during the app config build causes permission issues.

2021-11-15 20:49:33 -08:00
Stephanie Wang
31eb385426
Revert "Revert "Revert "[core] Fail objects when pull/reconstruction hangs (#19789)" (#19904)" (#20120)" (#20406)
This reverts commit 0f57a9a105.
2021-11-15 20:36:22 -08:00
Alex Wu
75f421a3fd
[core] Nested task support via task depth + backpressure (#17887)
* needs depth

* depth

* .

* .

* .

* lint

* .

* lint

* fix tests

* .

* .

* .

* .

* cleanup

* .

* tests

* .

* more tests

* fix rest(?) of tests

* cleanup

* .

* .

* .

* .

* lint

* fix test basic

* fix ref counting?

* cleanup

* lint

* .

* pass dataset pipeline test

* .

* stephanie's comments + fix tests

* cleanup

* cleanup

* minor cleanup, then fix merge conflict

* lint

* cast

* feature flag

* lint

* lint

* refactor

* needs cleanup

* should pass

* lint

* .

* .

* .

* work?

* .

* works?

* lint

* work?

* .

* fix cpp tests

* .

* .

* split test

* fix windows?

* fix windows?

* fix test + check

* .

* all passing

* tests

* lint

* cleanup

* .

* most stephanie comments

* lint

* remove timer

* .

* allowed - capacity

* .

* everything except barrier

* add guard

* works

* lint

* works?

* debug string

* last comment?

* short comments

* most comments

* lint

* done?

* done?

* .

* .

* .

* .

* done?

* done?

* update

* lint

* fix last test

* .

* .

* .

* .

* .

* .

* debug

* .

* .

* .

* .

* fix type

* .

* .

* cleanup

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-11-15 17:39:50 -08:00
Edward Oakes
48bc1af2da
[job submission] Remove DOES_NOT_EXIST status (#20354) 2021-11-15 16:57:32 -08:00
mwtian
1dd8b3d2bc
[Build] Remove debug info from Ray libraries. (#20389)
## Why are these changes needed?
The Ray wheel size limit is still 100MB. Removing debug symbols decreases the size of the Ray Linux wheels.

2021-11-15 16:40:48 -08:00
Amog Kamsetty
90dc5460d4
Revert "[RLlib] POC: Deprecate build_policy (policy template) for torch only; PPOTorchPolicy (#20061)" (#20399)
This reverts commit 5b1c8e46e1.
2021-11-15 16:11:35 -08:00
iasoon
171ad62e30
Link to the documentation on contributing from CONTRIBUTING.rst (#19396)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-11-15 15:34:18 -08:00
Antoni Baum
ec81f52061
[Docs] Fix typo in C++ Placement Group example (#20386) 2021-11-16 08:19:09 +09:00
matthewdeng
35dc3cf21b
[train] fix Train/Tune integration on Client (#20351)
* [train] fix Train/Tune integration on Client

* remove force_on_current_node
2021-11-15 14:36:33 -08:00
Alex Wu
884bb3de33
[Dataset] Bump numpy >=1.20 dependency (#20374)
* done?

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-11-15 14:10:00 -08:00
Kai Fricke
d191ad2de8
[ci/release] Return exit codes based on different errors (#20289) 2021-11-15 19:41:00 +00:00
Simon Mo
72ae22e82b
[CI] Fix frontend build issue (#20375) 2021-11-15 10:12:43 -08:00
Kai Fricke
91920f1d02
[release/xgboost] xgboost release test fixes via app config (#20325)
* [xgboost] Fix release test app configs

* Revert full app config

* Update base docker image

* Only change cpu base image

* default

* Pin xgboost to 1.5. in cpu tests

* Remove numpy hack

* Revert one line

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-15 10:03:21 -08:00
Amog Kamsetty
ef7967476c
[Train] Torch data transfer automatic conversion (#20333)
* update

* formatting

* fix failures

* fix session tests

* address comments

* add to api docs

* package refactor

* wip

* wip

* wip

* finish

* finish

* fix

* comment

* fix

* install horovod for docs

* address comment

* Update python/ray/train/session.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/train/torch.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* try fix docs

* fix doc build failure

* wip

* fix

* fix

* fix

* try fix doc highlighting

* fix docs

* finish

* formatting

* address comments and fix tests

* address comments and fix test

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-15 09:14:12 -08:00
Lixin Wei
85dbda8cf1
[Core] Fix Crash in Debug Log (#20322)
* fix crash in debug log

* fix

* fix

* fix
2021-11-16 00:45:00 +09:00
Lixin Wei
b7e35acf14
[RuntimeEnv] Raise RuntimeEnvSetupError when Actor Creation Failed due to It (#19888)
* ray_pkg passed

* fix

* fix typo

* fix test

* fix test

* fix test

* fix

* draft

* compile OK

* lint

* fix

* lint

* fix ci

* Update src/ray/gcs/gcs_server/gcs_actor_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* remove comment

* rename

* resolve conflict

* use unique ownership

* use DestroyActor instead of ReconstructActor

* fix segmentation fault

* fix crash in debug log

* Revert "fix crash in debug log"

This reverts commit 8f0e3d37f062b664d8d0e07c6c1a9a715b8ba1ee.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-11-15 07:43:35 -08:00
Will Drevo
fa878e2d4d
Added example to user guide for cloud checkpointing (#20045)
Co-authored-by: will <will@anyscale.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-11-15 15:43:06 +00:00
Sven Mika
6ff4061f3a
[RLlib] Issue 20269: Offline RL example not working due to new_obs not being written to file. (#20366)
* wip.

* Apply suggestions from code review
2021-11-15 16:41:08 +01:00
Amog Kamsetty
a74cf7ff1c
[Train] Torch Prepare utilities (#20254)
* update

* formatting

* fix failures

* fix session tests

* address comments

* add to api docs

* package refactor

* wip

* wip

* wip

* finish

* finish

* fix

* comment

* fix

* install horovod for docs

* address comment

* Update python/ray/train/session.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/train/torch.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* try fix docs

* fix doc build failure

* fix

* fix

* fix

* try fix doc highlighting

* fix docs

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-15 07:34:17 -08:00
matthewdeng
ed3cbe48f5
[train][xgboost][release] fix ml_user_tests using ray client (#20345) 2021-11-15 15:24:23 +00:00
Kai Fricke
4300039d01
[ci/release] Display commit hash in buildkite overview (#20323) 2021-11-15 10:09:04 +00:00
Sven Mika
5b1c8e46e1
[RLlib] POC: Deprecate build_policy (policy template) for torch only; PPOTorchPolicy (#20061) 2021-11-15 10:41:54 +01:00