Commit graph

10433 commits

Author SHA1 Message Date
DK.Pino
2c41936a39
[Placement Group] [Test] Fix pg.ready hang forever when gcs restarting (#20063)
* fixed

* lint

* fix comment

* revert previous fix code
2021-11-10 00:53:42 -08:00
Tao Wang
507bd9186b
[Core]Make conversion between ray/grpc status more specific (#20047)
* [Core]Make conversion between ray/grpc status more specific

* per comments

* lint

* per comments

* use ABORT instead of UNKNOWN, add some tests

* lint

* lint
2021-11-10 00:48:05 -08:00
SangBin Cho
3bae6b94b3
[test] Fix flaky chaos_test.py (#20202)
* Fix

* fix lint
2021-11-10 00:23:55 -08:00
Edward Oakes
5475bb054c
[job submission] Redirect stdout + stderr to a single log file (#20208) 2021-11-09 22:34:12 -08:00
Jiajun Yao
5ffa0bb01f
Listen on 127.0.0.1 if node ip is 127.0.0.1 (#20190) 2021-11-09 20:24:05 -08:00
Sungho Joo
dc51af798c
[RLlib] Minor fix on json encoding during worker sampling (#20134)
* import custom json encoder from util and improve encoder default function

* linting
2021-11-09 16:46:41 -08:00
matthewdeng
33af739bf2
[train] add placement group support (#20091)
* [train] add placement group support

* fix additional resources

* fix tests

* add comment to add_workers
2021-11-09 16:36:07 -08:00
Edward Oakes
f6399e3389
[job submission] Remove jobs intermediate directory for logs (#20192) 2021-11-09 16:20:40 -08:00
Kim Pevey
82a5bf68fa
[Docs] Add note for multi-node on Windows (#20184)
* add note for multi-node on Windows

* update message

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-11-09 16:02:01 -08:00
Guyang Song
6bb0659010
[core][bugfix] fix print ref count for an erased iterator (#20138) 2021-11-09 14:50:25 -08:00
Edward Oakes
39b3eb9763
[serve] Don't halt main control loop due to exceptions in snapshot logic (#20151) 2021-11-09 14:46:15 -08:00
Simon Mo
215f47bc53
[CI] Move Serve nightly tests to a separate suite (#20194)
So we can run them via separate cronjobs
2021-11-09 13:22:50 -08:00
Zyiqin-Miranda
333d0b43fd
[autoscaler] AWS Autoscaler CloudWatch Integration (#18619) 2021-11-09 11:48:55 -08:00
Kim Pevey
bbacb6d828
[Docs] Streaming MapReduce: Remove breaking city (#20187) 2021-11-09 11:15:57 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
b0550aa440
[Core] Fix the named actor get or create race condition (#20126)
* Fix done.

* Fixed.

* clean up

* Done
2021-11-09 02:27:54 -08:00
Tao Wang
60df705b4e
[Cpp]Get next job id globally instead of random selecting (#20102)
## Why are these changes needed?

## Related issue number
Final part of #13984

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-09 15:46:57 +08:00
Edward Oakes
c04e5af1eb
[job submission] Rename log files to job-driver-{job_id}.{out,err} (#20170) 2021-11-08 23:10:56 -08:00
Edward Oakes
50f2cf8a74
[job submission] Allow passing job_id, return DOES_NOT_EXIST when applicable (#20164) 2021-11-08 23:10:27 -08:00
Jiao
d46caa9856
[job submission] Remove test_utils dependency (#20168)
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-11-08 23:08:43 -08:00
Lingxuan Zuo
97259e33b2
Relink grpc/absl for streaming.so (#20136)
To avoid exporting third-party library symbols globally, these absl/grpc libs are linked into _streaming.so.

Side-effect:
Static variables might be uninitialized if core worker lib and streaming lib both use them.
2021-11-09 14:13:53 +08:00
SangBin Cho
5c4fb4dc91
[Core]Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
Stephanie Wang
ffcc5935d7
[core] Evict lineage to bound memory usage (#19946)
* bound lineage

* Bound lineage in bytes

* test

* Lineage evicted error

* Lineage evicted

* lint

* test

* test

* comment

* doc

* x

* x

* x

* x
2021-11-08 21:53:40 -08:00
architkulkarni
e5e62d8991
[runtime env] Fix runtime env conda test and enable it in CI (#20121) 2021-11-08 18:33:19 -08:00
Lixin Wei
8e666ca1e9
[Core] Fix Used Memory Calculation (#20127)
* fix memory

* fix
2021-11-08 17:36:32 -08:00
Kai Fricke
9c2b8c8501
[tune] Deprecate DurableTrainable (#19880) 2021-11-08 20:56:07 +00:00
Amog Kamsetty
f8430e6eca
[CI] Pin shortuuid to fix CI (#20153) 2021-11-08 12:08:32 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00
Amog Kamsetty
b1f24768a1
[Tune] More fixes to PTL Tutorial (#20065)
* ptl-fix-2

* improve

* fix
2021-11-08 09:13:44 -08:00
Gagandeep Singh
31812d026c
Bumped time limit for test_worker_startup_count in test_basic_3.py (#20056)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-11-08 09:02:28 -08:00
Sven Mika
eea6b40a3e
[RLlib] Minor cleanups in Trainer; better tf/tf2 info messages about possible tracing speedups. (#20109) 2021-11-08 15:37:27 +01:00
Kai Yang
e84391d1d3
[Core] Encode job ID in randomized task IDs for user-created threads (#19320)
## Why are these changes needed?

Currently, `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't include the job ID. In this case, subsequent non-actor tasks and return values, and objects created by `ray.put()`, don't include the job ID either. This makes it hard to find the correct job ID from a task or object ID.

This PR updates the task ID generation code to always encode the job ID.

A side effect of this PR is a higher probability of task ID collisions in user-created threads, because the job ID part is now fixed. Without this PR: `sqrt(pi * 256 ^ 12 / 2)` ~= 352 trillion tasks. With this PR: `sqrt(pi * 256 ^ 8 / 2)` ~= 5 billion tasks. But this should be OK because the job ID part of task IDs in non-user-created threads is always fixed, so it won't be worse than in non-user-created threads.
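
The two figures quoted above follow from the birthday-problem estimate `sqrt(pi * N / 2)`, where `N` is the number of possible random ID values. A minimal sketch that reproduces the arithmetic, assuming 12 and 8 random bytes as implied by the `256 ^ 12` and `256 ^ 8` terms (the function name is hypothetical and not part of Ray):

```python
import math


def expected_ids_before_collision(random_bytes: int) -> float:
    """Birthday-problem estimate of how many uniformly random IDs are
    drawn, on average, before the first collision.

    Uses sqrt(pi * N / 2) with N = 256 ** random_bytes.
    """
    space = 256 ** random_bytes  # number of distinct random values
    return math.sqrt(math.pi * space / 2)


# Assumption: 12 random bytes without the PR, 8 random bytes with it,
# taken from the formulas in the commit message.
without_job_id = expected_ids_before_collision(12)
with_job_id = expected_ids_before_collision(8)

print(f"without job ID encoded: {without_job_id:.3e}")
print(f"with job ID encoded:    {with_job_id:.3e}")
```

Giving up 4 random bytes shrinks the ID space by a factor of `256 ^ 4`, so the expected number of tasks before a collision drops by the square root of that, a factor of 65536.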

## Related issue number

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-08 21:00:40 +08:00
qicosmos
547bfbc4a4
[core]Simplify filesystem (#18941) 2021-11-08 17:54:07 +08:00
Qing Wang
f9d94f51aa
Revert "[Java] Skip javadoc when deploying. (#19428)" (#20137)
This reverts commit 1047914ee0.
2021-11-08 15:53:31 +08:00
Qing Wang
6d8a7291ab
Add getNamespace API for Java worker (#20057)
[Java API] Add getNamespace API for Java worker.
2021-11-08 15:51:14 +08:00
Linsong Chu
e189d8d4bc
[workflow] fix s3 storage path (#20115)
## Why are these changes needed?

Fixes two path-related issues when S3 is used as the storage backend:
1. A leading slash is added to the path due to the behavior of `parse.urlparse`.
2. When `step_id=""`, double slashes are added to the path.

Details are explained in https://github.com/ray-project/ray/issues/20114
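
Both issues can be reproduced with the standard library alone. The sketch below is an illustration, not the code from this PR: the bucket name, path, and helper names (`split_s3_url`, `join_key`) are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key prefix).

    urlparse keeps the leading "/" in .path, so it is stripped here
    to obtain a key prefix without a leading slash.
    """
    parsed = urlparse(url)
    return parsed.netloc, parsed.path.lstrip("/")


def join_key(*parts: str) -> str:
    """Join key components with "/", skipping empty ones so that an
    empty component (such as step_id="") does not produce "//"."""
    cleaned = (p.strip("/") for p in parts)
    return "/".join(p for p in cleaned if p)


# Hypothetical URL used only for illustration.
url = "s3://my-bucket/workflows/run1"
raw_path = urlparse(url).path          # keeps the leading slash
bucket, prefix = split_s3_url(url)     # leading slash removed

naive = "/".join([prefix, "", "output"])  # empty step_id yields "//"
safe = join_key(prefix, "", "output")     # empty step_id is skipped

print(raw_path, bucket, prefix, naive, safe, sep="\n")
```

Stripping the slash once at parse time and filtering empty components at join time keeps the two concerns separate, which is one possible way to avoid both problems.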

## Related issue number

https://github.com/ray-project/ray/issues/20114
https://github.com/ray-project/ray/issues/19027
2021-11-07 15:57:33 -08:00
xwjiang2010
99826d2ca6
[Release] Increase node memory by 2X in many_ppo test. (#19591) 2021-11-08 08:10:09 +09:00
Jiajun Yao
e110d958a1
Support different s3 url formats (#20133) 2021-11-07 14:58:51 -08:00
Jules S. Damji
e6343f0e69
Fixed a broken code snippet with a missing method (#20130)
Signed-off-by: Jules S.Damji <jules@anyscale.com>

Co-authored-by: Jules S.Damji <jules@anyscale.com>
2021-11-08 07:56:32 +09:00
dependabot[bot]
adf39941f4
[data](deps): Bump dask[complete] (#20125)
Bumps [dask[complete]](https://github.com/dask/dask) from 2021.9.1 to 2021.11.0.
- [Release notes](https://github.com/dask/dask/releases)
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](https://github.com/dask/dask/compare/2021.09.1...2021.11.0)

---
updated-dependencies:
- dependency-name: dask[complete]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-07 11:55:39 -08:00
Alex Wu
45d7ef7c08
[windows][ci] Skip test_multi_node_failure_2 (#20117) 2021-11-07 09:17:46 -08:00
Sven Mika
76f8a9f125
[RLlib; testing] Increase size of two time-out'ing test cases from medium to large. (#20128) 2021-11-06 21:48:28 +01:00
Jiao
9ef75b27ac
[Job Submission] Add stop API to http & sdk, with better status code + stacktrace (#20094) 2021-11-06 12:37:54 -05:00
SangBin Cho
7a18d90a25
Revert "[scheduler] Update local object store usage (#20026)" (#20118)
This reverts commit 7e013366ac.
2021-11-06 07:34:27 -07:00
SangBin Cho
f65cc72b4c
Revert "Set default max_pending_lease_requests_per_scheduling_category to 10 (#19924)" (#20124)
This reverts commit 0d850f3302.
2021-11-05 23:35:30 -07:00
Yi Cheng
6a6cc434ba
[nightly] Remove grpc staging test since nightly is stable #20119 (#20119) 2021-11-05 21:36:58 -07:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
Alex Wu
81194f5660
[workflow][docs] Fix api comparison formatting (#20069)
## Why are these changes needed?

The API comparison formatting uses \`code\`, which is rendered as italics rather than as code. This PR puts the code in code blocks instead.
## Related issue number

## Checks
2021-11-05 17:05:35 -07:00
mwtian
4d70ce1c86
[Core][Pubsub] add worker failure message to gcs pubsub (#20075)
## Why are these changes needed?
This is to demonstrate the steps needed to add a GCS pubsub channel, with GCS publisher and C++ subscribers subscribing via GCS client. For new channels, a unit test exercising the publishing and subscribing logic should also be added to `gcs_client_test.cc`.


## Related issue number
2021-11-05 14:52:49 -07:00
Jiajun Yao
0d850f3302
Set default max_pending_lease_requests_per_scheduling_category to 10 (#19924) 2021-11-05 14:24:20 -07:00