Commit graph

10204 commits

Author SHA1 Message Date
SangBin Cho
857f23652f
Add more shuffle tests to CI (#17684)
* IP

* done

* done
2021-11-02 08:07:59 -07:00
SangBin Cho
563eb0bca2
[Runtime env] Add a test to make sure resource deadlock message is not printed when waiting for workers (#19870)
* ip

* Add a runtime env resource deadlock msg test

* Fix a bug

* Skip on windows
2021-11-02 07:48:55 -07:00
Sven Mika
2d24ef0d32
[RLlib] Add all simple learning tests as framework=tf2. (#19273)
* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0;
however, in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so we may have to rerun
all of the pendulum v1 benchmarks, or use another
environment instead. The same applies to frozen
lake v0 and frozen lake v1.

Lastly, all of the RLlib tests and Tune tests have
been moved to Python 3.7.

* fix tune test_sampler::testSampleBoundsAx

* fix re-install ray for py3.7 tests

Co-authored-by: avnishn <avnishn@uw.edu>
2021-11-02 12:10:17 +01:00
Will Drevo
97f04b118d
[RLlib; Docs] Added fixes to CartPole example. (#19908)
* Added fixes to CartPole example

* Apply suggestions from code review

Co-authored-by: will <will@anyscale.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
2021-11-02 10:06:39 +01:00
Simon Mo
6040319d02
[CI] Pin aiohttp version to fix master branch (#19948) 2021-11-01 23:00:08 -07:00
Qing Wang
da6894848d
Support Java namespace APIs (#19468)
## Why are these changes needed?

## Related issue number
#16474
2021-11-02 11:05:40 +08:00
Kai Yang
a33466e905
[Core] Fail inflight tasks on actor restarting (#19354)
## Why are these changes needed?

If an actor failover is triggered but the RPC connection between the caller and the crashed actor instance is not disconnected automatically, subsequent tasks sent to the new actor instance may not be executed. The root cause is that the sequence numbers of tasks sent to the new actor instance do not start from 0. Details can be found in #14727.

This PR fixes it by ensuring all inflight actor tasks fail immediately when actor failover is detected (via actor state notifications).
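
As a rough illustration of the idea only (not Ray's actual implementation), the sketch below models caller-side bookkeeping for one actor. The class `ActorTaskQueue`, its fields, and the choice to reset the counter to 0 on restart are all hypothetical assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ActorTaskQueue:
    """Hypothetical caller-side bookkeeping for one actor (not a Ray class)."""

    next_seq_no: int = 0
    inflight: dict = field(default_factory=dict)  # seq_no -> task name
    failed: list = field(default_factory=list)

    def submit(self, task_name: str) -> int:
        """Assign the next sequence number and record the task as in flight."""
        seq_no = self.next_seq_no
        self.inflight[seq_no] = task_name
        self.next_seq_no += 1
        return seq_no

    def on_actor_restarting(self) -> None:
        """Fail every in-flight task at once.

        Assumption of this sketch: numbering then restarts from 0 so the
        new actor instance sees the sequence it expects.
        """
        self.failed.extend(self.inflight.values())
        self.inflight.clear()
        self.next_seq_no = 0


queue = ActorTaskQueue()
queue.submit("a")
queue.submit("b")
queue.on_actor_restarting()  # stands in for an actor state notification
first_after_restart = queue.submit("c")
```

In this toy model, tasks `a` and `b` are failed as soon as the restart notification arrives, and the next submitted task is numbered from the start again.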

## Related issue number

closes #14727
2021-11-02 11:03:12 +08:00
Yi Cheng
a907168184
[core] Fix wrong local resource view in raylet (#19911)
## Why are these changes needed?
When the GCS broadcasts a node resource change, the raylet also uses it to update the local node, which makes the local node instance and nodes_ inconsistent.

1. The local node has used all of some pg resource
2. The GCS broadcasts node resources
3. The local node now appears to have resources
4. The scheduler picks the local node
5. The local node can't schedule the task
6. Since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs
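
The steps above can be mimicked with a small toy model. This is a sketch under stated assumptions, not the raylet's code: `ClusterView`, `on_broadcast`, and the `skip_local` flag are hypothetical names, and ignoring broadcasts about the local node is just one possible remedy, which the commit message does not confirm.

```python
class ClusterView:
    """Hypothetical stand-in for a raylet's view of available resources."""

    def __init__(self, local_node_id: str, local_available: int):
        self.local_node_id = local_node_id
        self.available = {local_node_id: local_available}

    def on_broadcast(self, node_id: str, available: int, skip_local: bool) -> None:
        """Apply a broadcast resource update, optionally ignoring the local node."""
        if skip_local and node_id == self.local_node_id:
            return  # assumption: the local node tracks its own usage itself
        self.available[node_id] = available

    def can_schedule_locally(self, demand: int) -> bool:
        """Return what the scheduler believes about the local node."""
        return self.available[self.local_node_id] >= demand


def schedulable_after_stale_broadcast(skip_local: bool) -> bool:
    view = ClusterView("local", local_available=0)  # step 1: resource used up
    view.on_broadcast("local", available=4, skip_local=skip_local)  # steps 2-3
    return view.can_schedule_locally(demand=1)  # step 4: scheduler's belief
```

With `skip_local=False` the stale broadcast makes the scheduler believe the local node has capacity, which is the precondition for the loop between steps 4 and 6; with `skip_local=True` the local view stays at zero.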

## Related issue number
#19438
2021-11-01 19:52:03 -07:00
xwjiang2010
c48d86e469
[CI] change git protocol to use https. (#19964) 2021-11-01 19:38:58 -07:00
Amog Kamsetty
3a52187da8
[Release/Lightning] Add Ray lightning user test (#19812)
* wip

* wip

* add ray lightning test

* fix

* update

* merge and add

* fix

* fix

* rename

* autoscale

* add tblib

* gloo backend

* typo

* upgrade torch

* latest and master
2021-11-01 18:29:48 -07:00
Amog Kamsetty
474e44f7e0
[Release/Horovod] Add user test for Horovod (#19661)
* infra

* wip

* add test

* typo

* typo

* update

* rename

* fix

* full path

* formatting

* reorder

* update

* update

* Update release/horovod_tests/workloads/horovod_user_test.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* bump num_workers

* update installs

* try

* add pip_packages

* min_workers

* fix

* bump pg timeout

* Fix symlink

* fix

* fix

* cmake

* fix

* pin filelock

* final

* update

* fix

* Update release/horovod_tests/workloads/horovod_user_test.py

* fix

* fix

* separate compute template

* test latest and master

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-01 18:28:07 -07:00
matthewdeng
e1e4a45b8d
[train] add simple Ray Train release tests (#19817)
* [train] add simple Ray Train release tests

* simplify tests

* update

* driver requirements

* move to test

* remove connect

* fix

* fix

* fix torch

* gpu

* add assert

* remove assert

* use gloo backend

* fix

* finish

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-01 18:25:19 -07:00
Jiajun Yao
05c63f0208
[workflow] Mark workflow test_recovery as large test (#19950)
## Why are these changes needed?
move test_recovery to large test
## Related issue number
2021-11-01 15:50:38 -07:00
Sven Mika
0b308719f8
[RLlib; Docs overhaul] Docstring cleanup: rllib/utils (#19829) 2021-11-01 21:46:02 +01:00
Sven Mika
bab9c0f670
[RLlib; Docs overhaul] Redo: Docstring cleanup: Trainer, trainer_template, Callbacks."" (#19830) 2021-11-01 21:45:11 +01:00
Alex Wu
80fb3f10ae
[ci] Script for building M1 wheels (#19925)
This PR includes a script for building wheels for Macs with M1 processors. It roughly follows the pattern of the other scripts, with a few differences:

* Manually installs nvm

* Uses miniforge conda to install python/pip instead of python foundation .pkgs

* Doesn't pin numpy (we probably shouldn't be pinning it in the other scripts either...)

* Commit detection falls back to git instead of erroring

All of these changes were made so that the script works on a laptop, which comes with a subset of the dependencies that the x86 buildkite image comes with.
2021-11-01 11:44:59 -07:00
xwjiang2010
1803ca13b6
Adding release logs for 1.8.0. (#19867) 2021-11-01 10:26:04 -07:00
Hao Zhang
a03c4363b5
[Collective] Allow send/recv partial tensors in Send/Recv primitives (#19921) 2021-11-01 10:25:43 -07:00
Edward Oakes
ee57025be6
[serve] Rename BackendConfig -> DeploymentConfig (#19923) 2021-11-01 10:24:02 -07:00
Sven Mika
ea2bea7e30
[RLlib; Docs overhaul] Docstring cleanup: Offline. (#19808) 2021-11-01 10:59:53 +01:00
Tao Wang
7a2e9e00e8
[Tiny]Remove duplicated assignment (#19866) 2021-11-01 11:44:01 +08:00
mwtian
cb8dc5c94e
Fix unused import warning in streaming.proto (#19912)
## Why are these changes needed?
This generates a warning when calling `protoc` on the proto.

## Related issue number
2021-10-31 13:29:51 -07:00
architkulkarni
702bffe072
[runtime env] [test] Enable runtime env nightly test with working_dir reconnection (#19906) 2021-10-31 10:48:48 -05:00
architkulkarni
de8a9b5151
[runtime env] Always print package pushing logs regardless of size (#19897) 2021-10-31 10:47:37 -05:00
Edward Oakes
e507b7ba6e
[serve] Rename BackendVersion -> DeploymentVersion (#19798) 2021-10-31 10:27:19 -05:00
Chen Shen
961742f8e7
[Core] deflake windows test failure (test_task_retry_mini_integration) #19916 2021-10-30 15:13:38 -07:00
xwjiang2010
4d293c4cee
Increase horovod_test disk space. (#19917) 2021-10-30 14:41:31 -07:00
Sven Mika
4d945fe651
[RLlib] Issue 19878: Re-instate bare_metal_policy example script (#19881) 2021-10-30 12:50:39 -07:00
Stephanie Wang
630a8cacb3
Revert "[core] Fail objects when pull/reconstruction hangs (#19789)" (#19904)
This reverts commit e6d60d7376.
2021-10-30 10:54:39 -07:00
Kim Pevey
3ff4fde0f5
[Doc] Update newsreader example (#19893) 2021-10-29 22:25:40 -07:00
Kim Pevey
8aa61566fa
[Doc] Example docs minor wording fixes (#19890) 2021-10-29 22:15:35 -07:00
Kim Pevey
96480d97d6
[DOC] Minor typos/fixes to Tips for First Timers (#19887)
* fix typos

* some more fixes

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-10-29 22:13:15 -07:00
mwtian
7afdfdc6dd
[CI] narrow down tests that run when files change (#19656) 2021-10-29 16:47:54 -07:00
mwtian
d32facdef8
[Doc][Bazel] add comment to not use Bazel test result cache (#19842)
To avoid confusion in the future, add a comment explaining why Ray does not use the Bazel test result cache.
2021-10-29 16:46:22 -07:00
chenk008
57363995f3
[runtime env] Move container related code to runtime env (#19067) 2021-10-29 16:31:11 -07:00
Jiao
bb0ebb7903
[job submission] Temporarily make pydantic imports conditional (#19827) 2021-10-29 18:09:18 -05:00
Gagandeep Singh
f549e528c7
Bumped time limit in test_cancel::test_comprehensive (#19871) 2021-10-29 15:51:49 -07:00
SangBin Cho
99b5932d06
Add a simple node failure integration test + clean up spammy logs upon node failures (#19695)
* .

* Done

* clean up

* lint

* fix a bug

* lint

* fix issue

* Remove no-op from StartRayLog

* Addressed code review.
2021-10-29 18:42:35 -04:00
architkulkarni
16d3afc665
[serve] Base autoscaling decisions on target num replicas, not current num replicas (#19869) 2021-10-29 17:03:53 -05:00
Eric Liang
456d73754a
[data] Initial pass at support multiple-block returns for read and transform tasks (#19660) 2021-10-29 14:21:56 -07:00
SangBin Cho
f2b831f50f
[Placement Group] Fix the implicit value change from uint32_t -> uint64_t for pg scheduling retry (#19882)
* .

* done

* done
2021-10-29 12:16:53 -07:00
Philipp Moritz
0a5942d8b0
[Documentation] Fix quotes for windows installations (#19859)
* [Documentation] Fix quotes for windows installations

* update

* formatting
2021-10-29 10:54:38 -07:00
Lixin Wei
1fe9f3372e
[Nightly Test] Remove duplicate printing code (#19874)
## Why are these changes needed?

Remove duplicate printing code
2021-10-29 10:19:19 -07:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we had:
```cpp
class XxxAccessor {};
class ServiceBasedXxxAccessor : public XxxAccessor {};

class GcsClient {};
class ServiceBasedGcsClient : public GcsClient {};
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx classes and moves their implementations into the base classes.

Now we only have:
```cpp
class XxxAccessor {};
class GcsClient {};
```
2021-10-29 10:16:14 -07:00
Gagandeep Singh
9460a5375b
Added retry logic in test_basic::test_ray_options (#19832)
* Added retry logic in test_ray_options

* Applied linting format

* Made test consistent
2021-10-29 10:15:12 -07:00
architkulkarni
fdefd875c3
[Doc] [runtime env] Move runtime env section up one level, add inbound links (#19863) 2021-10-29 12:02:39 -05:00
SangBin Cho
4586ced5e4
Limit the max number of resource usage print (#19828)
* done

* done

* addressed code review

* done
2021-10-29 07:24:14 -07:00
Edward Oakes
bf23a31017
[job submission] Always generate and return job_id (#19851) 2021-10-29 09:09:54 -05:00
SangBin Cho
16dcff4091
[Core/RuntimeEnv] Fix runtime environment hanging issues. (#19823)
* done

* Add a right test

* Fix unit tests

* fix issues
2021-10-29 07:01:56 -07:00
Kai Fricke
fa0158abe5
[tune] Cloud checkpointing release tests (#19638) 2021-10-29 12:12:01 +02:00