Commit graph

11456 commits

Author SHA1 Message Date
Stephanie Wang
73f078236f
[doc] Update docs about actor garbage collection (#20763)
Update outdated actor docs about when actors are GCed.
2022-02-28 18:45:29 -08:00
Jian Xiao
7597f1590b
[Dataset] fix some comments (#22700) 2022-02-28 17:13:43 -08:00
Jiaxin Shan
32829ff9ad
[KubeRay] Provide a new Dockerfile for fast build (#22689)
Adds a new Dockerfile for fast build and development of KubeRay.
2022-02-28 17:09:16 -08:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00
Chris K. W
fa6b3c7c89
[aws][autoscaler] fix regional default AMIs (#22506)
The AMIs for `ray.head.default` and `ray.worker.default` in `defaults.yaml` supersede the default AMI for the region (defaults get merged in before `_check_ami` is called, which causes problems if the region isn't us-west-2). This removes the default AMI from `defaults.yaml` and aborts if the user doesn't specify an AMI in a region without a default.
2022-02-28 15:52:57 -08:00
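A minimal sketch of the abort-if-no-default behavior described above; `DEFAULT_AMI` and the config layout are assumptions standing in for the autoscaler's actual internals, with only `_check_ami` taken from the description:
```python
# Illustrative sketch only; DEFAULT_AMI and the config layout are assumptions.
DEFAULT_AMI = {"us-west-2": "ami-0123456789abcdef0"}  # hypothetical regional defaults

def _check_ami(config: dict) -> None:
    region = config["provider"]["region"]
    for node_type, spec in config["available_node_types"].items():
        if not spec["node_config"].get("ImageId"):
            default = DEFAULT_AMI.get(region)
            if default is None:
                # No user-specified AMI and no known default for this region: abort.
                raise ValueError(
                    f"No AMI specified for node type {node_type!r} and no default "
                    f"AMI is known for region {region!r}; please set ImageId."
                )
            spec["node_config"]["ImageId"] = default
```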
jon-chuang
3bc0858a4f
[Core/GCS] remove default 100 concurrent rate limit for heartbeat (#22613)
Improves scalability.

Closes https://github.com/ray-project/ray/issues/20773
2022-02-28 15:26:05 -08:00
SangBin Cho
2c1184592e
mark threaded actor test unstable (#22696) 2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702 2022-02-28 14:00:18 -08:00
Simon Mo
fe3d501d68
[Core] Include java worker log with log monitor (#22629) 2022-02-28 12:30:04 -08:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixes special cases (e.g. smoke tests, long-running tests) in the release test package infrastructure and prepares the migration of the Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
SangBin Cho
ba4f1423c7
Revert "Support creating a DatasetPipeline windowed by bytes (#22577)" (#22695)
This reverts commit b5b4460932.
2022-02-28 11:56:12 -08:00
Jiaxin Shan
82daf2b041
[KubeRay] Remove configmap reference in example (#22688)
A follow-up change to #22348.

The example is not up to date, and we cannot bring up the cluster due to the missing ConfigMap. The autoscaler is able to convert the CR to an autoscaler config, so we don't need the ConfigMap anymore.
2022-02-28 10:13:08 -08:00
SangBin Cho
08374e8af4
Revert "[core] Fix bug in fusion for spilled objects (#22571)" (#22694)
The reverted commit makes two tests flaky.
2022-02-28 10:11:14 -08:00
Kai Fricke
e84e967932
[ml] Add basic Ray ML interfaces (#22436)
This PR adds the basic shared Ray ML interfaces.
2022-02-28 13:16:40 +01:00
Jialing He
aa1885ae2a
[runtime env] Make plugin setup processes that have not been refactored run in threads. (#22588)
I recently realized that during runtime_env creation, a plugin/manager that is very slow to set up may block the creation of other runtime_envs, so this change makes plugin/manager setup run in threads.

[The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.
2022-02-28 17:33:13 +08:00
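A rough sketch of the threading idea from the commit above; the plugin classes and setup signature here are hypothetical stand-ins, not Ray's actual runtime_env agent code:
```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder plugins; the real setup hooks live in the
# runtime_env agent and have different signatures.
class WorkingDirPlugin:
    def setup(self, runtime_env: dict) -> None:
        ...  # e.g. downloads a large working_dir (potentially slow)

class EnvVarsPlugin:
    def setup(self, runtime_env: dict) -> None:
        ...  # fast

def setup_runtime_env(runtime_env: dict, plugins: list) -> None:
    # Run each plugin's setup in its own thread so one slow plugin does not
    # block the setup of the others (or of other runtime envs).
    with ThreadPoolExecutor(max_workers=len(plugins)) as pool:
        futures = [pool.submit(p.setup, runtime_env) for p in plugins]
        for f in futures:
            f.result()  # re-raise any setup errors
```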
Jun Gong
22bc451102
[RLlib] Fix a memory leak in SimpleReplayBuffer that completely kills sampling throughput (#22678) 2022-02-28 09:28:04 +01:00
Jialing He
98a69cbd90
[runtime env][strong-typed API] Combine ParsedRuntimeEnv and RuntimeEnv into ray.runtime.RuntimeEnv (#22522)
Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495

- The new `RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and the old `RuntimeEnv`.
- The new `RuntimeEnv` will be exposed directly to the user.
- example:
```python
runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://working_dir.zip",
        pip=["requests"],
        java_jars=["s3://jar1.zip"],
        java_jvm_options=["-Dxxx=xxx"])
```
2022-02-28 16:18:10 +08:00
Qing Wang
9572bb717f
[RuntimeEnv] Support setting actor level env vars for Java worker (#22240)
This PR supports setting actor-level env vars for the Java worker in the runtime env.
The general API looks like:
```java
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
    .addEnvVar("KEY1", "A")
    .addEnvVar("KEY2", "B")
    .addEnvVar("KEY1", "C")  // This overwrites "KEY1" with "C".
    .build();

ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
```

If `num-java-workers-per-process` > 1, a worker process will never be reused unless the runtime envs are the same.

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-02-28 10:58:37 +08:00
Lingxuan Zuo
94caac8722
Remove exporting symbols (#22623)
To hide the symbols of third-party libraries, this pull request reuses the internal namespace, which can be imported by any external native project without side effects.

In addition, we suggest that contributors use third-party libraries within Ray scopes/namespaces, and that only `ray::internal` be exported.

More details in https://github.com/ray-project/ray/pull/22526

Mobius has applied this change in https://github.com/ray-project/mobius/pull/28.

Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>
2022-02-28 09:41:10 +08:00
mopga
6f68c74a5d
Use GPUtil for GPU detection when available (#18938)
In environments with K8s and SELinux enabled there is a bug:
"/proc/nvidia/" is not allowed to be mounted in the container.
So this reworks GPU detection to use the GPUtil package.



Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru>
Co-authored-by: Julius <juliustfrost@gmail.com>
2022-02-27 14:54:35 -08:00
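A minimal sketch of detection via GPUtil, assuming the package is installed; the fallback behavior here is an assumption, not Ray's actual resource-detection code:
```python
def detect_num_gpus() -> int:
    # Prefer GPUtil when it is importable: it queries nvidia-smi instead of
    # relying on NVIDIA entries under /proc, which SELinux may block in K8s.
    try:
        import GPUtil
    except ImportError:
        return 0  # in practice, fall back to the existing detection path
    return len(GPUtil.getGPUs())
```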
Max Pumperla
372c620f58
[docs] Tune overhaul part II (#22656)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-02-26 23:07:34 -08:00
Jiao
25d60d9cc9
[3/X][Pipeline] Handle deployment handle replacement in DeploymentNode init args, support nested (#22646)
- Moved all `Deployment` instance creation to the `DeploymentNode` level, with only relevant info passed into it from `generate.py`. This abstraction makes more sense and is less leaky.
- In `DeploymentNode`, we leverage Ray Core DAG's `_PyObjScanner` to find and replace only Deployment nodes' init args & kwargs with deployment handles, which is specific to the `Deployment` instance, not `DeploymentNode` itself. However, this is the simplest and most robust way to handle nested args at the `DAGNode` level.
  - This implementation lives at the Ray Core `DAGNode` level so we don't need to expose `_PyObjScanner` directly.
- Added serve pipeline tests to BUILD CI.
2022-02-26 09:57:59 -06:00
Eric Liang
b5b4460932
Support creating a DatasetPipeline windowed by bytes (#22577) 2022-02-25 23:31:10 -08:00
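A hedged usage sketch of windowing by bytes; the `bytes_per_window` parameter name is an assumption based on the commit title, and this change was later reverted in #22695 above:
```python
import ray

ds = ray.data.range(100_000)
# Hypothetical: build a pipeline whose windows hold roughly 1 GiB each,
# rather than a fixed number of blocks.
pipe = ds.window(bytes_per_window=1024 * 1024 * 1024)
```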
SangBin Cho
1cedb1b6e4
[Test] Increase timeout for microbenchmark (#22655) 2022-02-25 17:29:12 -08:00
Eric Liang
ae16aa1dba
Add some sanity checks for memory use in dataset (#22642) 2022-02-25 16:59:12 -08:00
Simon Mo
4bf587f7ff
[Serve] make client poll more frequently (#22666) 2022-02-25 14:56:18 -08:00
Stephanie Wang
0da541bb71
[core] Fix bug in fusion for spilled objects (#22571)
Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues.

However, the current logic always spills once we reach the end of the spillable objects or once we've reached the fusion threshold. This can produce lots of unfused objects if they are created concurrently with the spill.

This PR changes the spill logic: once we reach the end of the spillable objects, if the last batch of spilled objects is under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.
2022-02-25 13:24:05 -08:00
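An illustrative pseudocode sketch of the spill decision described above; the real logic is in the C++ object manager, and the names here are assumptions:
```python
def spill_objects(spillable_objects, fusion_threshold_bytes, num_pending_spills, spill):
    """Batch spillable objects, fusing small ones up to the fusion threshold.

    `spill` is a hypothetical callable that spills one fused batch.
    """
    batch, batch_bytes = [], 0
    for obj in spillable_objects:
        batch.append(obj)
        batch_bytes += obj.size
        if batch_bytes >= fusion_threshold_bytes:
            spill(batch)
            batch, batch_bytes = [], 0
    # Final, under-threshold batch: spill it now only if no other spills are in
    # flight; otherwise wait for the pending spills to finish and re-evaluate,
    # which avoids producing lots of unfused objects while preserving liveness.
    if batch and num_pending_spills == 0:
        spill(batch)
```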
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Edward Oakes
65313621ab
[codeowners] Swap joeybai for edoakes as snapshot codeowner (#22660) 2022-02-25 12:55:07 -06:00
Archit Kulkarni
31332f8930
[serve] [release tests] Add health check grace period for 1k deployment (#22651) 2022-02-25 12:13:44 -06:00
shrekris-anyscale
8548affdc2
Increase test_failed_job_status timeout in test_job_submission (#22643)
`test_job_submission` has become [flakey](https://flakey-tests.ray.io/) due to timeout. This change increases the timeout in `test_failed_job_status` from 10 to 25 seconds.
2022-02-25 10:08:55 -08:00
Stephanie Wang
634ca9afdb
[core] Cleanup handling for nondeterministic object size during transfer (#22639)
Currently object transfers assume that the object size is fixed. This is a bad assumption during failures, especially with lineage reconstruction enabled and tasks with nondeterministic outputs.

This PR cleans up the handling and hopefully guards against two cases where the object size may change during a transfer:
1. The object manager's size information does not match the object in the local plasma store (due to async notifications). --> the object manager overwrites its own information if it finds that the physical object has a different size.
2. The receiver's created buffer size does not match the sender's object size. --> the receiver destroys the previous buffer and creates a new buffer with the correct size. This might cause some transient errors but eventually object transfer should succeed.

Unfortunately I couldn't trigger this from Python because it depends on some pretty specific timing conditions. However, I did add some unit tests for case 2 (this is the majority of the PR).
2022-02-25 09:39:14 -08:00
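A simplified sketch of case 2 on the receiver side; the chunk and buffer shapes here are hypothetical, since the actual handling lives in the C++ object manager:
```python
def receive_chunk(buffers, object_id, declared_size, offset, data):
    """Write an incoming chunk, tolerating a changed declared object size."""
    buf = buffers.get(object_id)
    if buf is not None and len(buf) != declared_size:
        # The sender's object size no longer matches the buffer created earlier:
        # destroy the previous buffer and start over with the correct size.
        del buffers[object_id]
        buf = None
    if buf is None:
        buf = bytearray(declared_size)
        buffers[object_id] = buf
    buf[offset:offset + len(data)] = data
```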
xwjiang2010
62b2c26041
[tune] increase timeout for ray_trial_executor_test. (#22658) 2022-02-25 08:39:19 -08:00
Antoni Baum
d5284a740c
[tune] Remove Trainable.update_resources (#22471) 2022-02-25 08:38:34 -08:00
xwjiang2010
d4a1bc7bc7
Revert "[runtime env] runtime env inheritance refactor (#22244)" (#22626)
Breaks train_torch_linear_test.py.
2022-02-25 08:42:30 -06:00
shrekris-anyscale
e85540a1a2
[serve] Expose deployment statuses in REST API (#22611) 2022-02-25 08:41:07 -06:00
Chen Shen
89aaa79ee9
[resource scheduler] unify the GetBestSchedulableNode into one public method. (#22560)
* clean up cluster resource scheduler

* address comments

* always prioritize local node when spilling back waiting tasks

* address comments
2022-02-25 01:09:21 -08:00
Dmitri Gekhtman
b2b442297e
[autoscaler] Fix initialization artifacts (#22570)
This PR fixes initialization artifacts related to the load metrics summary and the autoscaler summary.

Load metrics summaries are defined to be Falsey if the autoscaler has never received a resource message from the GCS.
We skip most autoscaler actions if load metrics is Falsey, because it doesn't make sense to autoscale without load metrics. This also allows us to execute the TODO here: #22348 (comment) and remove the time.wait().

As for the autoscaler summary, it is possible for autoscaler.summary() to error outside of an autoscaler update in this scenario:
The very first call to NodeProvider.non_terminated_nodes fails, self.non_terminated_nodes remains a None object, and autoscaler.summary() fails trying to get an attribute of this None object.
The result is a confusing error message, as in #22515. This PR fixes that.

Closes #22515
2022-02-24 20:05:44 -08:00
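A small sketch of the Falsey convention described above; the class and method names are assumptions, not the actual autoscaler code:
```python
class LoadMetricsSummary:
    """Hypothetical stand-in: Falsey until the first GCS resource message arrives."""

    def __init__(self):
        self.received_first_resource_message = False

    def __bool__(self) -> bool:
        return self.received_first_resource_message

def run_autoscaler_update(autoscaler, load_metrics_summary):
    if not load_metrics_summary:
        # No resource data from the GCS yet, so autoscaling decisions would be
        # meaningless; skip most of the update.
        return
    autoscaler.update(load_metrics_summary)  # hypothetical update entry point
```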
Simon Mo
bfb619a127
[xlang] Allow Python to call overloaded methods with differing number of parameters (#21410) 2022-02-24 16:51:38 -08:00
Archit Kulkarni
1165f99b0b
[CI] disable Serve microbenchmark k8s (#22631) 2022-02-24 16:50:06 -08:00
Yi Cheng
de76d86bcb
[nightly] Stop GCS HA related nightly test (#22636)
Since we've already turned it on on master, we should stop these tests for now.
2022-02-24 16:40:08 -08:00
ZhuSenlin
5efeb6534b
[Core] Bug fix about FixedPoint (#22584)
* Fix FixedPoint::operator-(double const d)

* add unit test

* remove FixedPoint(uint32_t i)

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-02-24 15:44:21 -08:00
Jiao
3c707f70cc
[2/X][Pipeline] Add python generation for ClassNode (#22617)
- Added the backbone of the Ray DAG -> Serve DAG transformation and deployment extraction.
- Added util functions for unique deployment name generation, `ray_actor_options`, replacement of `DeploymentNode` with a deployment handle, etc.
2022-02-24 16:01:35 -06:00
Jun Gong
a385c9b127
[RLlib] Update bandit_envs_recommender_system (#22421) 2022-02-24 22:43:41 +01:00
Simon Mo
3d3218d153
[CI] Add K8s Builder Step (#22035) 2022-02-24 13:11:38 -08:00
Sven Mika
526fd6b5fb
[RLlib] Issue 22444: KL-coeff not stored in persistent policy state. (#22590) 2022-02-24 22:05:36 +01:00
Siyuan (Ryans) Zhuang
8f4f3cb79b
Make shellcheck optional 2022-02-24 12:04:05 -08:00
Eric Liang
533a0440a6
Improve actor pool support in Datasets (#22574) 2022-02-24 12:01:36 -08:00
Amog Kamsetty
02cb974c6c
[Train] Fix fault tolerance for Tensorflow (#22508)
Soft restarts don't work for TensorFlow since there is still some leftover communication state in the actors, which may lead to undefined behavior, such as training hanging.

Instead, this PR changes the failure handling for TensorFlow to match Torch and Horovod, and recreates all the workers in case of failure. It also adds a test to check that fault tolerance works correctly for an actual TensorFlow example. When testing locally, the test failed before the change but passes after it.
2022-02-24 11:50:20 -08:00
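A high-level sketch of that failure-handling strategy; the helper names are hypothetical, since the actual logic lives in Ray Train's backend executor:
```python
def handle_worker_failure(backend_executor, train_func):
    # TensorFlow can keep stale collective-communication state across a soft
    # restart, which may hang training. So instead of restarting only the failed
    # worker, tear down and recreate the whole worker group (as Torch and
    # Horovod already do), then rerun the training function.
    backend_executor.shutdown_workers()         # hypothetical helper
    backend_executor.start_workers()            # hypothetical helper
    backend_executor.run_training(train_func)   # hypothetical helper
```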