This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.
Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand.
If this seems useful we can extend this to other examples gradually.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
This PR move the poller and broadcaster from gcs server to ray syncer.
TODO in next PR: deprecate the code path of placement group resource reporting and move the broadcaster out of gcs cluster resource manager.
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure.
This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.
Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit.
This adds a slight performance penalty for actor startup.
This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath. The test checks that `to_dict` and `from_dict` are inverses. This PR contains all fixes required to make the test pass.
This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several magnitudes.
### Performance Changes
A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling.
| Run | Produce (s) | Consume (s) | Total (s) |
| -- | -- | -- | -- |
| Baseline (original) | 347.332281 | 355.611272 | 705.560750 |
| Baseline (w/ fix) | 181.815852 | 347.692850 | 532.847759 |
| No fusing (original) | 453.574554 | 525.047998 | 981.620108 |
| No fusing (w/ fix) | 452.614848| 519.787698 | 975.412639 |
The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to have a 48% time reduction with negligible impact on runtime when fusing is disabled.
See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
This is useful in combining multiple applied groups produced by groupby().map_groups() into a single one. For example, builder = BlockBuilder.for_block(type(batch)), and then for each applied group, builder.add_block(applied_group).
The AMI's for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults get merged in before _check_ami is called, causes problems if region isn't us-west-2). Removes the default AMI from defaults.yaml, and aborts if user doesn't specify an AMI in a region without a default.
A follow up change of #22348
example is not up to date and we can not bring up the cluster due to missing configmap. Autoscaler is able to convert CR to autoscaler config so we don't need configmap anymore.
I recently realized that during a runtime_env creation process, a plugin/manager that is very slow to setup may block the creation of other runtime_env, so I make plugin/manager setup run in threads.
[The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.
Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495
- The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`.
- The `new RuntimeEnv` will be exposed directly to the user.
- example:
```python
runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip",
pip=["requests"],
java_jars=["s3://jar1.zip"],
java_jvm_options=["-Dxxx=xxx"])
```
This PR supports setting actor level env vars for Java worker in runtime env.
General API looks like:
```java
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
.addEnvVar("KEY1", "A")
.addEnvVar("KEY2", "B")
.addEnvVar("KEY1", "C") // This overwrites "KEY1" to "C"
.build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
```
If `num-java-workers-per-process` > 1, it will never reuse the worker process except they have the same runtime envs.
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
To hidden symbols of thirdparty library, this pull request reuses internal namespace that can be imported by any external native projects without side effects.
Besides, we suggest all of contributors to make sure it'd better use thirdparty library in ray scopes/namspaces and only ray::internal should be exported.
More details in https://github.com/ray-project/ray/pull/22526
Mobius has applied this change in https://github.com/ray-project/mobius/pull/28.
Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>
In Envs with K8S and enabled SELinux there is a bug:
"/proc/nvidia/" is not allowed to mount in container
So, i made a rework for GPU detection based on GPutil package.
## Checks
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Release tests
Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru>
Co-authored-by: Julius <juliustfrost@gmail.com>
- Moved all `Deployment` instance creation to `DeploymentNode` level with only relevant info passed into it from `generate.py`. This abstraction makes more sense and less leaky.
- In `DeploymentNode`, we leverage ray core DAG's `_PyObjScanner` to find and replace only Deployment nodes init args & kwargs to deployment handle, which is only specific to `Deployment` instance, but not `DeploymentNode` itself. However this is the simplest and most robust way to handle nested args at `DAGNode` level.
- This implementation lives in ray core DAGNode level so we don't need to expose `_PyObjScanner` directly.
- Added serve pipeline tests to BUILD CI.