This change adds new unit tests and error message to _store_package_in_gcs(). In particular, it tests the function's behavior when it fails to connect to the GCS.
Use spot instances for chaos tests.
We can also experiment with other tests that don't suppose to have dead nodes, but let's do it once the nightly infra is stabilized
Currently, the github issue templates are quite verbose, and as a result few developers end up using them. Clean them up so that they can be the standard workflow for all Ray contributors.
What: Long running tests should use sdk file manager
Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.
What: Raise meaningful exceptions when invalid parameters are passed.
Why: We want to catch invalid parameters and guide users to use the API in the correct way.
Adds a dynamic property to easily obtain `config` dict from `Result`, extends the `ResultGrid.get_best_config` method for parity with `ExperimentAnalysis.get_best_trial` (allows for using of mode and metric different to the one set in the Tuner).
Currently Datasets primitives repartition, groupby, sort, and random_shuffle all use different internal shuffle implementations. This PR unifies them on a single internal ShuffleOp class. This class exposes static methods for map and reduce which must be implemented by the specific higher-level primitive. Then the ShuffleOp.execute method implements a simple pull-based shuffle by submitting one map task per input block and one reduce task per output block.
Closes#23593.
If rsync/ssh is not available (as in kubernetes setups), Tune previously had no fallback mechanism to synchronize trial directories to the driver. This PR introduces a `RemoteTaskSyncer` trial syncer that uses ray remote tasks to ship file contents between nodes. The implementation utilizes tarfile to compress files for transfer, and it only transfers files that differ between the source and target directory to minimize network bandwidth usage.
The trial syncer works as follows:
1. It collects information about existing files in the target directory. This directory could be remote (when syncing up) or local (when syncing down).
2. It then schedules a `pack` task on the source node. This will always be a remote task so the future can be passed to the unpack task. The pack task will only pack files that are not existent or different in the target directory into a tarfile, which is returned as a bytes string
3. An `unpack` task in scheduled on the target node. This will always be a remote task so the future can be awaited in a call to `wait()`
A test is added to ensure that only modified files are transferred on subsequent sync ups/downs.
Finally, minor changes are made to the `Syncer`/`NodeSyncer` classes to allow passing `(ip, path)` tuples rather than rsync-style remote paths.
Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests:
(1) Uses a local `working_dir`
(2) Uses a remote working_dir from a zip github URL.
Updates @ray.method error message to match the one for @ray.remote. Since the client mode version of ray.method is identical to the regular ray.method, deletes the client mode version and drops the client_mode_hook decorator (guessing that the client copy was added before client_mode_hook was introduced).
Also fixes what I'm guessing is a bug that doesn't allow both num_returns and concurrency_group to be specified at the same time (assert len(kwargs) == 1).
Closes#23271
Copied from #22571:
Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, currently we spill at most the object fusion size when instead we should be spilling at least the fusion size. Then we use the max number of fused objects as a cap.
This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects, and we are under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.
Increases some test timeouts to allow tests to pass.
Initial draft of the interface for HuggingFaceTorchTrainer.
One alternative for limiting the number of datasets in datasets dict would be to have the user pass train_dataset and validation_dataset as separate arguments, though that would be inconsistent with TorchTrainer.
placement_group_test_5 is flakey. Reason is requesting PG with exact object store memory as node. If object store has object, then PG scheduling fails.
Also fix bug - typo.
* Uniformly distributed tasks among actors to utilize full concurrency
* Added test to ensure all tasks are launched at the same time
* Applied linting format
```
src/ray/common/test/ray_syncer_test.cc:495: Failure
| Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5
```
This is measuring number of request send. For extreme case, they should equal. This PR fixed this.
Adding a FAQ page. Currently has some basic questions that have come up in the past.
Explaining how to use Matplotlib due to threading in the distributed training function.
While working on https://github.com/ray-project/ray/pull/20577 we noticed `requests` module is not blacked listed in minimal install test, but not sure why. As a result we missed coverage on P0 issue like https://github.com/ray-project/ray/issues/20574.
This is an attempt to see what would happen if we blacklist it and if we're able to get any signals from CI.
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Experiment tags are not always rendered in a sane way for all operating systems. For instance, a config of
```
"a": tune.choice([(3, 4), (5, 6)]),
"b": tune.choice([[7, 8], [6, 5]]),
```
will lead to an experiment dir like `lambda_53737_00000_0_a=_3, 4_,b=[7, 8]_2022-04-02_10-21-27/`. This can lead to problems with utilities such as gsutil (which misinterprets some characters as wildcards, see #23670), but also with e.g. MacOS which doesn't like `[` brackets in filenames.
This PR adds an improvement to the `_clean_value` function used to sanitize values. We specify a valid alphabet which includes a limited set of characters that is broadly usable in most operating systems. We also simplify the `format_vars` function - even though it was previously a bit more sophisticated in handling list items, this was error-prone, and can be replaced in favor of a better readable and simpler implementation that yields the same results in almost all cases.
Add bazel platform plugin for ray setup deps.
It will fail to build java related package on ubuntu lastest (ubuntu 20)/mac lastest 11.x version since bazel tools put a wrong platform verion in its deps, so all of users might get such exception
```
ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1[25](https://github.com/ray-project/mobius/runs/5273958213?check_suite_focus=true#step:5:25)5c5f5cefe240bb7613/external/bazel_tools/src/conditions/BUILD:61:15: no such target '@platforms//cpu:riscv64': target 'riscv64' not declared in package 'cpu' defined by /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/platforms/cpu/BUILD and referenced by '@bazel_tools//src/conditions:linux_riscv64'
INFO: Repository remote_coverage_tools instantiated at:
/DEFAULT.WORKSPACE.SUFFIX:3:13: in <toplevel>
Repository rule http_archive defined at:
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:[31](https://github.com/ray-project/mobius/runs/5273958213?check_suite_focus=true#step:5:31): in <toplevel>
INFO: Repository com_google_absl instantiated at:
/__w/mobius/mobius/streaming/WORKSPACE:16:15: in <toplevel>
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:217:22: in ray_deps_setup
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:76:24: in auto_http_archive
Repository rule http_archive defined at:
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/jdk/BUILD:90:11: errors encountered resolving select() keys for @bazel_tools//tools/jdk:jni
```
The bazel dev suggests us to update platform mannually in this issue : https://github.com/bazelbuild/bazel/issues/14097.
It's to say that we reuse the old platforms plugin then fail to select a true jni setting on mips64 or riscv64 instruction if we don't download the new platform.
Co-authored-by: lingxuan.zlx <lingxuan.zlx@antgroup.com>
`api.py` has accumulated classes and functions that aren't purely public APIs, causing circular dependencies. This change pulls `Deployment` and deployment graph-related features out of `api.py` and puts them in two new files: `deployment.py` and `deployment_graph.py`.
Adds some metrics useful for object-intensive workloads:
Per raylet/object manager:
Add num bytes pending restore to spill manager
Add num requests cumulative to PullManager
Num bytes pushed/pulled from other nodes cumulative
Histogram for request latencies in PullManager:
total life time of request, from start to cancel
request satisfaction time, from start to object local
pull time, from object activation to object local
Per-node disk read/write speed, IOPS