```
src/ray/common/test/ray_syncer_test.cc:495: Failure
| Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5
```
This assertion measures the number of requests sent. In the extreme case the two sides can be equal, which makes the strict `<` comparison fail. This PR fixes that.
Adding a FAQ page. Currently has some basic questions that have come up in the past.
Explains how to use Matplotlib given the threading in the distributed training function.
While working on https://github.com/ray-project/ray/pull/20577 we noticed that the `requests` module is not blacklisted in the minimal install test, and we are not sure why. As a result we missed coverage on P0 issues like https://github.com/ray-project/ray/issues/20574.
This is an attempt to see what would happen if we blacklist it and if we're able to get any signals from CI.
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Experiment tags are not always rendered in a sane way for all operating systems. For instance, a config of
```
"a": tune.choice([(3, 4), (5, 6)]),
"b": tune.choice([[7, 8], [6, 5]]),
```
will lead to an experiment dir like `lambda_53737_00000_0_a=_3, 4_,b=[7, 8]_2022-04-02_10-21-27/`. This can lead to problems with utilities such as gsutil (which misinterprets some characters as wildcards, see #23670), but also with e.g. macOS, which doesn't like `[` brackets in filenames.
This PR improves the `_clean_value` function used to sanitize values. We specify a valid alphabet: a limited set of characters that is broadly usable on most operating systems. We also simplify the `format_vars` function. It was previously a bit more sophisticated in handling list items, but this was error-prone, and it can be replaced by a simpler, more readable implementation that yields the same results in almost all cases.
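As a rough illustration of the alphabet-based sanitization described above, here is a minimal sketch. The names `sanitize_value` and `format_tag` and the exact character set are assumptions made for this example; they are not the actual `_clean_value` / `format_vars` implementations.

```python
import re

# Assumption: an illustrative "safe" alphabet (letters, digits, "_" and "-").
# The alphabet actually chosen in the PR may differ.
_UNSAFE_RUN = re.compile(r"[^a-zA-Z0-9_-]+")


def sanitize_value(value) -> str:
    """Render a config value as a string that is safe in a directory name."""
    # Every run of characters outside the alphabet collapses to one underscore.
    return _UNSAFE_RUN.sub("_", str(value))


def format_tag(resolved_vars: dict) -> str:
    """Join key=value pairs into an experiment tag (hypothetical helper)."""
    return ",".join(
        f"{key}={sanitize_value(value)}" for key, value in resolved_vars.items()
    )


print(format_tag({"a": (3, 4), "b": [7, 8]}))
```

With this sketch, brackets, parentheses, commas, and spaces never reach the file system, at the cost of tuples and lists rendering identically.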
Add bazel platform plugin for ray setup deps.
Building the Java-related packages fails on the latest Ubuntu (Ubuntu 20) and the latest macOS (11.x) because bazel tools put a wrong platform version in its deps, so users may hit the following error:
```
ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/src/conditions/BUILD:61:15: no such target '@platforms//cpu:riscv64': target 'riscv64' not declared in package 'cpu' defined by /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/platforms/cpu/BUILD and referenced by '@bazel_tools//src/conditions:linux_riscv64'
INFO: Repository remote_coverage_tools instantiated at:
/DEFAULT.WORKSPACE.SUFFIX:3:13: in <toplevel>
Repository rule http_archive defined at:
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
INFO: Repository com_google_absl instantiated at:
/__w/mobius/mobius/streaming/WORKSPACE:16:15: in <toplevel>
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:217:22: in ray_deps_setup
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:76:24: in auto_http_archive
Repository rule http_archive defined at:
/github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/jdk/BUILD:90:11: errors encountered resolving select() keys for @bazel_tools//tools/jdk:jni
```
The bazel developers suggest updating the platforms dependency manually in this issue: https://github.com/bazelbuild/bazel/issues/14097.
In other words, if we keep reusing the old platforms plugin instead of downloading the new one, bazel fails to select a valid jni setting for the mips64 or riscv64 instruction sets.
Co-authored-by: lingxuan.zlx <lingxuan.zlx@antgroup.com>
`api.py` has accumulated classes and functions that aren't purely public APIs, causing circular dependencies. This change pulls `Deployment` and deployment graph-related features out of `api.py` and puts them in two new files: `deployment.py` and `deployment_graph.py`.
Adds some metrics useful for object-intensive workloads:
* Per raylet/object manager:
  * Add num bytes pending restore to spill manager
  * Add num requests cumulative to PullManager
  * Num bytes pushed/pulled from other nodes cumulative
  * Histogram for request latencies in PullManager:
    * total life time of request, from start to cancel
    * request satisfaction time, from start to object local
    * pull time, from object activation to object local
* Per-node disk read/write speed, IOPS
* Make default memory 1
* Add test to validate that ReplicaConfig's default memory cannot be lower than minimum
* Add a new option to memory_omitted_options
* Update if branch in test_replica_config_default_memory_minimum
* Make memory default value None
We use `tarfile` to pack/unpack directories in several locations. Instead of using temporary files, we can use `io.BytesIO` to avoid unnecessary disk writes.
Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).
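A minimal sketch of the in-memory approach, assuming gzip compression and the hypothetical helper names `pack_dir` and `unpack_dir` (the helpers in the three modules may be named and parameterized differently):

```python
import io
import os
import tarfile


def pack_dir(source_dir: str) -> bytes:
    """Pack the contents of source_dir into a gzipped tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        # Add each top-level entry under its own name so that archive paths
        # are relative to source_dir. tar.add() recurses into directories.
        for entry in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, entry), arcname=entry)
    return buffer.getvalue()


def unpack_dir(data: bytes, target_dir: str) -> None:
    """Unpack an archive produced by pack_dir into target_dir."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(target_dir)
```

The archive only ever exists as a `bytes` object, so it can be sent over the network or stored in the object store without touching the local disk on the packing side.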
Current logic looks broken, as reported in #22954 (comment)
I fixed the logic as best I can and tested it on the Anyscale platform with a GPU. gpustat reported no process info, but the logic works in this case.
"ResourceRequest" now uses 2 containers: a vector for predefined resources, and a map for custom resources.
This was intended to be a perf optimization. However, in practice, this makes the code more complex, and, moreover, prevents optimizations for some methods (e.g., "ResourceIds", "Size").
This PR removes the vector and makes ResourceRequest use only one map for all resources. Also, "ResourceIds" now returns a "boost::range" to allow iterating over resource IDs without having to construct temporary sets.
A microbenchmark shows a slight perf improvement.
Last nightly: `placement group create/removal per second 837.76 +- 16.68`.
This PR: `placement group create/removal per second 895.76 +- 16.99`.
There are a few changes:
1. Between runner thread and main thread: The same stacktrace is raised in `_report_thread_runner_error` in the main thread, so we can skip this raise in the runner thread.
2. Between function runner and Tune driver: Do not wrap RayTaskError in TuneError.
3. Within Tune driver code: Introduce an `error.pkl` for each errored trial and use it to populate ResultGrid.
Plus some cleanups to facilitate propagating exceptions in runner and executor code.
Final stacktrace looks like: (omitted)
In Tune, we capture `traceback.format_exc` at the time the exception is caught and just pass the string around. This PR changes that only for the case where a RayTaskError is raised: there we pass the exception object around instead.
It may be worthwhile to settle on a general practice for error handling in Tune.
I am also curious to learn how other Ray libraries handle this and whether there are good lessons to learn.
In particular, we should watch out for memory leaks in exception handling. I am not sure if this is still a problem in Python 3, but here is an article I came across for reference:
https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
This PR fixes the issue of diverging documentation between the Ray Docs and the readmes of ecosystem libraries, which live in separate repos (e.g. xgboost_ray). This is achieved by adding an extra step before the docs build process starts that downloads the readmes of specified ecosystem libraries from their GitHub repositories. The files are then preprocessed by a very simple parser to allow for differences between GitHub and Docs Markdown.
In summary, this makes the markdown files in ecosystem library repositories single sources of truth and removes the need to manually keep the doc pages up to date, all the while allowing for differences between what's rendered on GitHub and in the Docs.
See ray-project/xgboost_ray#204 & https://ray--23505.org.readthedocs.build/en/23505/ray-more-libs/xgboost-ray.html for an example.
Needs ray-project/xgboost_ray#204 and ray-project/lightgbm_ray#30 to be merged first.
As discussed in #23424, the synch=True mode of PopulationBasedTrainingScheduler is (1) not compatible with burn_in_period and (2) causes the presence of TERMINATED trials to hang PAUSED trials indefinitely.
This change addresses (1) by setting the initial _next_perturbation_sync to the max of burn_in_period and perturbation_interval in the constructor and (2) by checking only whether live trials have reached the _next_perturbation_sync before resuming PAUSED trials.
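The two fixes can be sketched in isolation as follows. This is a simplified illustration, not the scheduler code: the function names, the `(status, last_train_time)` tuples, and the rule that "live" means "not TERMINATED" are assumptions made for the example.

```python
from typing import Iterable, Tuple


def initial_sync_point(burn_in_period: float, perturbation_interval: float) -> float:
    """First synchronization point: never earlier than the burn-in period."""
    return max(burn_in_period, perturbation_interval)


def live_trials_reached_sync(
    trials: Iterable[Tuple[str, float]], next_perturbation_sync: float
) -> bool:
    """Return True if every live trial has trained up to the sync point.

    Each trial is a (status, last_train_time) pair. TERMINATED trials are
    skipped so that they cannot block PAUSED trials from being resumed.
    """
    live = [(status, t) for status, t in trials if status != "TERMINATED"]
    return all(t >= next_perturbation_sync for _, t in live)


trials = [("PAUSED", 4.0), ("TERMINATED", 2.0), ("RUNNING", 4.0)]
print(initial_sync_point(burn_in_period=10, perturbation_interval=4))
print(live_trials_reached_sync(trials, next_perturbation_sync=4.0))
```

Without the filter on TERMINATED trials, the second call would return `False` forever, because the terminated trial can never reach the sync point.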
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic) discussed offline, the GcsResourceScheduler on the GCS side should be unified with ClusterResourceScheduler.
There is already a big PR (#23268) to do this, but to make review easier, I will split it into two or more small PRs.
This is [3/n]:
Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_policy
Delete gcs_resource_scheduler
Refactor gcs_resource_scheduler_test to cluster_resource_scheduler_2_test
BTW: The interface inside ISchedulingPolicy should be refactored in another PR, see the discussion #23323 (comment)
To be clear:
Scorer-related code is moved out of gcs_resource_scheduler into scorer.h/.cc with no logic changes.
Policy-related code is moved out of gcs_resource_scheduler into bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
Some code inside gcs_placement_group_scheduler.h/.cc is changed to adapt to the new data structures (SchedulingResult and SchedulingContext)
Support filtering tests by test attr regex filters. Multiple filters can be specified with one line for each filter. The format is attr:regex (e.g. team:serve)
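A minimal sketch of how such `attr:regex` filters could be parsed and applied. The function names, the dict-based test representation, the use of a full-string match, and the rule that all filters must match are assumptions for illustration; the actual tool may behave differently.

```python
import re
from typing import Dict, List, Tuple


def parse_filters(text: str) -> List[Tuple[str, str]]:
    """Parse one `attr:regex` filter per non-empty line."""
    filters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so the regex itself may contain colons.
        attr, _, pattern = line.partition(":")
        filters.append((attr.strip(), pattern.strip()))
    return filters


def matches(test: Dict[str, str], filters: List[Tuple[str, str]]) -> bool:
    """Assumption: a test is selected only if every filter matches its attribute."""
    return all(
        re.fullmatch(pattern, str(test.get(attr, ""))) is not None
        for attr, pattern in filters
    )


filters = parse_filters("team:serve\nname:.*smoke.*")
tests = [
    {"name": "serve_smoke_test", "team": "serve"},
    {"name": "core_smoke_test", "team": "core"},
]
print([t["name"] for t in tests if matches(t, filters)])
```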