
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	dc994dbb02	[tune] Add RemoteTask based sync client (#23605 ) If rsync/ssh is not available (as in kubernetes setups), Tune previously had no fallback mechanism to synchronize trial directories to the driver. This PR introduces a `RemoteTaskSyncer` trial syncer that uses ray remote tasks to ship file contents between nodes. The implementation utilizes tarfile to compress files for transfer, and it only transfers files that differ between the source and target directory to minimize network bandwidth usage. The trial syncer works as follows: 1. It collects information about existing files in the target directory. This directory could be remote (when syncing up) or local (when syncing down). 2. It then schedules a `pack` task on the source node. This will always be a remote task so the future can be passed to the unpack task. The pack task will only pack files that are not existent or different in the target directory into a tarfile, which is returned as a bytes string. 3. An `unpack` task is scheduled on the target node. This will always be a remote task so the future can be awaited in a call to `wait()`. A test is added to ensure that only modified files are transferred on subsequent sync ups/downs. Finally, minor changes are made to the `Syncer`/`NodeSyncer` classes to allow passing `(ip, path)` tuples rather than rsync-style remote paths.	2022-04-05 21:35:25 +01:00
Archit Kulkarni	582bf4e8f8	Add basic jobs release test with Tune script (#23474 ) Adds basic jobs release tests that connect to the test cluster and run a basic Tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests: (1) Uses a local `working_dir` (2) Uses a remote `working_dir` from a GitHub zip URL.	2022-04-05 13:31:11 -05:00
Chris K. W	9b79048963	Update error message for @ray.method (#23471 ) Updates @ray.method error message to match the one for @ray.remote. Since the client mode version of ray.method is identical to the regular ray.method, deletes the client mode version and drops the client_mode_hook decorator (guessing that the client copy was added before client_mode_hook was introduced). Also fixes what I'm guessing is a bug that doesn't allow both num_returns and concurrency_group to be specified at the same time (assert len(kwargs) == 1). Closes #23271	2022-04-05 11:12:55 -07:00
Stephanie Wang	1c972d5d2d	[core] Spill at least the object fusion size instead of at most (#22750 ) Copied from #22571: Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, currently we spill at most the object fusion size when instead we should be spilling at least the fusion size. Then we use the max number of fused objects as a cap. This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects, and we are under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved. Increases some test timeouts to allow tests to pass.	2022-04-05 10:57:42 -07:00
Antoni Baum	ca6dfc8bb7	[AIR] Interface for `HuggingFaceTorchTrainer` (#23615 ) Initial draft of the interface for HuggingFaceTorchTrainer. One alternative for limiting the number of datasets in datasets dict would be to have the user pass train_dataset and validation_dataset as separate arguments, though that would be inconsistent with TorchTrainer.	2022-04-05 10:32:13 -07:00
liuyang-my	bdd3b9a0ab	[Serve] Unified Controller API for Cross Language Client (#23004 )	2022-04-05 09:20:02 -07:00
Sven Mika	434265edd0	[RLlib] Examples folder: All `training_iteration` translations. (#23712 )	2022-04-05 16:33:50 +02:00
jon-chuang	9c950e8979	[Core] Placement Group: Fix Flakey Test placement_group_test_5 and Typo (#23350 ) placement_group_test_5 is flaky because it requests a placement group with exactly as much object store memory as the node has. If the object store already holds an object, placement group scheduling fails. Also fixes a typo.	2022-04-05 05:33:43 -07:00
Gagandeep Singh	11baa22c1e	Split test_advanced_n.py and enabled cluster tests (#23524 )	2022-04-05 01:34:57 -07:00
Gagandeep Singh	8c87117bc3	Uniformly distributed tasks among actors to utilize full concurrency (#23416 ) * Uniformly distributed tasks among actors to utilize full concurrency * Added test to ensure all tasks are launched at the same time * Applied linting format	2022-04-05 01:05:41 -07:00
Matti Picus	96948a4a30	WINDOWS: skip flaky test (#23557 ) Continuation of #23462 to try to get test_ray_init to pass consistently in CI. The skipped test passes locally, so only skip it on CI.	2022-04-05 00:56:43 -07:00
Steven Morad	39841b65b3	[RLlib] PPOTorchPolicy: Remove extra call to `model.value_function` (#23671 )	2022-04-05 08:40:29 +02:00
mesjou	e725472b5b	[RLlib] Fix bug in prisoner's dilemma example. (#23690 )	2022-04-05 08:36:20 +02:00
Jiajun Yao	5f37231842	Remove yapf dependency (#23656 ) Yapf has been replaced by black.	2022-04-04 21:50:04 -07:00
Clark Zinzow	08159eb668	[Datasets] Disallow callable classes for task compute strategy. (#23708 )	2022-04-04 21:12:36 -07:00
Yi Cheng	99ca8ee8e4	[flaky] Deflaky `ray_syncer_test` (#23703 ) ``` src/ray/common/test/ray_syncer_test.cc:495: Failure \| Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5 ``` This check measures the number of requests sent. In the extreme case, the two values can be equal. This PR fixes this.	2022-04-04 19:38:58 -07:00
Siyuan (Ryans) Zhuang	ae86fb258e	[workflow] Fix workflow continuation resolving (#23682 ) * update test * return StaticWorkflowRef * reformat test	2022-04-04 17:39:24 -07:00
Amog Kamsetty	4530349506	[AIR] Set name of Trainable to match with Trainer #23697	2022-04-04 16:23:21 -07:00
matthewdeng	a12f5ff5d6	[train] add FAQ (#22757 ) Adding a FAQ page. Currently has some basic questions that have come up in the past. Explaining how to use Matplotlib due to threading in the distributed training function.	2022-04-04 16:14:35 -07:00
Jiajun Yao	a668e5d8db	Add perf metrics for stress tests (#23648 ) Added perf metrics for stress tests so they can be alerted on.	2022-04-05 08:09:27 +09:00
shrekris-anyscale	4aaa895137	[runtime_env] Reorganize tests in test_runtime_env_working_dir_2.py and test_runtime_env_working_dir_3.py (#23618 )	2022-04-04 17:35:49 -05:00
Kai Fricke	99a2aa013f	[ci] Remove existing artifacts pre-command with docker (#23655 ) Previously, pre-existing artifacts were not deleted pre-command because of permission issues. This can be fixed by running the remove command in another docker container. Seems to work well here: https://buildkite.com/ray-project/ray-builders-pr/builds/28683#322c7a9d-cba7-4c23-8b00-7ebc6144a777	2022-04-04 15:22:04 -07:00
Kai Fricke	b3b1498eba	[tune] Beautify Optional typehints (#23692 ) What: Changes `Union[None, type1, ..., typeN]` type hints to `Optional[type1, ..., typeN]` Why: Better readability, consistency across library, consistency with code style guides.	2022-04-04 19:48:34 +01:00
Edward Oakes	09123e3452	[serve][minor] Remove "statuses" key from `serve status` output (#23642 )	2022-04-04 11:11:26 -05:00
Jiao	ff6515b5a3	Remove `requests` from blacklist of minimal install test (#20584 ) While working on https://github.com/ray-project/ray/pull/20577 we noticed that the `requests` module is not blacklisted in the minimal install test, but we are not sure why. As a result we missed coverage on P0 issues like https://github.com/ray-project/ray/issues/20574. This is an attempt to see what would happen if we blacklist it and if we're able to get any signals from CI. Co-authored-by: Jiao Dong <jiaodong@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-04-04 16:15:58 +01:00
Kai Fricke	40a8183e05	[ci/release] Fix job-based file download (#23657 ) have to wrap download call in a lambda to be compatible with run_with_retry	2022-04-04 08:06:31 -07:00
Kai Fricke	7e0c63ab9c	[tune] Simplify experiment tag formatting, clean directory names (#23672 ) Experiment tags are not always rendered in a sane way for all operating systems. For instance, a config of ``` "a": tune.choice([(3, 4), (5, 6)]), "b": tune.choice([[7, 8], [6, 5]]), ``` will lead to an experiment dir like `lambda_53737_00000_0_a=_3, 4_,b=[7, 8]_2022-04-02_10-21-27/`. This can lead to problems with utilities such as gsutil (which misinterprets some characters as wildcards, see #23670), but also with e.g. MacOS which doesn't like `[` brackets in filenames. This PR adds an improvement to the `_clean_value` function used to sanitize values. We specify a valid alphabet which includes a limited set of characters that is broadly usable in most operating systems. We also simplify the `format_vars` function - even though it was previously a bit more sophisticated in handling list items, this was error-prone, and can be replaced in favor of a better readable and simpler implementation that yields the same results in almost all cases.	2022-04-04 16:05:47 +01:00
Andrew Bauer	3e7c8231a8	Apply 'Incorrect pickles for subclasses of generic classes #448 ' from cloudpickle (#22553 ) Co-authored-by: Chen Shen <scv119@gmail.com>	2022-04-04 00:06:39 -07:00
Lingxuan Zuo	e7ad617d6a	[Bazel]ray deps import latest bazel platform (#23653 ) Add the Bazel platforms plugin to the Ray setup deps. Building Java-related packages fails on the latest Ubuntu (Ubuntu 20) and the latest macOS 11.x because bazel tools puts a wrong platform version in its deps, so users may hit an exception like this: ``` ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/src/conditions/BUILD:61:15: no such target '@platforms//cpu:riscv64': target 'riscv64' not declared in package 'cpu' defined by /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/platforms/cpu/BUILD and referenced by '@bazel_tools//src/conditions:linux_riscv64' INFO: Repository remote_coverage_tools instantiated at: /DEFAULT.WORKSPACE.SUFFIX:3:13: in <toplevel> Repository rule http_archive defined at: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel> INFO: Repository com_google_absl instantiated at: /__w/mobius/mobius/streaming/WORKSPACE:16:15: in <toplevel> /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:217:22: in ray_deps_setup /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/com_github_ray_project_ray/bazel/ray_deps_setup.bzl:76:24: in auto_http_archive Repository rule http_archive defined at: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel> ERROR: /github/home/.cache/bazel/_bazel_root/fa5a074cd6f1255c5f5cefe240bb7613/external/bazel_tools/tools/jdk/BUILD:90:11: errors encountered resolving select() keys for @bazel_tools//tools/jdk:jni 
``` The Bazel developers suggest updating the platforms dependency manually in this issue: https://github.com/bazelbuild/bazel/issues/14097. That is, if we don't download the new platforms release, we reuse the old platforms plugin and then fail to select a valid JNI setting for the mips64 or riscv64 instruction sets. Co-authored-by: lingxuan.zlx <lingxuan.zlx@antgroup.com>	2022-04-03 12:58:22 +08:00
Larry	d0b324990f	[Java] Add doc for Ray.get api that throws an exception if it times out (#23666 ) Co-authored-by: 稚鱼 <lianjunwen.ljw@antgroup.com>	2022-04-02 18:29:19 +08:00
Yi Cheng	e1a974aa9c	[gcs] Remove not useful options in redis client options. (#23572 ) This PR removes not useful options in Redis client options.	2022-04-01 14:41:15 -07:00
shrekris-anyscale	071e1dd20f	[serve] Create `deployment.py` and `deployment_graph.py` (#23578 ) `api.py` has accumulated classes and functions that aren't purely public APIs, causing circular dependencies. This change pulls `Deployment` and deployment graph-related features out of `api.py` and puts them in two new files: `deployment.py` and `deployment_graph.py`.	2022-04-01 13:40:13 -07:00
Tao Wang	2ce3cd0073	[Hotfix]Fix compile failure (#23651 )	2022-04-01 13:23:11 -07:00
Kai Fricke	9071b39f3e	[ci/release] Add buildkite output groups (#23658 ) This makes the buildkite output easier to parse and interpret.	2022-04-01 13:04:22 -07:00
Stephanie Wang	b43426bc33	[core] Add metrics for disk and network I/O (#23546 ) Adds some metrics useful for object-intensive workloads. Per raylet/object manager: (1) num bytes pending restore, added to the spill manager; (2) cumulative num requests, added to PullManager; (3) cumulative num bytes pushed/pulled from other nodes; (4) histograms for request latencies in PullManager: total lifetime of a request (from start to cancel), request satisfaction time (from start to object local), and pull time (from object activation to object local). Per node: disk read/write speed and IOPS.	2022-04-01 11:15:34 -07:00
Jiajun Yao	c5c5c24e8f	Remove unused ObjectDirectory::LookupLocations() (#23647 ) Remove dead code.	2022-04-01 10:03:37 -07:00
shrekris-anyscale	d4747d28eb	[serve] Set `"memory"` to `None` in `ray_actor_options` by default (#23619 ) * Make default memory 1 * Add test to validate that ReplicaConfig's default memory cannot be lower than minimum * Add a new option to memory_omitted_options * Update if branch in test_replica_config_default_memory_minimum * Make memory default value None	2022-04-01 09:14:44 -07:00
Kai Fricke	fe27dbcd9a	[air/release] Improve file packing/unpacking (#23621 ) We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes. Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).	2022-04-01 07:38:14 -07:00
Sven Mika	0bb82f29b6	[RLlib] AlphaStar polishing (fix logger.info bug). (#22281 )	2022-04-01 09:49:41 +02:00
Lingxuan Zuo	4510c2df90	[Python] export cython module for external project (#23579 ) Many Cython data types are defined in the Ray Cython module, but external projects cannot reuse them because Ray doesn't export all of its .pxd files. This fixes the mobius Python build error (https://github.com/ray-project/mobius/runs/5740167581?check_suite_focus=true), in which ray common.pxd and other files are not found. According to the Cython documentation https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html we can add this package_data parameter in setup.py ```python setup( package_data = { 'my_package': ['*.pxd'], 'my_package/sub_package': ['*.pxd'], }, ... ) ``` Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>	2022-04-01 10:31:33 +08:00
mwtian	1a4c3c07f7	[Dashboard] fix iterating over GPU processes (#23562 ) The current logic looks broken, as reported in #22954 (comment). I fixed the logic as best I can and tested it on the Anyscale platform with a GPU. No process info was reported from gpustat, but the logic works in this case.	2022-03-31 17:16:53 -07:00
Hao Chen	75f1861625	Remove predefined resources vector in ResourceRequest (#23584 ) "ResourceRequest" now uses 2 containers: a vector for predefined resources, and a map for custom resources. This was intended to be a perf optimization. However, in practice, this makes the code more complex, and, moreover, prevents optimizations for some methods (e.g., "ResourceIds", "Size"). This PR removes the vector and makes ResourceRequest use only one map for all resources. Also, "ResourceIds" now returns a "boost:range" to allow iterating resource IDs without having to construct temporary sets. microbenchmark shows a slight perf improvement. last nightly: `placement group create/removal per second 837.76 +- 16.68`. this PR: `placement group create/removal per second 895.76 +- 16.99`.	2022-03-31 17:16:11 -07:00
Yi Cheng	5a2ab76af8	[flaky] Release gcs client in test (#23644 ) To deflaky gcs_client_test, this PR tries to release the client object.	2022-03-31 16:57:50 -07:00
xwjiang2010	378b66984f	[air] reduce unnecessary stacktrace (#23475 ) There are a few changes: 1. Between runner thread and main thread: The same stacktrace is raised in `_report_thread_runner_error` in the main thread, so we can spare this raise in the runner thread. 2. Between function runner and Tune driver: Do not wrap RayTaskError in TuneError. 3. Within Tune driver code: Introduces a per-errored-trial error.pkl and uses that to populate ResultGrid. Plus some cleanups to facilitate propagating exceptions in runner and executor code. Final stacktrace looks like: (omitted) In Tune, we capture `traceback.format_exc` at the time the exception is caught and just pass the string around. This PR changes that only when a RayTaskError is raised, in which case we pass that object around. It may be worthwhile to settle on a practice of error handling in Tune in general. I am also curious to learn how other Ray libraries do that and whether there are any good lessons to learn. In particular, we should watch out for memory leaks in exception handling. Not sure if it is still a problem in Python 3, but here is an article I came across for reference: https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/	2022-03-31 22:59:58 +01:00
Yi Cheng	8d7f71601d	deflaky ray syncer test (#23641 )	2022-03-31 13:42:30 -07:00
Sven Mika	2eaa54bd76	[RLlib] POC: Config objects instead of dicts (PPO only). (#23491 )	2022-03-31 18:26:12 +02:00
mwtian	bd4d6b7e19	[Java] upgrade protobuf-java version (#23627 )	2022-03-31 09:12:58 -07:00
Antoni Baum	756d08cd31	[docs] Add support for external markdown (#23505 ) This PR fixes the issue of diverging documentation between Ray Docs and ecosystem library readmes which live in separate repos (eg. xgboost_ray). This is achieved by adding an extra step before the docs build process starts that downloads the readmes of specified ecosystem libraries from their GitHub repositories. The files are then preprocessed by a very simple parser to allow for differences between GitHub and Docs markdowns. In summary, this makes the markdown files in ecosystem library repositories single sources of truth and removes the need to manually keep the doc pages up to date, all the while allowing for differences between what's rendered on GitHub and in the Docs. See ray-project/xgboost_ray#204 & https://ray--23505.org.readthedocs.build/en/23505/ray-more-libs/xgboost-ray.html for an example. Needs ray-project/xgboost_ray#204 and ray-project/lightgbm_ray#30 to be merged first.	2022-03-31 08:38:14 -07:00
Andrew Sedler	853f6d6de3	[Bug][Tune] Fix bugs that cause hanging `PAUSED` trials with `PopulationBasedTrainingScheduler` (#23472 ) As discussed in #23424, the synch=True mode of PopulationBasedTrainingScheduler is (1) not compatible with burn_in_period and (2) causes the presence of TERMINATED trials to hang PAUSED trials indefinitely. This change addresses (1) by setting the initial _next_perturbation_sync to the max of burn_in_period and perturbation_interval in the constructor and (2) by checking only whether live trials have reached the _next_perturbation_sync before resuming PAUSED trials.	2022-03-31 08:33:51 -07:00
simonsays1980	9ca9c67bc9	[RLlib] Added dtype safeguards to the 'required_model_output_shape()' methods… (#23490 )	2022-03-31 13:52:00 +02:00
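
The pack/unpack flow described in commit dc994dbb02 (and the in-memory tarfile handling from fe27dbcd9a) can be illustrated with a minimal local sketch. This is not the `RemoteTaskSyncer` implementation: it leaves out Ray remote tasks entirely, the function names `file_digests`, `pack_changed`, and `unpack` are hypothetical, and comparing files by SHA-256 digest is an assumption made for illustration (the commit message does not say how differences are detected).

```python
import hashlib
import io
import os
import tarfile


def file_digests(root):
    """Map each relative file path under root to a SHA-256 digest (hypothetical helper)."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests


def pack_changed(source_root, target_digests):
    """Pack files that are missing or different in the target into an in-memory tarball."""
    buffer = io.BytesIO()  # no temporary file on disk
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for rel, digest in sorted(file_digests(source_root).items()):
            if target_digests.get(rel) != digest:
                tar.add(os.path.join(source_root, rel), arcname=rel)
    return buffer.getvalue()


def unpack(payload, target_root):
    """Extract a tarball given as bytes into target_root; return the extracted names."""
    os.makedirs(target_root, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        tar.extractall(target_root)
        return sorted(tar.getnames())
```

A sync in either direction would then be `unpack(pack_changed(source, file_digests(target)), target)`. In the commit's design, the pack and unpack steps run as remote tasks on the source and target nodes, which this sketch does not attempt to reproduce.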


12181 commits