hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
xwjiang2010	7c67a4f1d0	[tuner] update tuner doc (#23753 )	2022-04-07 11:10:17 -07:00
Kai Fricke	73d1610e69	[ci/release] Fix pipeline build for empty PR repo (#23775 ) What: If BUILDKITE_PULL_REQUEST_REPO is empty string, default to DEFAULT_REPO Why: BUILDKITE_PULL_REQUEST_REPO is set to an empty string per default, thus we're currently not detecting the buildkite repo correctly in branched builds.	2022-04-07 09:29:48 -07:00
Antoni Baum	434d457ad1	[tune] Improve missing search dependency info (#23691 ) Replaces FLAML searchers with a dummy class that throws an informative error on init if FLAML is not installed, removes ConfigSpace import in BOHB example code, adds a note to examples using external dependencies.	2022-04-07 08:53:27 -07:00
shrekris-anyscale	a6bcb6cd1e	[serve] Create `application.py` (#23759 ) The `Application` class is stored in `api.py`. The object is relatively standalone and is used as a dependency in other classes, so this change moves `Application` (and `ImmutableDeploymentDict`) to a new file, `application.py`.	2022-04-07 10:34:24 -05:00
shrekris-anyscale	0902ec537d	[serve] Include full traceback in deployment update error message (#23752 ) When deployments fail to update, [Serve sets their status to UNHEALTHY and logs the error message](`46465abd6d/python/ray/serve/deployment_state.py (L1507-L1511)`). However, the message lacks a traceback, making it impossible to find what caused it. [For example](https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_SfGPJq8WWJUhAvmHHsDgJWUe?command-history-section=command_history): ``` File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}") RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment: '>' not supported between instances of 'NoneType' and 'int'. ``` It's not clear where `'>' not supported between instances of 'NoneType' and 'int'.` is being triggered. The change includes the full traceback for this type of update failure. The new status message is easier to debug: ``` File "/Users/shrekris/Desktop/ray/python/ray/serve/api.py", line 328, in _wait_for_deployment_healthy raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}") RuntimeError: Deployment A is UNHEALTHY: Failed to update deployment: Traceback (most recent call last): File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1503, in update running_replicas_changed \|= self._check_and_update_replicas() File "/Users/shrekris/Desktop/ray/python/ray/serve/deployment_state.py", line 1396, in _check_and_update_replicas a = 1/0 ZeroDivisionError: division by zero ``` (I forced a divide-by-zero error to get this traceback).	2022-04-07 10:34:00 -05:00
Artur Niederfahrenhorst	02a50f02b7	[RLlib] ReplayBuffer: `_hit_counts` working again. (#23586 )	2022-04-07 10:56:25 +02:00
Sven Mika	0b3a79ca41	[RLlib] Issue 23639: Error in client/server setup when using LSTMs (#23740 )	2022-04-07 10:16:22 +02:00
Sven Mika	e391b624f0	[RLlib] Re-enable (for CI-testing) our two self_play example scripts. (#23742 )	2022-04-07 08:20:48 +02:00
Sven Mika	b97fb4345b	[RLlib] Adding Artur and Steven to RLlib code owners. (#23755 )	2022-04-07 08:19:04 +02:00
Guyang Song	916ef0bf10	add C++ worker code owners (#23730 )	2022-04-07 11:30:30 +08:00
Kai Fricke	7b86a05efd	[ci/release] Parse PR github repos correctly (#23757 ) What: Correctly infer github repo from PRs in Buildkite environments Why: For PRs, we need to checkout the correct github repo and branch so we can kick off release tests directly from PRs. Test run (from this PR!): https://buildkite.com/ray-project/release-tests-pr/builds/20#7f5a6526-0040-4896-b23a-f4896c75973d	2022-04-06 17:34:20 -07:00
shrekris-anyscale	64d98fb385	[serve] Add unit tests and better error messages to `_store_package_in_gcs()` (#23576 ) This change adds new unit tests and error messages to `_store_package_in_gcs()`. In particular, it tests the function's behavior when it fails to connect to the GCS.	2022-04-06 17:34:10 -07:00
SangBin Cho	47ff1241f9	[Test] Use spot instances for chaos tests. (#23679 ) Use spot instances for chaos tests. We can also experiment with other tests that are not supposed to have dead nodes, but let's do it once the nightly infra is stabilized	2022-04-06 15:56:31 -07:00
Eric Liang	5001e46324	[GH] Trim the github issue templates to make them more developer friendly (#23718 ) Currently, the github issue templates are quite verbose, and as a result few developers end up using them. Clean them up so that they can be the standard workflow for all Ray contributors.	2022-04-06 15:49:43 -07:00
Kai Fricke	d27e73f851	[ci] Pin prometheus_client to fix current test outages (#23749 ) What: Pins prometheus_client to < 0.14.0, hopefully fixing today's CI outages Why: New version of the python client (https://github.com/prometheus/client_python/releases) breaks our CI	2022-04-06 14:22:22 -07:00
Amog Kamsetty	8becbfa927	[Train] MLflow start run under correct experiment (#23662 ) Start Mlflow run under correct mlflow experiment	2022-04-06 11:50:32 -07:00
Avnish Narayan	fdc6e02c29	[RLlib; testing] Move `num_workers` to RLlib config (#23750 )	2022-04-06 20:06:48 +02:00
Kai Fricke	0b804e5162	[ci/release] Move ML long running tests to sdk file manager (#23745 ) What: Long running tests should use sdk file manager Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.	2022-04-06 10:50:49 -07:00
Siyuan (Ryans) Zhuang	46465abd6d	[workflow] Deprecate "workflow.step" [Part 2 - most nested workflows] (#23728 ) * remove workflow.step * convert examples	2022-04-06 00:47:43 -07:00
Kai Fricke	c0e38e335c	Revert "Revert "[air] Better exception handling"" (#23733 ) This reverts commit `5609f438dc`.	2022-04-05 21:45:24 -07:00
Kai Fricke	7cf89dd686	[ci] Non-verbose llvm download in Buildkite (#23731 ) What: Use wget -nv in Buildkite environments Why: The llvm download currently clutters the log output as it's not rendered correctly, thus we should silence it. Result: Logs are finally readable again in Buildkite without download: https://buildkite.com/ray-project/ray-builders-pr/builds/28916#25e8965a-d18b-49a1-8e29-200365b13c53	2022-04-05 21:41:51 -07:00
Kai Fricke	5609f438dc	Revert "[air] Better exception handling (#23695 )" (#23732 ) This reverts commit `fb50e0a70b`.	2022-04-05 20:20:40 -07:00
xwjiang2010	99f64821b1	[tune] add tuner test (#23726 ) Adds test for TorchTrainer+Tuner	2022-04-05 19:42:51 -07:00
Kai Fricke	fb50e0a70b	[air] Better exception handling (#23695 ) What: Raise meaningful exceptions when invalid parameters are passed. Why: We want to catch invalid parameters and guide users to use the API in the correct way.	2022-04-05 19:11:55 -07:00
Antoni Baum	252596af58	[AIR] Add `config` to `Result`, extend `ResultGrid.get_best_config` (#23698 ) Adds a dynamic property to easily obtain `config` dict from `Result`, extends the `ResultGrid.get_best_config` method for parity with `ExperimentAnalysis.get_best_trial` (allows for using of mode and metric different to the one set in the Tuner).	2022-04-05 16:08:05 -07:00
Stephanie Wang	9813f2cce4	[datasets] Unify Datasets primitives on a common shuffle op (#23614 ) Currently Datasets primitives repartition, groupby, sort, and random_shuffle all use different internal shuffle implementations. This PR unifies them on a single internal ShuffleOp class. This class exposes static methods for map and reduce which must be implemented by the specific higher-level primitive. Then the ShuffleOp.execute method implements a simple pull-based shuffle by submitting one map task per input block and one reduce task per output block. Closes #23593.	2022-04-05 15:53:28 -07:00
Kai Fricke	dc994dbb02	[tune] Add RemoteTask based sync client (#23605 ) If rsync/ssh is not available (as in kubernetes setups), Tune previously had no fallback mechanism to synchronize trial directories to the driver. This PR introduces a `RemoteTaskSyncer` trial syncer that uses ray remote tasks to ship file contents between nodes. The implementation utilizes tarfile to compress files for transfer, and it only transfers files that differ between the source and target directory to minimize network bandwidth usage. The trial syncer works as follows: 1. It collects information about existing files in the target directory. This directory could be remote (when syncing up) or local (when syncing down). 2. It then schedules a `pack` task on the source node. This will always be a remote task so the future can be passed to the unpack task. The pack task will only pack files that do not exist or are different in the target directory into a tarfile, which is returned as a bytes string. 3. An `unpack` task is scheduled on the target node. This will always be a remote task so the future can be awaited in a call to `wait()`. A test is added to ensure that only modified files are transferred on subsequent sync ups/downs. Finally, minor changes are made to the `Syncer`/`NodeSyncer` classes to allow passing `(ip, path)` tuples rather than rsync-style remote paths.	2022-04-05 21:35:25 +01:00
Archit Kulkarni	582bf4e8f8	Add basic jobs release test with Tune script (#23474 ) Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests: (1) Uses a local `working_dir` (2) Uses a remote working_dir from a zip github URL.	2022-04-05 13:31:11 -05:00
Chris K. W	9b79048963	Update error message for @ray.method (#23471 ) Updates @ray.method error message to match the one for @ray.remote. Since the client mode version of ray.method is identical to the regular ray.method, deletes the client mode version and drops the client_mode_hook decorator (guessing that the client copy was added before client_mode_hook was introduced). Also fixes what I'm guessing is a bug that doesn't allow both num_returns and concurrency_group to be specified at the same time (assert len(kwargs) == 1). Closes #23271	2022-04-05 11:12:55 -07:00
Stephanie Wang	1c972d5d2d	[core] Spill at least the object fusion size instead of at most (#22750 ) Copied from #22571: Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, currently we spill at most the object fusion size when instead we should be spilling at least the fusion size. Then we use the max number of fused objects as a cap. This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects, and we are under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved. Increases some test timeouts to allow tests to pass.	2022-04-05 10:57:42 -07:00
Antoni Baum	ca6dfc8bb7	[AIR] Interface for `HuggingFaceTorchTrainer` (#23615 ) Initial draft of the interface for HuggingFaceTorchTrainer. One alternative for limiting the number of datasets in datasets dict would be to have the user pass train_dataset and validation_dataset as separate arguments, though that would be inconsistent with TorchTrainer.	2022-04-05 10:32:13 -07:00
liuyang-my	bdd3b9a0ab	[Serve] Unified Controller API for Cross Language Client (#23004 )	2022-04-05 09:20:02 -07:00
Sven Mika	434265edd0	[RLlib] Examples folder: All `training_iteration` translations. (#23712 )	2022-04-05 16:33:50 +02:00
jon-chuang	9c950e8979	[Core] Placement Group: Fix Flaky Test placement_group_test_5 and Typo (#23350 ) placement_group_test_5 is flaky. The reason is that it requests a PG with exactly the same object store memory as the node. If the object store already holds an object, PG scheduling fails. Also fix bug - typo.	2022-04-05 05:33:43 -07:00
Gagandeep Singh	11baa22c1e	Split test_advanced_n.py and enabled cluster tests (#23524 )	2022-04-05 01:34:57 -07:00
Gagandeep Singh	8c87117bc3	Uniformly distributed tasks among actors to utilize full concurrency (#23416 ) * Uniformly distributed tasks among actors to utilize full concurrency * Added test to ensure all tasks are launched at the same time * Applied linting format	2022-04-05 01:05:41 -07:00
Matti Picus	96948a4a30	WINDOWS: skip flaky test (#23557 ) Continuation of #23462 to try to get test_ray_init to pass consistently in CI. The skipped test passes locally, so only skip it on CI.	2022-04-05 00:56:43 -07:00
Steven Morad	39841b65b3	[RLlib] PPOTorchPolicy: Remove extra call to `model.value_function` (#23671 )	2022-04-05 08:40:29 +02:00
mesjou	e725472b5b	[RLlib] Fix bug in prisoners dilemma example. (#23690 )	2022-04-05 08:36:20 +02:00
Jiajun Yao	5f37231842	Remove yapf dependency (#23656 ) Yapf has been replaced by black.	2022-04-04 21:50:04 -07:00
Clark Zinzow	08159eb668	[Datasets] Disallow callable classes for task compute strategy. (#23708 )	2022-04-04 21:12:36 -07:00
Yi Cheng	99ca8ee8e4	[flaky] Deflake `ray_syncer_test` (#23703 ) ``` src/ray/common/test/ray_syncer_test.cc:495: Failure \| Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5 ``` This check measures the number of requests sent. In the extreme case, the two values are equal. This PR fixes the check.	2022-04-04 19:38:58 -07:00
Siyuan (Ryans) Zhuang	ae86fb258e	[workflow] Fix workflow continuation resolving (#23682 ) * update test * return StaticWorkflowRef * reformat test	2022-04-04 17:39:24 -07:00
Amog Kamsetty	4530349506	[AIR] Set name of Trainable to match with Trainer #23697	2022-04-04 16:23:21 -07:00
matthewdeng	a12f5ff5d6	[train] add FAQ (#22757 ) Adding a FAQ page. Currently has some basic questions that have come up in the past. Explaining how to use Matplotlib due to threading in the distributed training function.	2022-04-04 16:14:35 -07:00
Jiajun Yao	a668e5d8db	Add perf metrics for stress tests (#23648 ) Added perf metrics for stress tests so they can be alerted on.	2022-04-05 08:09:27 +09:00
shrekris-anyscale	4aaa895137	[runtime_env] Reorganize tests in test_runtime_env_working_dir_2.py and test_runtime_env_working_dir_3.py (#23618 )	2022-04-04 17:35:49 -05:00
Kai Fricke	99a2aa013f	[ci] Remove existing artifacts pre-command with docker (#23655 ) Previously, pre-existing artifacts were not deleted pre-command because of permission issues. This can be fixed by running the remove command in another docker container. Seems to work well here: https://buildkite.com/ray-project/ray-builders-pr/builds/28683#322c7a9d-cba7-4c23-8b00-7ebc6144a777	2022-04-04 15:22:04 -07:00
Kai Fricke	b3b1498eba	[tune] Beautify Optional typehints (#23692 ) What: Changes `Union[None, type1, ..., typeN]` type hints to `Optional[type1, ..., typeN]` Why: Better readability, consistency across library, consistency with code style guides.	2022-04-04 19:48:34 +01:00
Edward Oakes	09123e3452	[serve][minor] Remove "statuses" key from `serve status` output (#23642 )	2022-04-04 11:11:26 -05:00
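
The pack/unpack steps described in commit dc994dbb02 ([tune] Add RemoteTask based sync client) can be illustrated with a minimal, stdlib-only sketch. This is not the `RemoteTaskSyncer` implementation: it runs in a single process with no Ray remote tasks, and the function names (`file_digests`, `pack`, `unpack`, `sync_dir`) and the use of SHA-256 digests to decide which files differ are assumptions made for illustration. The commit message does not say how the real syncer compares files.

```python
import hashlib
import io
import os
import tarfile


def file_digests(root):
    """Step 1 (assumed form): map each relative file path under root to a content digest."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):  # yields nothing if root is missing
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                digests[os.path.relpath(full, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests


def pack(source_dir, target_digests):
    """Step 2: tar only files that are missing or different in the target; return bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for rel, digest in file_digests(source_dir).items():
            if target_digests.get(rel) != digest:
                tar.add(os.path.join(source_dir, rel), arcname=rel)
    return buffer.getvalue()


def unpack(payload, target_dir):
    """Step 3: extract the packed bytes into the target directory; return the file names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        tar.extractall(target_dir)
        return sorted(tar.getnames())


def sync_dir(source_dir, target_dir):
    """Run the three steps in order and report which files were transferred."""
    payload = pack(source_dir, file_digests(target_dir))
    return unpack(payload, target_dir)
```

Calling `sync_dir` twice on an unchanged source directory transfers nothing the second time, which is the property the test added in that commit is said to check for the real syncer.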

12007 commits
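
Commit 0902ec537d ([serve] Include full traceback in deployment update error message) describes putting the full traceback into the status message instead of only the exception text. A minimal sketch of that idea using the standard `traceback` module follows. The function names and the returned tuple are hypothetical and are not the Serve code; only the "Failed to update deployment" prefix and the forced divide-by-zero are taken from the commit message.

```python
import traceback


def check_and_update_replicas():
    # Hypothetical stand-in that fails, mirroring the forced
    # divide-by-zero mentioned in the commit message.
    return 1 / 0


def update_status():
    """Return a (status, message) pair; on failure the message carries the full traceback."""
    try:
        check_and_update_replicas()
        return "HEALTHY", ""
    except Exception:
        # format_exc() renders the active exception with its stack frames,
        # so the failing file, line, and function appear in the message.
        return "UNHEALTHY", f"Failed to update deployment:\n{traceback.format_exc()}"
```

With only `str(exception)` in the message, a reader would see `division by zero` but not where it was raised; the formatted traceback adds the frames that locate it.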