hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Siyuan (Ryans) Zhuang	b621dc099b	[DAG] Update the example in the doc (#22930 ) * update doc	2022-03-08 20:09:45 -08:00
Guyang Song	56287d63e5	[runtime env] remove _rewrite_pip_list_ray_libraries (#22890 ) We don't need this logic after using virtualenv.	2022-03-09 11:41:33 +08:00
Edward Oakes	aa907987bf	[serve][release tests] Use m5.8xlarge instance types for 1k replica tests (#22918 )	2022-03-08 21:34:01 -06:00
Stephanie Wang	bf09f5071a	[core] Deflake test_plasma_unlimited (#22911 ) test_plasma_unlimited::test_task_unlimited is flaky because one of the assertions is race-y and can trigger after the condition is no longer true (see #22883). This fixes the flake by: - adding an assertion in between two object allocations to force the object store queue to flush - keeping one of the ObjectRefs in scope to make sure that the object is still fallback-allocated by the time we reach the failing assertion	2022-03-08 22:00:04 -05:00
Chen Shen	bc3f7a7684	[scheduling policy 3/n][rfc] Refactor SchedulingPolicy into interface and implementations (#22907 ) * scheduling policy * update Co-authored-by: Gagandeep Singh <gdp.1807@gmail.com>	2022-03-08 18:47:56 -08:00
Junwen Yao	0395d0987e	[Train] Add support for automatic pipelining of host to device transfer (#22716 ) This PR adds the support for concurrently transferring the input from host to device.	2022-03-08 18:37:23 -08:00
Balaji Veeramani	48af260aaf	[Train] Clarify shuffle documentation in `prepare_data_loader` (#22876 ) We essentially use a hack to determine whether shuffling should be enabled in prepare_data_loader. I've clarified the documentation so the hack is easier to understand.	2022-03-08 18:13:29 -08:00
Alex Wu	b84aaef38a	Promote python 3.9 support to stable (#22923 ) Remove the experimental note from python 3.9 since it and its core dependencies have been stable for quite some time now. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-03-08 17:24:54 -08:00
SangBin Cho	549527687f	Migrate scalability tests (#22901 ) This PR migrates scalability tests to the new infra. I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or wait_for_cluster file). Alternatively, we can support a different path than /release from the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke	2022-03-08 17:22:41 -08:00
Eric Liang	52491c87e2	Make a pass fixing Dataset API issues (#22886 )	2022-03-08 13:07:55 -08:00
shrekris-anyscale	ab2741d64b	[serve] Support `working_dir` in `serve run` (#22760 ) #22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.	2022-03-08 13:18:41 -06:00
Junwen Yao	d1009c8489	[Train] Add support for metrics aggregation (#22099 ) This PR allows users to aggregate metrics returned from all workers.	2022-03-08 11:03:04 -08:00
Simon Mo	c8aa6cdf64	Fix Issue Severity Question to Bug Report Template (#22906 )	2022-03-08 10:36:32 -08:00
Wendi-anyscale	dd8654fd85	Add Issue Severity Question to Bug Report Template (#22887 )	2022-03-08 10:31:53 -08:00
Balaji Veeramani	37c6169027	[Train] Refactor and add `Accelerator` classes (#22009 ) To support mixed precision (see #20643), we need to store a GradScaler instance that is accessible by both prepare_optimizer and backward functions (these functions will be added later). This PR introduces the Accelerator, an object that implements methods to perform backend-specific training optimizations.	2022-03-08 10:26:00 -08:00
Balaji Veeramani	04b10ff9e9	[Train] Tell user to specify cluster address if placement group times out (#22845 ) If you don't add `ray.init("auto")` to your training script, then your training script might complain that there aren't enough resources, even if `ray status` shows that there are. Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2022-03-08 10:24:12 -08:00
matthewdeng	7b5813e94f	[ml] add initial Dataset Preprocessors (#22748 )	2022-03-08 09:59:03 -08:00
Kai Fricke	c57abb693b	[ci/release] Add frequency to core nightly test (#22905 ) Breaks the scheduled build: https://buildkite.com/ray-project/release-tests-branch/builds/82#3994f5e1-6da3-4c70-8c30-bdcfb1fec851 We should enforce schema validation soon.	2022-03-08 17:44:20 +00:00
Gagandeep Singh	2899dc1bb5	Fixed MRO for `DerivedActorClass` (#22113 ) Comments to be noted from the discussion below, https://github.com/ray-project/ray/pull/22113#discussion_r802512907 > Problem - We cannot always delegate the call to cls.__init__ or modified_cls.__init__. If we always delegate the call to cls.__init__ from here, then the user-defined class's __init__ method will be ignored, leading to issues like https://github.com/ray-project/ray/issues/21868. If we always delegate the call to modified_cls.__init__, then it will allow inheriting from actor classes, leading to failure of test_actor_inheritance. So, I have added this if-else check to figure out which __init__ method should be called. If "__module__", "__qualname__" and "__init__" are present in args[-1], then it would mean an actor class is being inherited, so cls.__init__ should be called. However, if no such signal is received in args, then the user-defined class's __init__, i.e., modified_class.__init__, should be called. https://github.com/ray-project/ray/pull/22113#discussion_r808696261 > So I noted that ActorClass.__init__ will anyway raise a TypeError whenever it is inherited. To figure out exactly whether the exception is due to inheritance of ActorClass, I created a new class ActorClassInheritanceException(TypeError). Now, whenever this is raised, DerivedActorClass will get a clear signal about inheritance of ActorClass. In other cases, it will be safe to conclude (AFAICT) that the user called the __init__ method of their class and we will proceed normally. IMHO, this is a better and more robust solution which just depends on a simple signal, i.e., raising a particular exception in a specific event. It doesn't matter how inheritance is prevented, as in the end we just need to raise ActorClassInheritanceException and all other code will be able to detect that easily. https://github.com/ray-project/ray/pull/22113#issuecomment-1048527387	2022-03-08 09:37:19 -08:00
Chen Shen	cd0354e06d	[scheduling-policy 2/n] refactor scheduling policy API (#22885 ) * add scheduling-options * address comments	2022-03-08 09:29:00 -08:00
ZhuSenlin	1e4d7bc1f4	[Core] make StringIdMap thread safe (#22893 ) * make StringIdMap thread safe * fix comment Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-08 09:23:41 -08:00
xwjiang2010	f5995dccdf	[tune] Trainables will now know TUNE_ORIG_WORKING_DIR (#22803 ) Also updated the docs.	2022-03-08 15:56:30 +00:00
Artur Niederfahrenhorst	37d129a965	[RLlib] ReplayBuffer API: Test cases. (#22390 )	2022-03-08 16:54:12 +01:00
Max Pumperla	d6bff736f3	[docs] test ray.io snippets (#22822 ) Tests all snippets we have on ray.io. There were some minor issues, which I'll fix upstream. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-08 15:50:57 +00:00
SangBin Cho	0137fc8e23	[Tests] Add microbenchmark to the new infra test (#22861 ) Verified it works. It also addresses the frequency comments from the previous PR	2022-03-08 05:58:49 -08:00
Artur Niederfahrenhorst	c0ade5f0b7	[RLlib] Issue 22625: `MultiAgentBatch.timeslices()` does not behave as expected. (#22657 )	2022-03-08 14:25:48 +01:00
Tao Wang	4576f53fe3	[HOTFIX]fix some compilation failures in core worker test (#22855 ) There're some compilation failures in core worker test when we build project using `bazel build //:all`. It seems broken and not integrated in CI.	2022-03-08 16:14:14 +08:00
Qing Wang	9aa0b4e89e	[Java] Add transient for cached hashcode of IDs to reduce serialized size. (#22766 ) Use `transient` keyword for reducing the serialized size of ids for transporting.	2022-03-08 14:36:08 +08:00
Jiajun Yao	7f57268bd0	Fix duplicate test bazel target (#22892 )	2022-03-08 14:29:13 +09:00
Jiajun Yao	4801e57c77	[Test] Add missing tests to bazel BUILD (#22827 )	2022-03-07 19:54:49 -08:00
Jian Xiao	c2908de401	For a dataset comprised of both empty and non-empty blocks, let the non-empty blocks determine the schema (#22834 ) There is a bug in combining the results from map_batches: if we create two datasets out of the same data, but with different numbers of partitions, we may get different results when running the same map_batches() on them. That is, the number of partitions affects the map_batches() results, which it should not.	2022-03-07 18:17:49 -08:00
mwtian	3f4a59c506	[Core] clean up pubsub to prepare for refactor (#22819 ) To prepare for additional changes in pubsub to fix #22339 and #22340, - Use structs instead of std::pair to hold per-subscription data, in case we need to expand the data fields. - Rename variables in tests to indicate non-object pubsub testing. - Pass full request to long poll handler in Publisher. - Simplify logic when possible. There should be no behavior change. Most of the code changes are based on #20276	2022-03-07 17:21:04 -08:00
Chen Shen	fbdf3e96f2	[scheduling-policy 1/n] pass check-node-liveness by constructor #22880	2022-03-07 16:55:29 -08:00
Jiajun Yao	2302b4eea8	Stop and join actor asyncio threads during exit (#22810 )	2022-03-07 14:45:08 -08:00
Stephanie Wang	cb218d03b9	[core] Enable lineage reconstruction by default (#22816 ) Enables lineage reconstruction, which allows automatic recovery of task outputs, by default. Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).	2022-03-07 17:40:30 -05:00
Stephanie Wang	fa14120f93	Move tests out of test_object_spilling to de-flake (#22831 ) This test is timing out often in debug_mode, so moved some tests to test_object_spilling_3.	2022-03-07 17:39:55 -05:00
SangBin Cho	529911ee78	[Nightly tests] Add missing patches (#22862 ) These changes are added to the old e2e.py, but not to the new infra	2022-03-07 19:48:43 +00:00
SangBin Cho	79e8405fda	Revert "[GCS] refactor the resource related data structures on the GCS (#22817 )" (#22863 ) This reverts commit `549466a42f`.	2022-03-07 08:48:17 -08:00
shrekris-anyscale	15d97a1021	[serve] Support `init_args` and `init_kwargs` in `serve run` (#22805 ) #22714 added `serve run` to the Serve CLI. This change allows the user to specify `init_args` and `init_kwargs` in `serve run` if they are deploying via import path.	2022-03-07 09:45:17 -06:00
Jiajun Yao	1b5efb588e	[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844 ) This makes timestamps easier to sort and compare, and avoids timezone issues.	2022-03-07 04:54:19 -08:00
Akash Patel	5ebc32d7c2	[Core] Update grpc to 1.44.0 (#22384 ) Updates grpc to 1.44.0 to remove local patch needed for grpc to build. EDIT: there have been changes to how python is found (mostly removing python2 support) and as such the local python-patch we have for grpc needs to be modified. This time contributing it to upstream (grpc/grpc#28895) so that it'll get added in a newer version! For anyone that comes across this: Here is the error itself for why we need the grpc-python.patch file: https://buildkite.com/ray-project/ray-builders-pr/builds/24659#d293616f-225d-41f9-8de2-03780f12b13f/2386-2416	2022-03-07 04:53:48 -08:00
ZhuSenlin	549466a42f	[GCS] refactor the resource related data structures on the GCS (#22817 )	2022-03-07 18:43:33 +08:00
SangBin Cho	9d0148dbbe	[Test] Migrate the first test to the new infra (#22770 ) This migrates the simplest nightly test to the new infra. I will also explore k8s migration with this test	2022-03-06 18:24:54 -08:00
shrekris-anyscale	2490b3e383	[serve] Enable serve-decorated deployment via import path (#22839 ) Currently, classes and functions can be deployed by setting `Deployment`'s `func_or_class` to their import path. However, if these classes or functions are already decorated with `@serve.deployment`, the import path deployment will error. This change instead ignores the settings in a class or function's `@serve.deployment` decorator when deploying via import path. It takes the code definition and deploys it without erroring. It also logs a warning about the ignored settings.	2022-03-06 20:03:57 -06:00
shrekris-anyscale	521298e093	[serve] Make route prefix the deployment name by default (#22840 ) The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which makes `route_prefix` the deployment's name when omitted. This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access. This change also fixes a bug in Serve where, when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.	2022-03-06 20:03:31 -06:00
Jiao	2d2b5745ae	[5/X][Pipeline][Ray DAG] Make Ray InputNode more powerful with attr accessor (#22793 ) - Enhanced ray dag InputNode to take arbitrary user input via `.execute()`. - If only one value is provided, like `dag.execute(1)`, return raw value; - Otherwise wrap user input into a `DAGInputData` object that can be accessed via index or key. - User can also pass list / dict object and just access them via index [0] or key ["key"] - Introduced `InputAttrNode` that helps to connect partial attribute of user input to the DAG. - Added context manager syntax for `InputNode`. - Add InputNode enforcements with tests, such as DAG level singleton, exception with messages, etc. - Enforce only simple int or str key - Take care of JSON serialization for InputNode that carried original context manager info, ensure it's preserved. - DAGNode UUID is also preserved in JSON serde. ## Next steps On ray dag level we're proceeding with ``` with InputNode() as input: # Probably better to rename it to DAGInput() a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` But also enforces 1) InputNode is always used in context manager as opposed to directly created 2) There should be one and only one InputNode instance for each dag. 3) No args passed by user to InputNode at ray dag level. Then in serve we subclass a ServeInputNode() to enhance it like the following to support HTTP input validation and conversion: ``` with ServeInputNode(schema=MySchemaCls) as input: a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2022-03-06 20:02:42 -06:00
mwtian	f67ff312a8	run mac c++ tests with static linking (#22829 ) There are problems with running C++ tests in MacOS 10.15 Catalina, when upgrading to the newest grpc due to dynamic linking: #22384 (comment). The problem does not exist for Python tests in Catalina, or in C++ tests of other systems. Upgrading MacOS CI from Catalina is also blocked in the short term: ray-project/buildkite-ci-stack#24 (comment) So working around the issue by using static linking for C++ tests on Mac.	2022-03-05 10:39:32 +09:00
Clark Zinzow	3d63313265	[Datasets] Batch across windows in DatasetPipelines. (#22830 ) This PR allows `DatasetPipeline.iter_batches()` to batch data across windows in the pipeline. This prevents partial batches from popping up in the middle of consuming a dataset pipeline due to window boundaries, and now allows us to provide the following guarantee to the user: `pipe.iter_batches()` will yield `len(pipe) // batch_size` full batches, with a partial batch occurring only (1) as the final batch and (2) only if `len(pipe) % batch_size > 0`, and if it exists, will have size `len(pipe) % batch_size`. The crux of this PR takes the block batching implementation from `Dataset.iter_batches()`, refactors it to operate on an iterator of blocks instead of a `Dataset`, pulls it out into a shared `batch_blocks()` utility, and has `DatasetPipeline.iter_batches()` use it to batch over windows by providing an iterator over all blocks in all windows.	2022-03-04 16:26:44 -08:00
Jiajun Yao	23f2862067	[Release Test] Send release test result to db pipeline for new test infra (#22813 ) * Send release test result to db pipeline for new test infra * address comment	2022-03-05 07:34:40 +09:00
Yi Cheng	5bbbfac5e8	[gcs] Fix resource updating incorrectly (#22644 ) When the local raylet has no task being scheduled for a scheduling class, the backlog resource is not reported. This usually happens when a core worker tries to schedule the task on another node and reports the backlog to the local node. This leads to wrong demands.	2022-03-04 14:32:54 -08:00
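
The batching guarantee stated in the `DatasetPipeline.iter_batches()` entry (#22830) can be illustrated with a minimal sketch in plain Python. This is not Ray's implementation and does not use the Ray API. The function name `batch_across_windows` and the use of lists of integers as "blocks" are assumptions made only for illustration. The sketch carries leftover rows across block and window boundaries, so a partial batch can appear only at the very end.

```python
from typing import Iterable, Iterator, List


def batch_across_windows(
    windows: Iterable[Iterable[List[int]]], batch_size: int
) -> Iterator[List[int]]:
    """Yield batches of `batch_size` rows from blocks spread over windows.

    Hypothetical helper: a "window" is an iterable of blocks, and a
    "block" is a plain list of rows. Leftover rows are carried over to
    the next block or window instead of being emitted early.
    """
    buffer: List[int] = []
    for window in windows:
        for block in window:
            buffer.extend(block)
            # Emit as many full batches as the buffer currently holds.
            while len(buffer) >= batch_size:
                yield buffer[:batch_size]
                buffer = buffer[batch_size:]
    # Only the final batch may be partial.
    if buffer:
        yield buffer


# Two windows of 5 rows each, 10 rows in total.
windows = [
    [[0, 1, 2], [3, 4]],
    [[5, 6, 7, 8], [9]],
]
batches = list(batch_across_windows(windows, batch_size=4))
total_rows = 10
full_batches = [b for b in batches if len(b) == 4]
```

With 10 rows and a batch size of 4, this yields `10 // 4 = 2` full batches followed by one partial batch of `10 % 4 = 2` rows, even though the window boundary falls after row 4.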

11709 commits