hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
ZhuSenlin	549466a42f	[GCS] refactor the resource related data structures on the GCS (#22817 )	2022-03-07 18:43:33 +08:00
shrekris-anyscale	2490b3e383	[serve] Enable serve-decorated deployment via import path (#22839 ) Currently, classes and functions can be deployed by setting `Deployment`'s `func_or_class` to their import path. However, if these classes or functions are already decorated with `@serve.deployment`, the import path deployment will error. This change instead ignores the settings in a class or function's `@serve.deployment` decorator when deploying via import path. It takes the code definition and deploys it without erroring. It also logs a warning about the ignored settings.	2022-03-06 20:03:57 -06:00
shrekris-anyscale	521298e093	[serve] Make route prefix the deployment name by default (#22840 ) By default, the REST API's schema denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which makes `route_prefix` the deployment's name when omitted. This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access. This change also fixes a bug in Serve where, when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.	2022-03-06 20:03:31 -06:00
Jiao	2d2b5745ae	[5/X][Pipeline][Ray DAG] Make Ray InputNode more powerful with attr accessor (#22793 ) - Enhanced ray dag InputNode to take arbitrary user input via `.execute()`. - If only one value is provided, like `dag.execute(1)`, return raw value; - Otherwise wrap user input into a `DAGInputData` object that can be accessed via index or key. - User can also pass list / dict object and just access them via index [0] or key ["key"] - Introduced `InputAttrNode` that helps to connect partial attribute of user input to the DAG. - Added context manager syntax for `InputNode`. - Add InputNode enforcements with tests, such as DAG level singleton, exception with messages, etc. - Enforce only simple int or str key - Take care of JSON serialization for InputNode that carried original context manager info, ensure it's preserved. - DAGNode UUID is also preserved in JSON serde. ## Next steps On ray dag level we're proceeding with ``` with InputNode() as input: # Probably better to rename it to DAGInput() a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` But also enforces 1) InputNode is always used in context manager as opposed to directly created 2) There should be one and only one InputNode instance for each dag. 3) No args passed by user to InputNode at ray dag level. Then in serve we subclass a ServeInputNode() to enhance it like the following to support HTTP input validation and conversion: ``` with ServeInputNode(schema=MySchemaCls) as input: a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2022-03-06 20:02:42 -06:00
Clark Zinzow	3d63313265	[Datasets] Batch across windows in DatasetPipelines. (#22830 ) This PR allows `DatasetPipeline.iter_batches()` to batch data across windows in the pipeline. This prevents partial batches from popping up in the middle of consuming a dataset pipeline due to window boundaries, and now allows us to provide the following guarantee to the user: `pipe.iter_batches()` will yield `len(pipe) // batch_size` full batches, with a partial batch occurring only (1) as the final batch and (2) only if `len(pipe) % batch_size > 0`, and if it exists, will have size `len(pipe) % batch_size`. The crux of this PR takes the block batching implementation from `Dataset.iter_batches()`, refactors it to operate on an iterator of blocks instead of a `Dataset`, pulls it out into a shared `batch_blocks()` utility, and has `DatasetPipeline.iter_batches()` use it to batch over windows by providing an iterator over all blocks in all windows.	2022-03-04 16:26:44 -08:00
Yi Cheng	5bbbfac5e8	[gcs] Fix resource updating incorrectly (#22644 ) When the local raylet has no task being scheduled for a scheduling class, the backlog resource is not reported. This usually happens when a core worker tries to schedule the task on another node and reports the backlog to the local node. This leads to incorrect demands.	2022-03-04 14:32:54 -08:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As Ray is becoming Redis-less by default, the dashboard no longer needs to talk to Redis. Instead it should talk to the GCS, and the GCS can talk to Redis.	2022-03-04 12:32:17 -08:00
Eric Liang	80aac655ca	Fix flaky metric test (#22809 )	2022-03-03 20:44:50 -08:00
Siyuan (Ryans) Zhuang	d72350bfe6	[workflow] Fix different step directories are used for "workflow.wait" during recovery (#22782 ) * add test	2022-03-03 16:37:50 -08:00
Jian Xiao	b933587597	Support map_groups in dataset (#22709 ) Make Dataset capable of running map_groups(), i.e. apply a UDF on each group after a groupby() operation.	2022-03-03 15:14:00 -08:00
mwtian	55166f0780	Revert "Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 )" (#22808 ) This reverts commit `b98c9c77f1`.	2022-03-03 12:32:28 -08:00
shrekris-anyscale	71a493cf1f	[serve] Add run, delete, and status to Serve CLI (#22714 ) This change adds `run`, `delete`, and `status` commands to the CLI introduced in #22648. * `serve run`: Blocking command that allows users to deploy a YAML configuration or a class/function via import path. When terminated, the deployment(s) is torn down. Prints status info while running. Supports interactive development. * `serve delete`: Shuts down a Serve application and deletes all its running deployments. * `serve status`: Displays the status of a Serve application's deployments.	2022-03-03 09:50:36 -06:00
Jiao	76dc4ccbfd	[4/X][Pipeline] JSON serialization for serve dag nodes (#22710 ) Added JSON serde for all DAGNode types needed with tests on ray core dag as well as serve dag. See code inline comments for behavior and assumption for each.	2022-03-03 09:49:43 -06:00
Dmitri Gekhtman	991a62dd47	Operator does not retry monitor on failure. (#22792 )	2022-03-02 23:37:03 -08:00
Jialing He	207d93a52c	[runtime env] Make env_vars take effect when pip install packages (#22730 ) Previously, for the stability of pip installation, we set the env to empty. However, when pip installs some gzip packages, env_vars may be needed, as in this issue: https://github.com/ray-project/ray/issues/22610	2022-03-02 21:47:34 -06:00
mwtian	b98c9c77f1	Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 ) This reverts commit `88d2e21585`.	2022-03-02 18:29:31 -08:00
Dmitri Gekhtman	a8d8d0e1a6	Fix K8s API (#22756 ) This PR fixes K8s support by updating the api client used for ingresses.	2022-03-02 09:59:16 -08:00
Jiajun Yao	440732f267	Fix mac osx worker process not being killed by ray stop (#22758 ) For mac osx, setproctitle doesn't change the process name returned by psutil (I think it's this issue https://github.com/dvarrazzo/py-setproctitle/issues/10) but only cmdline so we need to filter by cmdline instead.	2022-03-02 09:02:48 -08:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Stephanie Wang	d97afb9e60	[data] Pin pipeline executor actors to the driver node (#22715 ) DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure. This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.	2022-03-01 18:06:14 -08:00
Dmitri Gekhtman	4acbf36453	[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688 ) Closes #21353 and fixes an issue that causes dashboard to read K8s CPU requests rather than resources when determining CPUs available.	2022-03-01 17:15:59 -08:00
Eric Liang	06d4444b4a	Never re-use task workers for actors or GPU tasks (#22482 ) Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit. This adds a slight performance penalty for actor startup.	2022-03-01 16:46:18 -08:00
Eric Liang	e228544d39	Undo revert of windowing dataset by bytes (#22735 )	2022-03-01 12:24:04 -08:00
Archit Kulkarni	127b69bc21	[runtime env] Fix protobuf serialization/deserialization (#22672 ) This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath. The test checks that `to_dict` and `from_dict` are inverses. This PR contains all fixes required to make the test pass.	2022-03-01 12:34:50 -06:00
Kenneth	9b67cb5a6f	Add buffering to object spilling (#22618 ) This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude. ### Performance Changes A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling. \| Run \| Produce (s) \| Consume (s) \| Total (s) \| \| -- \| -- \| -- \| -- \| \| Baseline (original) \| 347.332281 \| 355.611272 \| 705.560750 \| \| Baseline (w/ fix) \| 181.815852 \| 347.692850 \| 532.847759 \| \| No fusing (original) \| 453.574554 \| 525.047998 \| 981.620108 \| \| No fusing (w/ fix) \| 452.614848 \| 519.787698 \| 975.412639 \| The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to have a 48% reduction in produce time with negligible impact on runtime when fusing is disabled. See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-01 10:13:10 -08:00
Eric Liang	482b0117e8	Basic log observability for spilling (#22612 )	2022-03-01 09:40:51 -08:00
Daniel	8d1f1b0a64	[RLlib] Update pettingzoo==1.15.0 supersuit==3.3.3 (#22519 )	2022-03-01 11:23:27 +01:00
Simon Mo	0bab8dbfe0	[Serve] Add test for controller managing Java Replica (#22628 )	2022-02-28 23:13:56 -08:00
Jian Xiao	aeb0a0dcbe	Add a static factory method to BlockBuilder to instantiate concrete builders (#22634 ) This is useful in combining multiple applied groups produced by groupby().map_groups() into a single one. For example, builder = BlockBuilder.for_block(type(batch)), and then for each applied group, builder.add_block(applied_group).	2022-02-28 19:00:24 -08:00
Simon Mo	00935275ae	[Serve] Autoscaling: basic intelligent scale down (#22669 )	2022-02-28 20:46:06 -06:00
shrekris-anyscale	49ee443231	[serve] Add Serve CLI commands for REST API (#22648 )	2022-02-28 20:45:46 -06:00
Jian Xiao	7597f1590b	[Dataset] fix some comments (#22700 )	2022-02-28 17:13:43 -08:00
Chris K. W	fa6b3c7c89	[aws][autoscaler] fix regional default AMI's (#22506 ) The AMI's for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults get merged in before _check_ami is called, causes problems if region isn't us-west-2). Removes the default AMI from defaults.yaml, and aborts if user doesn't specify an AMI in a region without a default.	2022-02-28 15:52:57 -08:00
Clark Zinzow	cf3577f0ee	[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665 )	2022-02-28 15:15:30 -08:00
Simon Mo	fe3d501d68	[Core] Include java worker log with log monitor (#22629 )	2022-02-28 12:30:04 -08:00
SangBin Cho	ba4f1423c7	Revert "Support creating a DatasetPipeline windowed by bytes (#22577 )" (#22695 ) This reverts commit `b5b4460932`.	2022-02-28 11:56:12 -08:00
Jiaxin Shan	82daf2b041	[KubeRay] Remove configmap reference in example (#22688 ) A follow-up change to #22348. The example is not up to date, and the cluster cannot be brought up because of a missing configmap. The autoscaler is able to convert the CR to an autoscaler config, so the configmap is no longer needed.	2022-02-28 10:13:08 -08:00
SangBin Cho	08374e8af4	Revert "[core] Fix bug in fusion for spilled objects (#22571 )" (#22694 ) Makes 2 tests flaky	2022-02-28 10:11:14 -08:00
Kai Fricke	e84e967932	[ml] Add basic Ray ML interfaces (#22436 ) This PR adds the basic shared Ray ML interfaces.	2022-02-28 13:16:40 +01:00
Jialing He	aa1885ae2a	[runtime env] Make plugin setup process that has not been refactored run in threads. (#22588 ) I recently realized that during a runtime_env creation process, a plugin/manager that is very slow to set up may block the creation of other runtime_envs, so I made plugin/manager setup run in threads. [The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.	2022-02-28 17:33:13 +08:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
mopga	6f68c74a5d	Use GPUtil for gpu detection when available (#18938 ) In environments with K8s and SELinux enabled there is a bug: "/proc/nvidia/" is not allowed to be mounted in the container. This change reworks GPU detection to use the GPUtil package. ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Release tests Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru> Co-authored-by: Julius <juliustfrost@gmail.com>	2022-02-27 14:54:35 -08:00
Max Pumperla	372c620f58	[docs] Tune overhaul part II (#22656 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-02-26 23:07:34 -08:00
Jiao	25d60d9cc9	[3/X][Pipeline] Handle deployment handle replacement in DeploymentNode init args, support nested (#22646 ) - Moved all `Deployment` instance creation to `DeploymentNode` level with only relevant info passed into it from `generate.py`. This abstraction makes more sense and is less leaky. - In `DeploymentNode`, we leverage ray core DAG's `_PyObjScanner` to find and replace only Deployment nodes init args & kwargs to deployment handle, which is only specific to `Deployment` instance, but not `DeploymentNode` itself. However this is the simplest and most robust way to handle nested args at `DAGNode` level. - This implementation lives in ray core DAGNode level so we don't need to expose `_PyObjScanner` directly. - Added serve pipeline tests to BUILD CI.	2022-02-26 09:57:59 -06:00
Eric Liang	b5b4460932	Support creating a DatasetPipeline windowed by bytes (#22577 )	2022-02-25 23:31:10 -08:00
Eric Liang	ae16aa1dba	Add some sanity checks for memory use in dataset (#22642 )	2022-02-25 16:59:12 -08:00
Simon Mo	4bf587f7ff	[Serve] make client poll more frequently (#22666 )	2022-02-25 14:56:18 -08:00
Stephanie Wang	0da541bb71	[core] Fix bug in fusion for spilled objects (#22571 ) Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, the current logic always spills once we reach the end of the spillable objects or once we've reached the fusion threshold. This can produce lots of unfused objects if they are created concurrently with the spill. This PR changes the spill logic: once we reach the end of the spillable objects, if the last batch of spilled objects is under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.	2022-02-25 13:24:05 -08:00
Sven Mika	7b687e6cd8	[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544 )	2022-02-25 21:58:16 +01:00
xwjiang2010	62b2c26041	[tune] increase timeout for ray_trial_executor_test. (#22658 )	2022-02-25 08:39:19 -08:00
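
The batching guarantee described in commit 3d63313265 (`pipe.iter_batches()` yields `len(pipe) // batch_size` full batches, followed by at most one partial batch) can be illustrated with a minimal sketch. This is not Ray's `batch_blocks()` implementation. The function name `batch_across_windows` and the use of plain Python lists to stand in for blocks and windows are assumptions made purely for illustration.

```python
from typing import Iterable, Iterator, List


def batch_across_windows(
    windows: Iterable[Iterable[List[int]]], batch_size: int
) -> Iterator[List[int]]:
    """Yield batches of `batch_size` rows, carrying leftovers across windows.

    Hypothetical stand-in: each window is an iterable of blocks, and each
    block is a plain list of rows.
    """
    buffer: List[int] = []
    for window in windows:
        for block in window:
            buffer.extend(block)
            # Emit as many full batches as the buffer currently holds.
            while len(buffer) >= batch_size:
                yield buffer[:batch_size]
                buffer = buffer[batch_size:]
    # Only the final batch may be partial.
    if buffer:
        yield buffer


# Two windows holding 10 rows in total; the window boundary falls after row 4.
windows = [[[1, 2, 3], [4]], [[5, 6], [7, 8, 9, 10]]]
batches = list(batch_across_windows(windows, batch_size=4))
```

Because leftover rows are buffered rather than flushed at each window boundary, a short batch can only appear at the very end, which is the property the commit message states.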

6177 commits
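
The `route_prefix` defaulting rule described in commit 521298e093 distinguishes an omitted key from an explicit `null`. A minimal sketch of that distinction follows. The helper `resolve_route_prefix`, the dict-shaped config, and the leading `/` on the default route are assumptions for illustration, not Serve's actual schema code.

```python
from typing import Any, Dict, Optional


def resolve_route_prefix(name: str, config: Dict[str, Any]) -> Optional[str]:
    """Return the effective route prefix for a deployment config.

    Hypothetical helper: an omitted key falls back to the deployment name,
    while an explicit None (YAML/JSON `null`) disables HTTP access.
    """
    if "route_prefix" not in config:
        # Omitted: default to the deployment's name (leading slash assumed).
        return f"/{name}"
    # Present: honor the value as given, including None.
    return config["route_prefix"]


omitted = resolve_route_prefix("hello", {})
disabled = resolve_route_prefix("hello", {"route_prefix": None})
explicit = resolve_route_prefix("hello", {"route_prefix": "/api"})
```

Checking for key presence rather than for a `None` value is what lets the two cases diverge. Treating them the same would collapse "use the default" and "no HTTP access" into a single behavior.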