hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

Author	SHA1	Message	Date
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Stephanie Wang	d97afb9e60	[data] Pin pipeline executor actors to the driver node (#22715 ) DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure. This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.	2022-03-01 18:06:14 -08:00
Dmitri Gekhtman	4acbf36453	[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688 ) Closes #21353 and fixes an issue that causes dashboard to read K8s CPU requests rather than resources when determining CPUs available.	2022-03-01 17:15:59 -08:00
Eric Liang	06d4444b4a	Never re-use task workers for actors or GPU tasks (#22482 ) Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit. This adds a slight performance penalty for actor startup.	2022-03-01 16:46:18 -08:00
Eric Liang	e228544d39	Undo revert of windowing dataset by bytes (#22735 )	2022-03-01 12:24:04 -08:00
Archit Kulkarni	127b69bc21	[runtime env] Fix protobuf serialization/deserialization (#22672 ) This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath. The test checks that `to_dict` and `from_dict` are inverses. This PR contains all fixes required to make the test pass.	2022-03-01 12:34:50 -06:00
Kenneth	9b67cb5a6f	Add buffering to object spilling (#22618 ) This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude. ### Performance Changes A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling. \| Run \| Produce (s) \| Consume (s) \| Total (s) \| \| -- \| -- \| -- \| -- \| \| Baseline (original) \| 347.332281 \| 355.611272 \| 705.560750 \| \| Baseline (w/ fix) \| 181.815852 \| 347.692850 \| 532.847759 \| \| No fusing (original) \| 453.574554 \| 525.047998 \| 981.620108 \| \| No fusing (w/ fix) \| 452.614848 \| 519.787698 \| 975.412639 \| The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults give this microbenchmark a 48% reduction in produce time (about 24% in total time), with negligible impact on runtime when fusing is disabled. See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-01 10:13:10 -08:00
Eric Liang	482b0117e8	Basic log observability for spilling (#22612 )	2022-03-01 09:40:51 -08:00
Daniel	8d1f1b0a64	[RLlib] Update pettingzoo==1.15.0 supersuit==3.3.3 (#22519 )	2022-03-01 11:23:27 +01:00
Simon Mo	0bab8dbfe0	[Serve] Add test for controller managing Java Replica (#22628 )	2022-02-28 23:13:56 -08:00
Jian Xiao	aeb0a0dcbe	Add a static factory method to BlockBuilder to instantiate concrete builders (#22634 ) This is useful in combining multiple applied groups produced by groupby().map_groups() into a single one. For example, builder = BlockBuilder.for_block(type(batch)), and then for each applied group, builder.add_block(applied_group).	2022-02-28 19:00:24 -08:00
Simon Mo	00935275ae	[Serve] Autoscaling: basic intelligent scale down (#22669 )	2022-02-28 20:46:06 -06:00
shrekris-anyscale	49ee443231	[serve] Add Serve CLI commands for REST API (#22648 )	2022-02-28 20:45:46 -06:00
Jian Xiao	7597f1590b	[Dataset] fix some comments (#22700 )	2022-02-28 17:13:43 -08:00
Chris K. W	fa6b3c7c89	[aws][autoscaler] fix regional default AMI's (#22506 ) The AMI's for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults get merged in before _check_ami is called, causes problems if region isn't us-west-2). Removes the default AMI from defaults.yaml, and aborts if user doesn't specify an AMI in a region without a default.	2022-02-28 15:52:57 -08:00
Clark Zinzow	cf3577f0ee	[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665 )	2022-02-28 15:15:30 -08:00
Simon Mo	fe3d501d68	[Core] Include java worker log with log monitor (#22629 )	2022-02-28 12:30:04 -08:00
SangBin Cho	ba4f1423c7	Revert "Support creating a DatasetPipeline windowed by bytes (#22577 )" (#22695 ) This reverts commit `b5b4460932`.	2022-02-28 11:56:12 -08:00
Jiaxin Shan	82daf2b041	[KubeRay] Remove configmap reference in example (#22688 ) A follow-up to #22348: the example is out of date, and the cluster cannot be brought up because of the missing configmap. The autoscaler can convert the CR to an autoscaler config, so the configmap is no longer needed.	2022-02-28 10:13:08 -08:00
SangBin Cho	08374e8af4	Revert "[core] Fix bug in fusion for spilled objects (#22571 )" (#22694 ) Makes 2 tests flaky	2022-02-28 10:11:14 -08:00
Kai Fricke	e84e967932	[ml] Add basic Ray ML interfaces (#22436 ) This PR adds the basic shared Ray ML interfaces.	2022-02-28 13:16:40 +01:00
Jialing He	aa1885ae2a	[runtime env] Make plugin setup process that has not been refactored run in threads. (#22588 ) I recently realized that during runtime_env creation, a plugin/manager that is very slow to set up may block the creation of other runtime_envs, so I made plugin/manager setup run in threads. [The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.	2022-02-28 17:33:13 +08:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
mopga	6f68c74a5d	Use GPUtil for gpu detection when available (#18938 ) In environments with K8s and SELinux enabled, there is a bug: "/proc/nvidia/" is not allowed to be mounted in the container. So I reworked GPU detection to use the GPUtil package. ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Release tests Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru> Co-authored-by: Julius <juliustfrost@gmail.com>	2022-02-27 14:54:35 -08:00
Max Pumperla	372c620f58	[docs] Tune overhaul part II (#22656 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-02-26 23:07:34 -08:00
Jiao	25d60d9cc9	[3/X][Pipeline] Handle deployment handle replacement in DeploymentNode init args, support nested (#22646 ) - Moved all `Deployment` instance creation to the `DeploymentNode` level, with only relevant info passed into it from `generate.py`. This abstraction makes more sense and is less leaky. - In `DeploymentNode`, we leverage ray core DAG's `_PyObjScanner` to find Deployment nodes in init args & kwargs and replace them with deployment handles. This is specific to the `Deployment` instance, not `DeploymentNode` itself, but it is the simplest and most robust way to handle nested args at the `DAGNode` level. - This implementation lives at the ray core DAGNode level, so we don't need to expose `_PyObjScanner` directly. - Added serve pipeline tests to BUILD CI.	2022-02-26 09:57:59 -06:00
Eric Liang	b5b4460932	Support creating a DatasetPipeline windowed by bytes (#22577 )	2022-02-25 23:31:10 -08:00
Eric Liang	ae16aa1dba	Add some sanity checks for memory use in dataset (#22642 )	2022-02-25 16:59:12 -08:00
Simon Mo	4bf587f7ff	[Serve] make client poll more frequently (#22666 )	2022-02-25 14:56:18 -08:00
Stephanie Wang	0da541bb71	[core] Fix bug in fusion for spilled objects (#22571 ) Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, the current logic always spills once we reach the end of the spillable objects or once we've reached the fusion threshold. This can produce lots of unfused objects if they are created concurrently with the spill. This PR changes the spill logic: once we reach the end of the spillable objects, if the last batch of spilled objects is under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.	2022-02-25 13:24:05 -08:00
Sven Mika	7b687e6cd8	[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544 )	2022-02-25 21:58:16 +01:00
xwjiang2010	62b2c26041	[tune] increase timeout for ray_trial_executor_test. (#22658 )	2022-02-25 08:39:19 -08:00
Antoni Baum	d5284a740c	[tune] Remove `Trainable.update_resources` (#22471 )	2022-02-25 08:38:34 -08:00
xwjiang2010	d4a1bc7bc7	Revert "[runtime env] runtime env inheritance refactor (#22244 )" (#22626 ) Breaks train_torch_linear_test.py.	2022-02-25 08:42:30 -06:00
shrekris-anyscale	e85540a1a2	[serve] Expose deployment statuses in REST API (#22611 )	2022-02-25 08:41:07 -06:00
Dmitri Gekhtman	b2b442297e	[autoscaler] Fix initialization artifacts (#22570 ) This PR fixes initialization artifacts related to the load metric summary and autoscaler summary. Load metrics summaries are defined to be Falsey if the autoscaler has never received a resource message from the GCS. We skip most autoscaler actions if load metrics is Falsey, because it doesn't make sense to autoscale without load metrics. This also allows us to execute the TODO here: #22348 (comment) and remove the time.wait(). As for the autoscaler summary, it is possible for autoscaler.summary() to error outside of an autoscaler update in this scenario: The very first call to NodeProvider.non_terminated_nodes fails, self.non_terminated_nodes remains a None object, and autoscaler.summary() fails trying to get an attribute of this None object. The result is a confusing error message, as in #22515. This PR fixes that. Closes #22515	2022-02-24 20:05:44 -08:00
Simon Mo	bfb619a127	[xlang] Allow Python to call overloaded methods with differing number of parameters (#21410 )	2022-02-24 16:51:38 -08:00
Jiao	3c707f70cc	[2/X][Pipeline] Add python generation for ClassNode (#22617 ) - Added backbone of ray dag -> serve dag transformation and deployment extraction. - Added util functions for deployment unique name generation .. ray_actor_options, replacement of DeploymentNode with deployment handle, etc.	2022-02-24 16:01:35 -06:00
Eric Liang	533a0440a6	Improve actor pool support in Datasets (#22574 )	2022-02-24 12:01:36 -08:00
Amog Kamsetty	02cb974c6c	[Train] Fix fault tolerance for Tensorflow (#22508 ) Soft restarts don't work for tensorflow since there is still some leftover communication state in the actors which may lead to undefined behavior, such as causing training to hang. Instead, this PR changes the failure handling for tensorflow to match torch and horovod, and recreates all the workers in case of failure. Also adds a test to check if fault tolerance works correctly for an actual tensorflow example. When testing locally, the test failed before the change, but passes after.	2022-02-24 11:50:20 -08:00
Simon Mo	b8c28d1f2b	[Tune] Make sure tune.run can run inside worker thread (#22566 )	2022-02-24 09:50:42 -08:00
shrekris-anyscale	a9ede4e499	[serve] Add REST API (#22578 ) This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.	2022-02-24 10:00:26 -06:00
Liu Bao	6a9a28612c	[runtime env] Async pip runtime env (#22381 ) In order to initialize runtime env concurrently, this PR makes pip runtime env asynchronous. It includes, - [x] New `check_output_cmd` in runtime env utils. - [x] Async PipProcessor. - [x] The `asynccontextmanager` from `https://github.com/python-trio/async_generator` for Python 3.6 - [x] Remove pip runtime env lock. - [x] Disable pip cache. Co-authored-by: 刘宝 <po.lb@antfin.com>	2022-02-24 11:03:40 +08:00
Eric Liang	e15a419028	Enable stage fusion by default for dataset pipelines (#22476 ) This PR enables stage fusion for dataset pipelines. This also requires: 1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage. 2. Removing spread_resource_prefix (not supported for now).	2022-02-23 17:34:05 -08:00
Eric Liang	a62a9c38fb	Fix [Bug] Splitting Dataset when shards < n hangs or errors (#22559 )	2022-02-23 15:54:25 -08:00
Edward Oakes	5a21289a34	[runtime_env] Remove get_current_runtime_env from docs (#22594 ) We should just encourage people to use the existing `get_runtime_context` API instead of introducing a new one here. Just removing the docs for now while we discuss this.	2022-02-23 16:53:52 -06:00
Eric Liang	fc75d17701	Fix [Bug] DatasetPipeline .iter_epochs() can lead to infinite loops (#22572 )	2022-02-23 13:35:31 -08:00
Siyuan (Ryans) Zhuang	f6f0fea102	Symlink workflow for development (#22554 )	2022-02-23 13:14:05 -08:00
Siyuan (Ryans) Zhuang	2e0186a5b6	[workflow] Checkpoint API (#19406 ) checkpoint API * ensure commit_step only do checkpointing	2022-02-23 13:09:08 -08:00
Chris K. W	3371e78d2e	[client] Chunk PutRequests (#22327 ) Why are these changes needed? Data from PutRequests is chunked into 64MiB messages over the datastream, to avoid the 2GiB message size limit from gRPC. This will allow users to transfer objects larger than 2GiB over the network. Proto changes: Put requests now have fields for chunk_id to identify which chunk the data belongs to, total_chunks to identify the total number of chunks in the object, and total_size for the total size in bytes of the object (useful for raising warnings). PutObject is still unary-unary. The dataservicer handles reassembling the chunks before passing the result to the underlying RayletServicer. Dataclient changes: If a put request is inserted into the request queue, self._requests will chunk it lazily. Doing this lazily is important since inserting all of the chunks onto the request queue immediately would double the amount of memory needed to handle a large request. This also guarantees that the chunks of a given putrequest will be contiguous. Dataservicer changes: The dataservicer now maintains some state to track received chunks. Once all chunks for a putrequest are received, the combined chunks are passed to the raylet servicer.	2022-02-23 18:21:25 +02:00
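
The chunking scheme described in the last entry (3371e78d2e) can be sketched roughly as follows. This is a minimal illustration, not the Ray client code: `PutChunk`, `chunk_put_request`, and `ChunkAssembler` are hypothetical names, and only the field names `chunk_id`, `total_chunks`, `total_size` and the 64 MiB chunk size come from the commit message. The generator produces chunks lazily, so the sender never holds a second full copy of the payload, and the receiver keeps a small amount of state until every chunk has arrived.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

CHUNK_SIZE = 64 * 2**20  # 64 MiB, the chunk size named in the commit message


@dataclass
class PutChunk:
    """Hypothetical stand-in for one chunked put message."""
    chunk_id: int      # index of this chunk within the object
    total_chunks: int  # number of chunks the object was split into
    total_size: int    # size of the whole object in bytes
    data: bytes


def chunk_put_request(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[PutChunk]:
    """Lazily yield chunks of `data`, one at a time and in order."""
    total_size = len(data)
    # Ceiling division; an empty payload still produces a single empty chunk.
    total_chunks = max(1, -(-total_size // chunk_size))
    view = memoryview(data)  # slicing a memoryview avoids copying until bytes()
    for chunk_id in range(total_chunks):
        start = chunk_id * chunk_size
        yield PutChunk(chunk_id, total_chunks, total_size,
                       bytes(view[start:start + chunk_size]))


class ChunkAssembler:
    """Receiver-side state: collect contiguous chunks, return the object when complete."""

    def __init__(self) -> None:
        self._parts: List[bytes] = []

    def add(self, chunk: PutChunk) -> Optional[bytes]:
        # Chunks of one request are assumed to arrive contiguously and in order.
        if chunk.chunk_id != len(self._parts):
            raise ValueError(f"unexpected chunk {chunk.chunk_id}")
        self._parts.append(chunk.data)
        if len(self._parts) < chunk.total_chunks:
            return None
        payload = b"".join(self._parts)
        self._parts = []
        if len(payload) != chunk.total_size:
            raise ValueError("reassembled size does not match total_size")
        return payload
```

A sender would iterate `chunk_put_request(payload)` and transmit each chunk as it is produced; the receiver calls `ChunkAssembler.add` per message and forwards the result once it is not `None`.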

6159 commits