hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	a9bf5e9e2f	[ci] Update GPU docker image to Ubuntu 20.04 (#22759 ) This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.	2022-03-02 10:28:26 +01:00
Max Pumperla	7d4296c72f	run code in browser (#22727 ) Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand. If this seems useful we can extend this to other examples gradually. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-02 10:27:00 +01:00
Chen Shen	3e3db8e9cd	[scheduler] hide StringIDMap under BaseSchedulingID (#22722 ) * add * address comments	2022-03-01 22:50:53 -08:00
Yi Cheng	271ed44143	[2][resource reporting] Encapsulate poller and broadcaster into syncer in gcs (#22464 ) This PR moves the poller and broadcaster from the GCS server to the Ray syncer. TODO in next PR: deprecate the code path of placement group resource reporting and move the broadcaster out of the GCS cluster resource manager.	2022-03-01 21:51:14 -08:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Stephanie Wang	d97afb9e60	[data] Pin pipeline executor actors to the driver node (#22715 ) DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure. This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.	2022-03-01 18:06:14 -08:00
Dmitri Gekhtman	4acbf36453	[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688 ) Closes #21353 and fixes an issue that causes dashboard to read K8s CPU requests rather than resources when determining CPUs available.	2022-03-01 17:15:59 -08:00
Eric Liang	06d4444b4a	Never re-use task workers for actors or GPU tasks (#22482 ) Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit. This adds a slight performance penalty for actor startup.	2022-03-01 16:46:18 -08:00
Eric Liang	5a0b7a7ee0	Document Dataset pipeline stage fusion (#22737 )	2022-03-01 14:38:09 -08:00
Eric Liang	1a170f7234	[RFC] Disable actor queueing warning for concurrent actors (#22720 ) The warning was not implemented properly for out of order actors. Disable it for now.	2022-03-01 14:28:19 -08:00
Sven Mika	0af100ffae	[RLlib] Fix tree.flatten dict ordering bug: `flatten_space([obs_space])` should produce same struct as `tree.flatten([obs])`. (#22731 )	2022-03-01 21:24:24 +01:00
Eric Liang	e228544d39	Undo revert of windowing dataset by bytes (#22735 )	2022-03-01 12:24:04 -08:00
Archit Kulkarni	127b69bc21	[runtime env] Fix protobuf serialization/deserialization (#22672 ) This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath. The test checks that `to_dict` and `from_dict` are inverses. This PR contains all fixes required to make the test pass.	2022-03-01 12:34:50 -06:00
Kenneth	9b67cb5a6f	Add buffering to object spilling (#22618 ) This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude. ### Performance Changes A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling. \| Run \| Produce (s) \| Consume (s) \| Total (s) \| \| -- \| -- \| -- \| -- \| \| Baseline (original) \| 347.332281 \| 355.611272 \| 705.560750 \| \| Baseline (w/ fix) \| 181.815852 \| 347.692850 \| 532.847759 \| \| No fusing (original) \| 453.574554 \| 525.047998 \| 981.620108 \| \| No fusing (w/ fix) \| 452.614848 \| 519.787698 \| 975.412639 \| The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to have a 48% time reduction with negligible impact on runtime when fusing is disabled. See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-01 10:13:10 -08:00
Eric Liang	482b0117e8	Basic log observability for spilling (#22612 )	2022-03-01 09:40:51 -08:00
Edward Oakes	2a09561edf	[serve] Enable REST API tests with main clause (#22706 ) Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>	2022-03-01 11:21:22 -06:00
Sven Mika	e50bd212a1	[RLlib] Disable flakey Pendulum-v1 tests (until further investigation). (#22686 )	2022-03-01 16:44:17 +01:00
Daniel	8d1f1b0a64	[RLlib] Update pettingzoo==1.15.0 supersuit==3.3.3 (#22519 )	2022-03-01 11:23:27 +01:00
simonsays1980	568cf28dd4	[RLlib] Example script `custom_metrics_and_callbacks.py` should work for `batch_mode=complete_episodes`. (#22684 )	2022-03-01 09:00:38 +01:00
Jun Gong	e8be45065e	[RLlib] Restore policies on `eval_workers` as well. (#22641 )	2022-03-01 08:38:14 +01:00
Simon Mo	0bab8dbfe0	[Serve] Add test for controller managing Java Replica (#22628 )	2022-02-28 23:13:56 -08:00
Kai Fricke	d06c3ffd6f	[release] Migrate Tune + XGBoost tests to new infrastructure (#22705 ) Migrate XGBoost and Tune tests to new release testing infrastructure. https://buildkite.com/ray-project/release-tests-branch/builds/50	2022-03-01 08:10:06 +01:00
Chen Shen	7b22d662df	[clean up ClusterResourceScheduler 2/n] Introduce random policy in the scheduling policy #22712	2022-02-28 20:38:55 -08:00
Chen Shen	dfcb0f5de5	[clean up ClusterResourceScheduler 1/n] move IsSchedulable logic into ClusterResourceManager #22711	2022-02-28 20:37:56 -08:00
Jian Xiao	aeb0a0dcbe	Add a static factory method to BlockBuilder to instantiate concrete builders (#22634 ) This is useful in combining multiple applied groups produced by groupby().map_groups() into a single one. For example, builder = BlockBuilder.for_block(type(batch)), and then for each applied group, builder.add_block(applied_group).	2022-02-28 19:00:24 -08:00
Simon Mo	00935275ae	[Serve] Autoscaling: basic intelligent scale down (#22669 )	2022-02-28 20:46:06 -06:00
shrekris-anyscale	49ee443231	[serve] Add Serve CLI commands for REST API (#22648 )	2022-02-28 20:45:46 -06:00
Stephanie Wang	73f078236f	[doc] Update docs about actor garbage collection (#20763 ) Update outdated actor docs about when actors are GCed.	2022-02-28 18:45:29 -08:00
Jian Xiao	7597f1590b	[Dataset] fix some comments (#22700 )	2022-02-28 17:13:43 -08:00
Jiaxin Shan	32829ff9ad	[KubeRay] Provide a new Dockerfile for fast build (#22689 ) Adds a new Dockerfile for fast build and development of KubeRay.	2022-02-28 17:09:16 -08:00
Archit Kulkarni	85657b1377	[Doc] [Jobs] add CLI and SDK reference to docs (#22680 )	2022-02-28 17:57:46 -06:00
Chris K. W	fa6b3c7c89	[aws][autoscaler] fix regional default AMIs (#22506 ) The AMIs for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults get merged in before _check_ami is called, which causes problems if the region isn't us-west-2). Removes the default AMI from defaults.yaml, and aborts if the user doesn't specify an AMI in a region without a default.	2022-02-28 15:52:57 -08:00
jon-chuang	3bc0858a4f	[Core/GCS] remove default 100 concurrent rate limit for heartbeat (#22613 ) better scalability Closes https://github.com/ray-project/ray/issues/20773	2022-02-28 15:26:05 -08:00
SangBin Cho	2c1184592e	mark threaded actor test unstable (#22696 )	2022-02-28 15:25:14 -08:00
Clark Zinzow	cf3577f0ee	[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665 )	2022-02-28 15:15:30 -08:00
Chen Shen	7e90700521	[Dataset][nightly-test] promote data ingestion test to stable #22702	2022-02-28 14:00:18 -08:00
Simon Mo	fe3d501d68	[Core] Include java worker log with log monitor (#22629 )	2022-02-28 12:30:04 -08:00
Kai Fricke	3695408a85	[release] Fix special cases in release test package (e.g. smoke test) (#22442 ) Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.	2022-02-28 21:05:01 +01:00
SangBin Cho	ba4f1423c7	Revert "Support creating a DatasetPipeline windowed by bytes (#22577 )" (#22695 ) This reverts commit `b5b4460932`.	2022-02-28 11:56:12 -08:00
Jiaxin Shan	82daf2b041	[KubeRay] Remove configmap reference in example (#22688 ) A follow-up to #22348. The example is not up to date, and the cluster cannot be brought up because of the missing configmap. The autoscaler is able to convert the CR to an autoscaler config, so the configmap is no longer needed.	2022-02-28 10:13:08 -08:00
SangBin Cho	08374e8af4	Revert "[core] Fix bug in fusion for spilled objects (#22571 )" (#22694 ) Makes 2 tests flaky	2022-02-28 10:11:14 -08:00
Kai Fricke	e84e967932	[ml] Add basic Ray ML interfaces (#22436 ) This PR adds the basic shared Ray ML interfaces.	2022-02-28 13:16:40 +01:00
Jialing He	aa1885ae2a	[runtime env] Make plugin setup processes that have not been refactored run in threads. (#22588 ) I recently realized that during runtime_env creation, a plugin/manager that is very slow to set up may block the creation of other runtime_envs, so I made plugin/manager setup run in threads. [The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.	2022-02-28 17:33:13 +08:00
Jun Gong	22bc451102	[RLlib] Fix a memory leak in SimpleReplyBuffer that completely kills sampling throughput (#22678 )	2022-02-28 09:28:04 +01:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
Qing Wang	9572bb717f	[RuntimeEnv] Support setting actor level env vars for Java worker (#22240 ) This PR supports setting actor level env vars for Java worker in runtime env. General API looks like: ```java RuntimeEnv runtimeEnv = new RuntimeEnv.Builder() .addEnvVar("KEY1", "A") .addEnvVar("KEY2", "B") .addEnvVar("KEY1", "C") // This overwrites "KEY1" to "C" .build(); ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote(); ``` If `num-java-workers-per-process` > 1, it will never reuse the worker process unless they have the same runtime envs. Co-authored-by: Qing Wang <jovany.wq@antgroup.com>	2022-02-28 10:58:37 +08:00
Lingxuan Zuo	94caac8722	Remove exporting symbols (#22623 ) To hide the symbols of third-party libraries, this pull request reuses the internal namespace, which can be imported by any external native project without side effects. In addition, we suggest that all contributors use third-party libraries only within Ray scopes/namespaces, and that only ray::internal be exported. More details in https://github.com/ray-project/ray/pull/22526 Mobius has applied this change in https://github.com/ray-project/mobius/pull/28. Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>	2022-02-28 09:41:10 +08:00
mopga	6f68c74a5d	Use GPUtil for gpu detection when available (#18938 ) In environments with K8s and SELinux enabled there is a bug: "/proc/nvidia/" is not allowed to be mounted in the container. So I reworked GPU detection to be based on the GPUtil package. ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Release tests Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru> Co-authored-by: Julius <juliustfrost@gmail.com>	2022-02-27 14:54:35 -08:00
Max Pumperla	372c620f58	[docs] Tune overhaul part II (#22656 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-02-26 23:07:34 -08:00
Jiao	25d60d9cc9	[3/X][Pipeline] Handle deployment handle replacement in DeploymentNode init args, support nested (#22646 ) - Moved all `Deployment` instance creation to the `DeploymentNode` level, with only relevant info passed into it from `generate.py`. This abstraction makes more sense and is less leaky. - In `DeploymentNode`, we leverage ray core DAG's `_PyObjScanner` to find and replace only Deployment nodes in init args & kwargs with deployment handles, which is specific to the `Deployment` instance, not `DeploymentNode` itself. However, this is the simplest and most robust way to handle nested args at the `DAGNode` level. - This implementation lives at the ray core DAGNode level, so we don't need to expose `_PyObjScanner` directly. - Added serve pipeline tests to BUILD CI.	2022-02-26 09:57:59 -06:00

11633 commits