This is useful for combining multiple applied groups produced by `groupby().map_groups()` into a single block: create a builder with `BlockBuilder.for_block(type(batch))`, then call `builder.add_block(applied_group)` for each applied group.
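A minimal sketch of that pattern, following the prose above (BlockBuilder is a Ray Data internal, so the import is omitted; the final `builder.build()` call is an assumption):

```python
# Combine the per-group outputs of groupby().map_groups() into one block.
# `batch` is assumed to be a representative block/batch of the target type.
builder = BlockBuilder.for_block(type(batch))
for applied_group in applied_groups:   # applied_groups: the map_groups() outputs
    builder.add_block(applied_group)
combined_block = builder.build()       # assumption: the builder exposes build()
```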
The AMIs for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults are merged in before _check_ami is called, which causes problems when the region isn't us-west-2). This removes the default AMI from defaults.yaml and aborts if the user doesn't specify an AMI in a region without a default.
A follow-up change to #22348.
The example is not up to date, and we cannot bring up the cluster due to a missing ConfigMap. The autoscaler is able to convert the CR to an autoscaler config, so we no longer need the ConfigMap.
I recently realized that during runtime_env creation, a plugin/manager that is very slow to set up can block the creation of other runtime_envs, so this change makes plugin/manager setup run in threads.
[The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so it is left out of this PR.
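A minimal sketch of the threading approach, assuming hypothetical plugin objects with a blocking `setup()` method (not the actual runtime_env agent code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Shared pool so one slow plugin setup cannot block the event loop or other
# runtime_env creations.
_setup_pool = ThreadPoolExecutor(max_workers=8)

async def create_runtime_env(runtime_env, plugins):
    loop = asyncio.get_running_loop()
    # Run each plugin's blocking setup in a worker thread and wait for all of
    # them; other create_runtime_env() calls can make progress meanwhile.
    await asyncio.gather(
        *(loop.run_in_executor(_setup_pool, plugin.setup, runtime_env)
          for plugin in plugins)
    )
```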
Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime_env.RuntimeEnv`, details: #21495
- The new `RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and the old `RuntimeEnv`.
- The new `RuntimeEnv` is exposed directly to the user.
- Example:
```python
runtime_env = ray.runtime_env.RuntimeEnv(
    working_dir="s3://working_dir.zip",
    pip=["requests"],
    java_jars=["s3://jar1.zip"],
    java_jvm_options=["-Dxxx=xxx"],
)
```
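The resulting object can then be passed anywhere a runtime_env dict is accepted today, e.g. (usage sketch, assuming the new class is accepted by the standard decorator):

```python
@ray.remote(runtime_env=runtime_env)
def f():
    pass
```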
This PR supports setting actor-level env vars for Java workers in the runtime env.
General API looks like:
```java
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
.addEnvVar("KEY1", "A")
.addEnvVar("KEY2", "B")
.addEnvVar("KEY1", "C") // This overwrites "KEY1" to "C"
.build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
```
If `num-java-workers-per-process` > 1, a worker process will only be reused by actors that have the same runtime env.
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
To hide symbols of third-party libraries, this pull request reuses the `internal` namespace, which can be imported by any external native project without side effects.
In addition, we suggest that all contributors use third-party libraries inside Ray scopes/namespaces, and that only `ray::internal` be exported.
More details in https://github.com/ray-project/ray/pull/22526
Mobius has applied this change in https://github.com/ray-project/mobius/pull/28.
Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>
In environments with Kubernetes and SELinux enabled, there is a bug:
"/proc/nvidia/" is not allowed to be mounted in the container.
So I reworked GPU detection to be based on the GPUtil package.
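A minimal sketch of GPU detection via GPUtil instead of reading `/proc/nvidia` (illustrative only, assuming the `GPUtil` package is installed; not the exact code added by this PR):

```python
import GPUtil

def detect_num_gpus() -> int:
    """Count visible NVIDIA GPUs without touching /proc/nvidia."""
    try:
        # GPUtil shells out to nvidia-smi under the hood.
        return len(GPUtil.getGPUs())
    except Exception:
        # nvidia-smi missing or failed: report no GPUs rather than crash.
        return 0
```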
Co-authored-by: Mopga <a14415641@cab-wsm-0010669.sigma.sbrf.ru>
Co-authored-by: Julius <juliustfrost@gmail.com>
- Moved all `Deployment` instance creation to the `DeploymentNode` level, with only the relevant info passed into it from `generate.py`. This abstraction makes more sense and is less leaky.
- In `DeploymentNode`, we leverage Ray Core DAG's `_PyObjScanner` to find and replace only `Deployment` init args & kwargs with deployment handles; this is specific to the `Deployment` instance, not to `DeploymentNode` itself. Still, it is the simplest and most robust way to handle nested args at the `DAGNode` level (see the sketch after this list).
- This implementation lives at the Ray Core `DAGNode` level, so we don't need to expose `_PyObjScanner` directly.
- Added serve pipeline tests to BUILD CI.
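An illustrative sketch of the replacement idea from the second bullet (the real code uses Ray Core's `_PyObjScanner`, whose API is not reproduced here; the classes below are hypothetical stand-ins):

```python
from typing import Any

class Deployment:          # stand-in for Serve's Deployment (illustrative only)
    def __init__(self, name: str):
        self.name = name

class DeploymentHandle:    # stand-in for a deployment handle (illustrative only)
    def __init__(self, name: str):
        self.name = name

def replace_deployments(obj: Any) -> Any:
    """Recursively swap Deployment instances nested in args/kwargs for handles."""
    if isinstance(obj, Deployment):
        return DeploymentHandle(obj.name)
    if isinstance(obj, (list, tuple)):
        return type(obj)(replace_deployments(o) for o in obj)
    if isinstance(obj, dict):
        return {k: replace_deployments(v) for k, v in obj.items()}
    return obj

init_args = (Deployment("downstream"), {"dep": Deployment("other")}, 42)
print(replace_deployments(init_args))  # Deployments replaced; everything else untouched
```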
Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues.
However, the current logic always spills once we reach the end of the spillable objects or once we've reached the fusion threshold. This can produce lots of unfused objects if they are created concurrently with the spill.
This PR changes the spill logic: once we reach the end of the spillable objects, if the last batch of spilled objects is under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.
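A minimal sketch of the new decision at the end of the spillable-object list (hypothetical names; the real logic lives in the C++ local object manager):

```python
def should_spill_last_batch(batch_bytes: int,
                            fusion_threshold_bytes: int,
                            num_spills_in_flight: int) -> bool:
    """Decide whether to spill the final, possibly under-sized batch now."""
    if batch_bytes >= fusion_threshold_bytes:
        return True   # Big enough to fuse; always spill.
    # Under the threshold: only spill if no other spill is pending, so pending
    # spills get time to finish and we can re-evaluate afterwards. Spilling
    # when nothing else is in flight preserves liveness.
    return num_spills_in_flight == 0
```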
`test_job_submission` has become [flaky](https://flakey-tests.ray.io/) due to timeouts. This change increases the timeout in `test_failed_job_status` from 10 to 25 seconds.
Currently object transfers assume that the object size is fixed. This is a bad assumption during failures, especially with lineage reconstruction enabled and tasks with nondeterministic outputs.
This PR cleans up the handling and hopefully guards against two cases where the object size may change during a transfer:
1. The object manager's size information does not match the object in the local plasma store (due to async notifications). --> the object manager overwrites its own information if it finds that the physical object has a different size.
2. The receiver's created buffer size does not match the sender's object size. --> the receiver destroys the previous buffer and creates a new buffer with the correct size. This might cause some transient errors but eventually object transfer should succeed.
Unfortunately I couldn't trigger this from Python because it depends on some pretty specific timing conditions. However, I did add some unit tests for case 2 (this is the majority of the PR).
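An illustrative sketch of the case-2 handling (hypothetical names; the actual fix is in the C++ object manager):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReceiveState:
    buffer: Optional[bytearray] = None

def ensure_receive_buffer(state: ReceiveState, reported_size: int) -> bytearray:
    """Recreate the receive buffer when the sender's reported object size changes."""
    if state.buffer is not None and len(state.buffer) != reported_size:
        state.buffer = None                        # Stale buffer from the old size.
    if state.buffer is None:
        state.buffer = bytearray(reported_size)    # The transfer restarts into this.
    return state.buffer
```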
This PR fixes initialization artifacts related to the load metrics summary and the autoscaler summary.
Load metrics summaries are defined to be Falsey if the autoscaler has never received a resource message from the GCS.
We skip most autoscaler actions if load metrics is Falsey, because it doesn't make sense to autoscale without load metrics. This also allows us to execute the TODO here: #22348 (comment) and remove the time.wait().
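A minimal sketch of that convention (hypothetical class; an assumption, not the actual LoadMetrics code):

```python
class LoadMetricsSketch:
    """Falsey until the first resource message arrives from the GCS."""

    def __init__(self):
        self._received_resource_message = False

    def update(self, resource_message):
        self._received_resource_message = True
        # ... record utilization from the message ...

    def __bool__(self):
        return self._received_resource_message

def autoscaler_update(load_metrics):
    if not load_metrics:
        # No resource data from the GCS yet; skip scaling decisions entirely.
        return
    # ... normal autoscaling logic ...
```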
As for the autoscaler summary, it is possible for autoscaler.summary() to error outside of an autoscaler update in this scenario:
the very first call to NodeProvider.non_terminated_nodes fails, self.non_terminated_nodes remains None, and autoscaler.summary() fails trying to get an attribute of that None object.
The result is a confusing error message, as in #22515. This PR fixes that.
Closes #22515