
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Antoni Baum	4a15c6f8f3	[tune] Preparation for deadline schedulers (#22006 )	2022-02-22 11:05:28 -08:00
Matti Picus	dfe4706d73	re-remove unused opencv-python-headless (#22470 ) PR #16929 removed opencv-python-headless. PR #17158 added it back but did not use it. This was noted by [a reviewer](https://github.com/ray-project/ray/pull/17158#issuecomment-982976429) since it breaks python3.9 (no wheel is available for installation).	2022-02-22 09:45:30 -08:00
Gagandeep Singh	4de1886ad5	Unskipped tests in `test_object_spilling`, `test_object_spilling_2`, `test_get_locations` (#22208 ) Mostly cluster tests are enabled in this PR in the above mentioned files. Some non-cluster tests are also enabled. All of these pass on my machine without issues.	2022-02-22 09:41:26 -08:00
Sven Mika	6522935291	[RLlib] Slate-Q tf implementation and tests/benchmarks. (#22389 )	2022-02-22 09:36:44 +01:00
Jun Gong	2b6a0c71d7	[RLlib] Add a callback for when trainer finishes initialization: `on_trainer_init`. (#22493 )	2022-02-22 08:18:32 +01:00
Steven Morad	d4571741aa	[RLlib] `seq_lens` should always be torch tensors. (#22398 )	2022-02-22 08:15:43 +01:00
JYX	49d7ba3738	[RLlib] Fix typo in vector_env docstring (#22534 )	2022-02-22 08:13:50 +01:00
Daniel	308ccfe25c	[RLlib] DD-PPO move `train_batch_size==-1` check to __init__ (#22521 )	2022-02-21 11:44:12 +01:00
Guyang Song	902243fb03	[runtime env] support raylet sharing fate with agent (#22382 ) - Remove the agent restart feature. - Raylet shares fate with agent to make the failover logic easier. Refer to issue https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528	2022-02-21 18:16:21 +08:00
Guyang Song	5783cdb254	[runtime env] runtime env inheritance refactor (#22244 ) Runtime Environments is already GA as of Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). We already support an [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc): - The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment. - Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent. We think this merging logic is complex and confusing because users can't know the final runtime env before their jobs run. This PR refactors Runtime Environments inheritance and changes its behavior. Here is the new behavior: - If no runtime env option is given when creating an actor, inherit the parent runtime env. - Otherwise, use the specified runtime env directly, without merging. Add a new API named `ray.runtime_env.get_current_runtime_env()` that returns the parent runtime env as a dict you can modify yourself, for example: ```Actor.options(runtime_env={**ray.runtime_env.get_current_runtime_env(), "X": "Y"})``` This new API can also be used in Ray Client.	2022-02-21 18:13:22 +08:00
Gagandeep Singh	3cb85859cd	Unskipped tests for Windows (#21702 ) This set of tests passes without issues on Windows for me, so unskipping them here.	2022-02-20 11:48:59 -08:00
Max Pumperla	29d94a2211	[docs] sphinx gallery removal, migrate to ipynb (#22467 )	2022-02-19 01:19:07 -08:00
Clark Zinzow	76e8247d4d	[Datasets] Force local metadata resolution when unserializable `Partitioning` object provided. (#22477 )	2022-02-18 21:21:34 -08:00
Amog Kamsetty	04feea4afe	[rllib] Upper bound `gym` version (#22510 ) gym released 0.22 today, which breaks many of the RLlib tests and examples. Temporarily pin the gym version.	2022-02-18 17:39:22 -08:00
Jiajun Yao	6a17653ba7	API stability annotations for ray commands (#22420 ) Annotate ray commands that are intended to be public.	2022-02-18 17:13:36 -08:00
Guyang Song	57a94aae12	[runtime env][bugfix] Fix runtime env retry (#22495 ) - Bug: `error_message` is not cleared when the retry succeeds. This bug led to runtime env creation failing. - Add test case for this.	2022-02-18 17:09:06 -08:00
Archit Kulkarni	8c12e30f11	[Doc] Add actor max restarts default value to fault tolerance doc (#22481 )	2022-02-18 17:48:22 -06:00
Jiajun Yao	baa14d695a	Round robin during spread scheduling (#21303 ) - Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently. - Simple round robin for spread scheduling: this is just a starting implementation and can be optimized later. - Prefer not to spill back tasks that are waiting for args since the pull is already in progress.	2022-02-18 15:05:35 -08:00
mwtian	5a4c6d2e88	[Core] release GIL when running `parallel_memcopy()` / `memcpy()` during serializations (#22492 ) While investigating #22161, it was observed that the GIL is held for an extended amount of time (up to 1000s) with stack trace [1]. It is possible that either there are many iterations within `Pickle5Writer.write_to()` calling `ray::parallel_memcopy()`, or a few `ray::parallel_memcopy()` calls take a long time (less likely). Either way, `ray::parallel_memcopy()` or `std::memcpy()` should not hold the GIL.	2022-02-18 14:11:12 -08:00
Yi Cheng	95256181dd	[1][resource reporting] Remove redis based resource broadcasting. (#22463 ) This flag has been turned on by default for almost 4 months. Delete the old code so that when refactoring, we don't need to take care of the legacy code path.	2022-02-18 14:09:37 -08:00
Stephanie Wang	03a5589591	[core] Enable lineage reconstruction in CI (#21519 ) Enables lineage reconstruction in all CI and release tests.	2022-02-18 11:04:20 -08:00
Max Pumperla	9482f03134	[docs] RLlib concepts consolidation, user guide, RL conf prep (#22496 )	2022-02-18 09:35:20 -08:00
Jun Gong	04effca29c	[RLlib; docs] Update README.rst to fix the broken RLlib logo (#22489 )	2022-02-18 18:33:07 +01:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	1f160114a0	[serve] [CI] change serve:test_runtime_env from medium to large (#22474 ) This test was timing out occasionally.	2022-02-18 08:50:47 -06:00
ZhuSenlin	3341fae573	[Core] remove unused method GcsResourceManager::UpdateResourceCapacity (#22462 ) In the implementation of `GcsResourceManager::UpdateResourceCapacity`, 'cluster_scheduling_resources_' is modified, but this method is only used in C++ unit tests, which can easily cause confusion when reading the code. Since this method can be completely replaced by `GcsResourceManager::OnNodeAdd`, just remove it. Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-02-18 13:35:47 +08:00
Archit Kulkarni	df85d31095	[Serve] Make handle serializable (#22473 )	2022-02-17 17:29:44 -08:00
ZhuSenlin	15cccd0286	[Core] Fix null pointer crash when GcsResourceManager::SetAvailableResources (#22459 ) * fix null pointer crash when GcsResourceManager::SetAvailableResources * add warning log when node does not exist * add unit test Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-02-17 17:18:30 -08:00
Simon Mo	3e7511e84f	[CI] Disable privileged test (#22484 )	2022-02-17 15:34:02 -08:00
Chen Shen	17f589a05d	[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479	2022-02-17 15:20:39 -08:00
Ian Rodney	c9a4b17f99	[YAMLs] Fix comments about autoscaler round-robining (#22002 )	2022-02-17 13:59:05 -08:00
Sven Mika	c58cd90619	[RLlib] Enable Bandits to work in batches mode(s) (vector envs + multiple workers + train_batch_sizes > 1). (#22465 )	2022-02-17 22:32:26 +01:00
SangBin Cho	4ecb2afc2c	[State] Add pid to the actor table data. (#22434 ) Users have requested a way to get the pid of actors using ray.state.actors. This PR addresses that.	2022-02-17 06:22:29 -08:00
Avnish Narayan	740def0a13	[RLlib] Put env-checker on critical path. (#22191 )	2022-02-17 14:06:14 +01:00
Sven Mika	e03606f0b3	[RLlib] Bandit documentation enhancements. (#22427 )	2022-02-17 13:25:50 +01:00
Chen Shen	ab53848dfc	[refactor cluster-task-manage 4/n] refactor cluster_task_manager into distributed and local part (#21660 ) This is a work-in-progress PR that splits cluster_task_manager into local and distributed parts. For the distributed scheduler (cluster_task_manager_): Schedules a task onto one node of the cluster. The logic is as follows: 1. Queue tasks for scheduling. 2. Pick a node on the cluster which has the available resources to run a task. (Step 2 should occur any time the state of the cluster is changed, or a new task is queued.) 3. For tasks that are infeasible, put them into the infeasible queue and report them to the GCS, where the autoscaler will be notified and start a new node to accommodate the requirement. For the local task manager: It manages the lifetime of a task on the local node. It receives requests from cluster_task_manager (the distributed scheduler) and does the following steps: 1. Pull task dependencies and add the task to the to_dispatch queue. 2. Once the task's dependencies are all pulled locally, the task becomes ready to dispatch. 3. For all tasks that are dispatch-ready, we schedule them by acquiring local resources (including pinning the objects in memory and deducting cpu/gpu and other resources from the local resource manager), and start a worker. 4. If a task failed to acquire resources in step 3, we will try to spill it to a different remote node. 5. When a worker finishes executing its task(s), the requester will return it and we should release the resources in our view of the node's state. 6. If a task has been waiting for arguments for too long, it will also be spilled back to a different node.	2022-02-17 01:14:33 -08:00
Eric Liang	786c5759de	[data] Stage fusion optimizations, off by default (#22373 ) This PR adds the following stage fusion optimizations (off by default). In a later PR, I plan to enable this by default for DatasetPipelines. - Stage fusion: Whether to fuse compatible OneToOne stages. - Read stage fusion: Whether to fuse read stages into downstream OneToOne stages. This is accomplished by rewriting the read stage (LazyBlockList) into a transformation over a collection of read tasks (BlockList -> MapBatches(do_read)). - Shuffle stage fusion: Whether to fuse compatible OneToOne stages into shuffle stages that support specifying a map-side block UDF. Stages are considered compatible if their compute strategy is the same ("tasks" vs "actors"), and they have the same Ray remote args. Currently, the PR is ignoring the remote args of read tasks, but this will be fixed as a followup (I didn't want to change the read tasks default here).	2022-02-16 21:08:27 -08:00
Yi Cheng	e10a2fbcf9	[workflow] Move `test_basic_workflows_2.py` to large test (#22416 ) test_basic_workflows_2.py times out. Move it to the large test suite.	2022-02-16 17:05:02 -08:00
mwtian	05dd72101b	[Release 1.11.0] Release logs for 1.11.0rc1 (#22443 ) This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.	2022-02-16 17:03:49 -08:00
Yi Cheng	83257a4193	Revert "[Client] chunked get requests" (#22455 ) Reverts ray-project/ray#22100. linux://python/ray/tests:test_runtime_env_working_dir_remote_uri became very flaky after that PR.	2022-02-16 16:43:43 -08:00
Chen Shen	30ec0df9cc	[placement group] fix pg benchmark regression #22441 We added a warmup time in timeit, which affects the pg benchmark time accounting. Add an option to cancel the warmup.	2022-02-16 16:24:51 -08:00
Jun Gong	a9147bb62c	[Release Test] Fix AnyscaleSDK construction so we can run CI on staging instance. (#22325 )	2022-02-16 09:56:02 -08:00
SangBin Cho	42361a1801	[Test] Fix Dask on Ray 1 TB bug #22431 Fixes a bug. It seems that `not df` does not work with a dataframe.	2022-02-17 02:44:36 +09:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351 ) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can specify a) a different commit hash, b) a wheels URL (which we will also wait for to be available), or c) a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches the old default behavior). The main subpackages are: Cluster manager: creates cluster envs/computes, starts cluster, terminates cluster; Command runner: runs commands, e.g. as client command or sdk command; File manager: uploads/downloads files to/from session; Reporter: reports results (e.g. to database). Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00
SangBin Cho	2ed5bb7a5f	[Nightly Test] Addressed client failure properly (#22438 ) When the client returns a nonzero exit code, we should raise RuntimeError to properly propagate errors.	2022-02-16 09:03:17 -08:00
Archit Kulkarni	606e2b2cde	Update license for MLflow's conda utils and virtualenv-clone (#22402 ) When we vendor third-party code, we should update LICENSE file. Previously we vendored two pieces of code: - conda utilities from MLflow - virtualenv-clone But we only included the attribution in the relevant source files, not in our LICENSE file. This PR adds the necessary info to our LICENSE file.	2022-02-16 10:00:23 -06:00
Jun Gong	04dd536987	[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426 )	2022-02-16 13:06:09 +01:00
Hao Chen	f2bbcf5adc	Fix test_traceback incompatibility with pytest 6.x (#22375 ) Co-authored-by: Kai Yang <kfstorm@outlook.com>	2022-02-16 18:04:19 +08:00
Qing Wang	7c45d1a366	[doc][Java] Add doc page for java concurrency group. (#21600 ) Add a documentation page for the Java concurrency group. Co-authored-by: Kai Yang <kfstorm@outlook.com>	2022-02-16 17:57:03 +08:00
Eric Liang	92550500bc	Split workflow and dataset tests (#22415 )	2022-02-16 01:47:55 -08:00
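
The two inheritance rules described in commit 5783cdb254 (#22244) can be compared with a small standalone sketch. It uses plain dicts and does not import Ray; `inherit_old` and `inherit_new` are hypothetical helper names, and treating `None` as "no runtime env option" is an assumption made for illustration. Note that `dict.update` returns `None`, so an extended copy of the parent env has to be built as an expression such as `{**parent, "X": "Y"}`.

```python
from typing import Optional


def inherit_old(parent: dict, child: Optional[dict]) -> dict:
    """Old rule: merge env_vars, let the child override every other field."""
    if not child:
        return dict(parent)
    merged = {**parent, **child}  # per-field override by the child
    # Assumption: on a key conflict inside env_vars, the child's value wins.
    env_vars = {**parent.get("env_vars", {}), **child.get("env_vars", {})}
    if env_vars:
        merged["env_vars"] = env_vars
    return merged


def inherit_new(parent: dict, child: Optional[dict]) -> dict:
    """New rule: inherit only when the child specifies no runtime env."""
    if child is None:
        return dict(parent)
    return dict(child)  # used directly, no merging


parent = {"env_vars": {"A": "1"}, "py_modules": ["pkg_a"]}
child = {"env_vars": {"B": "2"}}

old_result = inherit_old(parent, child)
new_result = inherit_new(parent, child)

# Explicitly extending the parent env under the new rule.
extended = {**parent, "X": "Y"}
```

Under the new rule the child env is fully visible at the call site, which is the predictability the commit message argues for.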
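
Commit baa14d695a (#21303) describes "simple round robin for spread scheduling". The following is a minimal sketch of that idea only, not Ray's C++ implementation: the class name `RoundRobinSpread`, the single-CPU resource model, and the `pick` method are all hypothetical.

```python
from typing import Dict, List, Optional


class RoundRobinSpread:
    """Cycle through nodes, skipping those without enough free resources."""

    def __init__(self, node_ids: List[str]) -> None:
        self._nodes = list(node_ids)
        self._next = 0  # index of the node to try first on the next call

    def pick(self, available_cpus: Dict[str, float], required: float) -> Optional[str]:
        n = len(self._nodes)
        for offset in range(n):
            idx = (self._next + offset) % n
            node = self._nodes[idx]
            if available_cpus.get(node, 0.0) >= required:
                # Resume after the chosen node so tasks spread across nodes.
                self._next = (idx + 1) % n
                return node
        return None  # no node can currently run the task


scheduler = RoundRobinSpread(["a", "b", "c"])
free = {"a": 4.0, "b": 0.0, "c": 4.0}
picks = [scheduler.pick(free, 1.0) for _ in range(4)]
```

In this sketch node `b` has no free CPUs, so consecutive tasks alternate between `a` and `c`.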
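
The compatibility rule from commit 786c5759de (#22373), where stages fuse when their compute strategy and remote args match, can be illustrated with a toy sketch. The `Stage` dataclass, `compatible`, and `fuse` below are hypothetical names and do not reflect the actual Ray Datasets internals; only the fusion of adjacent OneToOne stages is modeled.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    name: str
    fn: Callable[[Any], Any]  # block -> block transformation
    compute: str = "tasks"    # "tasks" or "actors"
    remote_args: Dict[str, Any] = field(default_factory=dict)


def compatible(a: Stage, b: Stage) -> bool:
    """Same compute strategy and same remote args."""
    return a.compute == b.compute and a.remote_args == b.remote_args


def fuse(stages: List[Stage]) -> List[Stage]:
    """Fuse runs of adjacent compatible stages by composing their functions."""
    fused: List[Stage] = []
    for stage in stages:
        if fused and compatible(fused[-1], stage):
            prev = fused.pop()
            fused.append(Stage(
                name=f"{prev.name}->{stage.name}",
                # Bind f and g as defaults to avoid late-binding closures.
                fn=lambda block, f=prev.fn, g=stage.fn: g(f(block)),
                compute=prev.compute,
                remote_args=prev.remote_args,
            ))
        else:
            fused.append(stage)
    return fused


pipeline = [
    Stage("double", lambda block: [x * 2 for x in block]),
    Stage("inc", lambda block: [x + 1 for x in block]),
    Stage("gpu_map", lambda block: block, compute="actors"),
]
plan = fuse(pipeline)
```

Here the first two stages collapse into one, while the actor-based stage stays separate because its compute strategy differs.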


11519 commits