hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
qicosmos	efef38f240	[C++ Worker] Add basic ref counting test cases (#17768 )	2021-10-29 11:22:19 +08:00
Jiajun Yao	760878f950	Handle empty dataset for sort and groupby (#19849 )	2021-10-28 18:49:33 -07:00
Simon Mo	0433281ec8	[CI] Bump Serve test_regression to medium for windows (#19844 )	2021-10-28 17:49:50 -07:00
Philipp Moritz	0633ae45e9	[Documentation] Remove note about windows wheels needing dev runtime (#19847 )	2021-10-28 16:59:58 -07:00
Kai Fricke	564d8551ed	[ci/release] only check alert if test succeeded before (#19857 )	2021-10-28 16:09:10 -07:00
Edward Oakes	42ac906313	[job submission] Support passing metadata to the JobConfig (#19845 )	2021-10-28 16:40:03 -05:00
SangBin Cho	9126810c41	[Usabiilty] Improve the serialization failure message (#19691 ) * Done * done * Done * fix test * Adressed code review. * done * done * fix mistake * Skip tests on windows	2021-10-28 14:25:51 -07:00
matthewdeng	bfb0ef1b08	move jsonschema to core dependencies and update default AutoscalerPrometheusMetrics (#19831 )	2021-10-28 13:04:22 -07:00
SangBin Cho	96fc875a89	[Core] Improve scheduling observability and fix wrong resource deadlock report message. (#19746 )	2021-10-28 11:42:21 -07:00
SangBin Cho	39486ef08c	[Core] Fix the resource leak if custom resources don't exist. #19837 Why are these changes needed? The current logic can cause resource leak if AllocateTaskResourceInstances is requested with the custom resources that don't exist in the local node. The original assumption was the caller will free resources when it returns false, but it is an error prone API, and it actually turns out that we don't do this anywhere. Related issue number Closes #17044	2021-10-28 11:00:34 -07:00
Amog Kamsetty	1803d88943	[Train] Simplify single worker training (#19814 ) * wip * update * fix * fix * fix * fix	2021-10-28 10:54:35 -07:00
shrekris-anyscale	6e6fff8857	[serve] Enable deployment of functions/classes that take no parameters (#19708 )	2021-10-28 12:53:44 -05:00
Jiao	ed0e2e4fd7	[job submission] Add job_config in subprocess driver script (#19765 )	2021-10-28 12:12:51 -05:00
gjoliver	d81885c1f1	[RLlib] Fix all the CI tests that were broken by is_training and replay buffer changes; re-comment-in the failing RLlib tests (#19809 ) * Fix DDPG, since it is based on GenericOffPolicyTrainer. * Fix QMix, SAC, and MADDPA too. * Undo QMix change. * Fix DQN input batch type. Always use SampleBatch. * apex ddpg should not use replay_buffer_config yet. * Make eager tf policy to use SampleBatch. * lint * LINT. * Re-enable RLlib broken tests to make sure things work ok now. * fixes. Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-10-28 18:06:47 +02:00
Jiajun Yao	fe8138bfc2	Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19810 )	2021-10-28 08:44:23 -07:00
Simon Mo	5e927b01ad	Revert "[CI] Remove config that disables Bazel test result cache" (#19818 ) * Revert "[CI] Remove config that disables Bazel test result cache (#18701)" This reverts commit `098ff36faa`. * Remove all RLlib tests from BUILD that currently fail. Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-10-28 15:54:53 +02:00
Eric Liang	f60d312259	Try fixing reference counting issue with manual _owner assignment (#19734 )	2021-10-28 02:26:35 -07:00
Guyang Song	119318932a	remove the env config 'RAY_DASHBOARD_MODULE_EVENT' (#19629 )	2021-10-28 16:51:59 +09:00
SangBin Cho	c414eb20d5	[Internal Observability] Improve the per task/actor resource usage visibility (#19782 ) * prototype done * done	2021-10-28 00:21:22 -07:00
Patrick Ames	8a9f664d75	[data] Add support for custom dataset block write path providers. (#19347 ) Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-10-28 00:12:02 -07:00
Chen Shen	224ed0fa5c	[Core][CoreWorker] graceful shutdown if GetCoreWorker is null (#19598 ) There are cases that the language frontend calls GetCoreWorker() after the worker has already been shutdown. Currently this results in a crash and causes confusions. pid=3714) [2021-10-21 10:50:23,596 C 3714 33544237] core_worker.cc:194: Check failed: core_worker_process The core worker process is not initialized yet or already shutdown. (pid=3714) * StackTrace Information * (pid=3714) ray::GetCallTrace() (pid=3714) ray::SpdLogMessage::Flush() (pid=3714) ray::SpdLogMessage::~SpdLogMessage() (pid=3714) ray::RayLog::~RayLog() (pid=3714) ray::core::CoreWorkerProcess::EnsureInitialized() (pid=3714) ray::core::CoreWorkerProcess::GetCoreWorker() (pid=3714) __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id() (pid=3714) _PyMethodDef_RawFastCallKeywords (pid=3714) _PyMethodDescr_FastCallKeywords (pid=3714) call_function (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) property_descr_get (pid=3714) _PyObject_GenericGetAttrWithDict (pid=3714) _PyEval_EvalFrameDefault (pid=3714) _PyEval_EvalCodeWithName (pid=3714) _PyFunction_FastCallKeywords (pid=3714) call_function (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) call_function (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) method_call (pid=3714) PyObject_Call (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) call_function (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) call_function (pid=3714) _PyEval_EvalFrameDefault (pid=3714) function_code_fastcall (pid=3714) method_call (pid=3714) PyObject_Call (pid=3714) t_bootstrap (pid=3714) pythread_wrapper (pid=3714) _pthread_start (pid=3714) thread_start	2021-10-27 23:11:53 -07:00
Jiajun Yao	7fb65abae1	[data] Fix dataset doc (#19821 )	2021-10-27 22:41:09 -07:00
Alex Wu	46965e7672	[ARM] Use uint64_t instead of unsigned long (#13774 ) Co-authored-by: Alex Wu <alex@anyscale.com>	2021-10-27 21:08:25 -07:00
Jiajun Yao	11751a1d87	Arrow block dataset groupBy (#19673 )	2021-10-27 16:27:11 -07:00
Edward Oakes	b2e12dc43b	[runtime_env] Add basic support for python modules (#19651 )	2021-10-27 17:56:46 -05:00
gjoliver	39b0faa3ec	[RLlib]: bug fix, should be input_dict['is_training'] (#19805 )	2021-10-27 23:30:43 +02:00
Sven Mika	4a82d3ea6c	Revert "[RLlib; Docs overhaul] Docstring cleanup: Trainer, trainer_template, Callbacks. (#19758 )" (#19806 ) This reverts commit `80eeb13175`.	2021-10-27 23:30:07 +02:00
Simon Mo	3e038aebb2	[CI] Allow release tests infra to accept buildkite artifacts (#19803 )	2021-10-27 13:04:01 -07:00
Yi Cheng	98961d1ee2	[core] Fix the wrong error message in gcs for worker exits (#19774 )	2021-10-27 12:55:27 -07:00
matthewdeng	aa5499ef0f	[Train] implement CheckpointStrategy (#19111 ) * [SGD] implement CheckpointStrategy * address comments * update docs * Update doc/source/train/user_guide.rst Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> * best checkpoint Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2021-10-27 11:31:04 -07:00
Amog Kamsetty	5d54412f1c	[Docker] Alias `ray-ml:nightly` to `ray-ml:nightly-gpu` (#19726 ) * wip * wip * update * finish * deprecate * debug * fix and address comments * try catch * fix * split tests * force * merge * docs * wip * fix and check * update readme * fix * fix * fix sanity checking * format * alias * fix * comment	2021-10-27 11:30:49 -07:00
Edward Oakes	1f681981af	[serve] Bump controller max concurrency to 15k, make long poll timeout random (#19790 )	2021-10-27 13:28:16 -05:00
Yi Cheng	abec07700a	[nightly] Adding more tests related to grpc broadcasting to staging mode (#19779 ) ## Why are these changes needed? We have concern that grpc based broadcasting might have negative impact on pg related workload. This test is to ensure it's running well before merging. ## Related issue number #19438	2021-10-27 10:46:13 -07:00
Edward Oakes	acc5702535	[runtime_env] Fix hash length in URI (#19777 )	2021-10-27 12:22:20 -05:00
mwtian	b238297bfb	[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687 ) This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis. Most important logic added are GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc} GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc} Other files are modified for adding boilerplates, plumbing, removing dead code and cleanups. This PR can also be reviewed commit by commit. 418f065, 3279430 are cleanups. 028939c is a pure-refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600. 286161f parameterized gcs_server_test to test GCS pubsub. The rest of commits have new logic added. All new logic are behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior. The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in. Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.	2021-10-28 01:18:54 +08:00
Sven Mika	80eeb13175	[RLlib; Docs overhaul] Docstring cleanup: Trainer, trainer_template, Callbacks. (#19758 )	2021-10-27 19:15:35 +02:00
Sven Mika	f2cb2ed203	[RLlib; Docs overhaul] Docstring cleanup: Policies, policy_templates. (#19759 )	2021-10-27 19:14:39 +02:00
SangBin Cho	418b4a94e6	[Core] Remove legacy scheduler code (#19780 ) * Remove unused worker APIs * Remove unused scheduling resources. * lint	2021-10-27 06:57:08 -07:00
Simon Mo	40d52edabc	[CI] Upload wheels to artifact store in all jobs (#19778 )	2021-10-27 10:27:56 +01:00
Simon Mo	6afbd1f558	[Serve] /api/snapshot works with all Serve KVStores (#19772 )	2021-10-26 23:27:38 -07:00
Jiao	3f628d4f6b	increase long poll timeout and wrk trial cpu resource (#19768 )	2021-10-26 21:31:39 -07:00
SangBin Cho	bcd27b708f	[Test] Mark many ppo as unstable (#19769 )	2021-10-26 21:27:43 -07:00
Qing Wang	7647ea3512	[Java] Add helper method to build driver process. (#19740 ) We make the buildDriver() process as a helpful util to avoid duplicate code.	2021-10-27 10:17:37 +08:00
architkulkarni	6bd49a8cd5	[runtime env] Improve working dir messaging (#18893 )	2021-10-26 20:58:02 -05:00
Amog Kamsetty	db863aafc0	Revert "Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 )" (#19763 ) This reverts commit `e58fcca404`.	2021-10-26 17:32:56 -07:00
Jiajun Yao	47744d282c	[data] Fix arrow dataset sort on empty blocks (#19707 )	2021-10-26 15:30:23 -07:00
SangBin Cho	3e81506d90	[Threaded actor] Fix threaded actor race condition (#19751 )	2021-10-26 15:17:53 -07:00
Eric Liang	2652ae7905	[client] Put of a list should not return a list, this is a client bug (#19737 )	2021-10-26 13:51:37 -07:00
Amog Kamsetty	e58fcca404	Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 ) This reverts commit `f0053d405b`.	2021-10-26 12:55:20 -07:00
Yi Cheng	2ec9a70e24	[gcs] Fix the regression of enabling grpc based broadcasting in actor scheduling (#19664 ) ## Why are these changes needed? Previously, we don't send requests if there is an in-flight request. But this is actually bad, because it prevent raylet get the latest information. For example, if the request needs 200ms to arrive at the raylet, the raylet will lose one update. In this case, the next request will arrive after 200 + 100 + (in flight time) ms. So we still should send the request. TODO: - Push the snapshot to raylet if the message is lost. - Handle message loss in raylet better. ## Related issue number #19438	2021-10-26 12:00:37 -07:00

10144 commits