hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Amog Kamsetty	db863aafc0	Revert "Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 )" (#19763 ) This reverts commit `e58fcca404`.	2021-10-26 17:32:56 -07:00
Jiajun Yao	47744d282c	[data] Fix arrow dataset sort on empty blocks (#19707 )	2021-10-26 15:30:23 -07:00
SangBin Cho	3e81506d90	[Threaded actor] Fix threaded actor race condition (#19751 )	2021-10-26 15:17:53 -07:00
Eric Liang	2652ae7905	[client] Put of a list should not return a list, this is a client bug (#19737 )	2021-10-26 13:51:37 -07:00
Amog Kamsetty	e58fcca404	Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 ) This reverts commit `f0053d405b`.	2021-10-26 12:55:20 -07:00
Yi Cheng	2ec9a70e24	[gcs] Fix the regression of enabling grpc based broadcasting in actor scheduling (#19664 ) ## Why are these changes needed? Previously, we don't send requests if there is an in-flight request. But this is actually bad, because it prevent raylet get the latest information. For example, if the request needs 200ms to arrive at the raylet, the raylet will lose one update. In this case, the next request will arrive after 200 + 100 + (in flight time) ms. So we still should send the request. TODO: - Push the snapshot to raylet if the message is lost. - Handle message loss in raylet better. ## Related issue number #19438	2021-10-26 12:00:37 -07:00
gjoliver	99a0088233	[RLlib] Unify the way we create local replay buffer for all agents (#19627 ) * [RLlib] Unify the way we create and use LocalReplayBuffer for all the agents. This change 1. Get rid of the try...except clause when we call execution_plan(), and get rid of the Deprecation warning as a result. 2. Fix the execution_plan() call in Trainer._try_recover() too. 3. Most importantly, makes it much easier to create and use different types of local replay buffers for all our agents. E.g., allow us to easily create a reservoir sampling replay buffer for APPO agent for Riot in the near future. * Introduce explicit configuration for replay buffer types. * Fix is_training key error. * actually deprecate buffer_size field.	2021-10-26 20:56:02 +02:00
xwjiang2010	ab15dfd478	[Tune release test] Set 500G disk space for rllib_tests. (#19730 )	2021-10-26 10:12:03 -07:00
Avnish Narayan	ad87ddf93e	[rllib] Add deterministic test to gpu (#19306 ) Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-10-26 10:11:39 -07:00
iasoon	b5158ca0ab	[serve] Correctly set num_replicas when deploying autoscaling deployment (#19520 )	2021-10-26 12:10:59 -05:00
Lixin Wei	c937950910	Add 'local' Tag to `@com_github_antirez_redis//:bin` (#19685 ) * Build redis locally * fix	2021-10-26 09:17:52 -07:00
SangBin Cho	00ea716ada	Revert "Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452 )" (#19724 )" (#19736 ) This reverts commit `d453afbab8`.	2021-10-26 08:25:09 -07:00
Jiao	aaef82920d	[serve] Add periodic timeouts to long poll client to avoid accumulating concurrent tasks in the controller (#19728 )	2021-10-26 09:44:00 -05:00
SangBin Cho	e914ea930d	[Core] Stop reporting tasks spec to GCS that are unnecessary #19699 (#19699 ) This RPC is from legacy code and not needed anymore (the task spec is already in the actor table), but it adds quite amount of keys to Redis. The below is the sum of bytes size(? I am not sure if it is bytes size, but I grabbed the length of the value when I queried Redis) of each prefix when running many_ppo. As you can see Task& and Task takes a lot of part although they are not really used. In [82]: b Out[82]: defaultdict(int, {b'WORKE': 1080864, b'ACTOR': 1470931, b'TASK&': 1020646, b'TASK:': 870551, b'PROFI': 360000, b'PLACE': 10107, b'JOB:\x01': 8, b'JOB:\x04': 8, b'NODE:': 99, b'NODE_': 126, b'INTER': 44, b'JOB:\x03': 8, b'redis': 16, b'JOB:\x02': 8, b'JOB:\x05': 8})	2021-10-26 04:17:58 -07:00
Kai Fricke	98244ad130	[ci/release] Report error to database on alert (#19743 )	2021-10-26 10:48:02 +01:00
Kai Fricke	96ddf5b9ac	[ci/release] Choose cloud by name or ID (#19742 )	2021-10-26 10:21:54 +01:00
Kai Fricke	3081488a99	[tune] Fix local checkpoint deletion for remote trials (#19632 )	2021-10-26 09:18:07 +01:00
Amog Kamsetty	6e61ca623d	[CI] Infra for "user" tests (#19662 )	2021-10-26 08:47:22 +01:00
SangBin Cho	ba61c436ea	Revert "Try enabling event stats by default (#19650 )" (#19735 ) This reverts commit `6081cf870e`.	2021-10-26 14:33:40 +09:00
Eric Liang	81b0eb297c	Un-revert size estimator and fix Train test (#19719 )	2021-10-25 22:09:24 -07:00
Eric Liang	10e27892c2	Suppress tsan false positive in gcs-pub-sub-test (#19727 )	2021-10-25 19:52:53 -07:00
Amog Kamsetty	f0053d405b	[Docker] Support multiple CUDA Versions (#19505 ) * wip * wip * update * finish * deprecate * debug * fix and address comments * try catch * fix * split tests * force * merge * docs * wip * fix and check * update readme * fix * fix * fix sanity checking * format	2021-10-25 18:57:05 -07:00
SangBin Cho	d453afbab8	Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452 )" (#19724 ) This reverts commit `e3ced0e59e`.	2021-10-26 09:14:25 +09:00
Simon Mo	5330aab27a	[CI] Deflake test metrics (#19711 )	2021-10-25 16:34:20 -07:00
Alex Wu	045d72cdc0	[docs] Fix typo in installation instructions (#19721 )	2021-10-25 15:30:34 -07:00
Eric Liang	66818d11b8	Revert "[data] Add serialized size estimator to block builder (#19681 )" (#19717 ) This reverts commit `8c37311c41`.	2021-10-25 15:06:58 -07:00
Eric Liang	8c37311c41	[data] Add serialized size estimator to block builder (#19681 )	2021-10-25 14:58:49 -07:00
SangBin Cho	ecd5a622ef	[Tests] Add a memory usage on dask on ray tests (#19674 )	2021-10-25 14:58:26 -07:00
SangBin Cho	544f774245	[Autoscaler/Core] Drain node API (#19350 ) * Initial version done. Graceful shutdown is possible with direct raylet RPCs * . * . * ip * Done. * done tests might fail * fix lint + cpp tests * fix 2 * Fix issues. * Addressed code review. * Fix another cpp test failure * completed * Skip windows tests * Update the comment * complete * addressed code review.	2021-10-25 14:57:50 -07:00
Linsong Chu	13d4894789	[workflow] Add get_metadata() for workflow (#19372 ) ## Why are these changes needed? Add the functionality to retrieve metadata for a workflow or workflow step. Design: - Similar to `get_output`, this will either return the metadata for workflow (`workflow.get_metadata(workflow_id)`) or the metadata for a specific step (`workflow.get_metadata(workflow_id, step_id)`) - Exceptions will only be raised if workflow id or step id not exist. Canceled job, running job, etc. will return proper metadata by retrieving information from checkpoint. See [here](`8c8ca609d7/python/ray/workflow/tests/test_metadata_get.py (L67)`) for more details. - Returned metadata is an aggregated result from multiple checkpoint files based on previous [discussion](https://github.com/ray-project/ray/issues/17090#issuecomment-920481789). The aggregation logic is [here for step metadata](`8c8ca609d7/python/ray/workflow/workflow_storage.py (L451)`) and [here for workflow metadata](`8c8ca609d7/python/ray/workflow/workflow_storage.py (L484)`) which can be tuned with further discussion. 
Example:
```python
>>> user_step_metadata = {"k1": "v1"}
>>> user_run_metadata = {"k2": "v2"}
>>> step_name = "simple_step"
>>> workflow_id = "simple"
>>> @workflow.step
... def simple():
...     return 0
>>> simple.options(name=step_name, metadata=user_step_metadata).step().run(workflow_id, metadata=user_run_metadata)

# get workflow-level metadata
>>> workflow.get_metadata("simple")
{'status': 'SUCCESSFUL', 'user_metadata': {'k2': 'v2'}, 'stats': {'start_time': 1634173413.116535, 'end_time': 1634173413.149051}}

# get step-level metadata
>>> workflow.get_metadata("simple", "simple_step")
{'name': '__main__.simple', 'step_type': 'FUNCTION', 'workflows': [], 'max_retries': 3, 'workflow_refs': [], 'catch_exceptions': False, 'ray_options': {}, 'user_metadata': {'k1': 'v1'}, 'stats': {'start_time': 1634173413.131262, 'end_time': 1634173413.1347651}}
```
## Related issue number https://github.com/ray-project/ray/issues/17090	2021-10-25 14:52:51 -07:00
Alex Wu	58b28f04cd	[docs/usability] Apple Silicon support (#19705 ) This PR puts the final touches on apple silicon support. There are 3 main caveats to supporting M1 macs right now (described in the docs): Requires using forge. Requires special installation instructions to get grpc working (this is an underlying grpc issue, so ideally it will be fixed upstream). We're only publishing release wheels, not nightlies right now. This also includes a grpc import check to ensure that we provide an actionable error message if the user tries the regular pip install ray process to properly install grpcio.	2021-10-25 14:49:28 -07:00
DK.Pino	e3ced0e59e	[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452 ) * fixed * lint * add cxx ut * fix comment * Revert "fix comment" This reverts commit 32ea2558166a7674d7efe2e0c0a66ea7409c7d99. * fix comment	2021-10-25 14:15:36 -07:00
architkulkarni	2c64b2b0e8	[Doc] Move all contribution info to getting-involved.html and link to it from CONTRIBUTING.rst (#19571 )	2021-10-25 14:23:23 -05:00
Eric Liang	6081cf870e	Try enabling event stats by default (#19650 )	2021-10-25 12:19:34 -07:00
Eric Liang	27a5b546ad	Make ArrowRow less scary (#19686 )	2021-10-25 12:18:42 -07:00
Jiajun Yao	e4542be0d1	[Java] Run java on mac with public ip (#19701 )	2021-10-25 11:38:33 -07:00
Tao Wang	ff7d35d246	[Core]Add test case for cached named actor (#19510 ) ## Why are these changes needed? Recently we found a bug about named actor cache, only in internal codebase but not community, and the case is not covered by test case so we didn't know before user telling us. This add an extra test to cover it. Bug Detail: we didn't publish actor's name when the actor is dead so the cache keep the name to the old actor handle. The owner of this actor cannot sense this bug because the cache didn't apply to the owner currently.	2021-10-25 11:37:41 -07:00
xwjiang2010	46266b15f0	[tune] Avoid looping through _live_trials twice in _get_next_trial. (#19596 )	2021-10-25 19:26:55 +01:00
chenk008	b65aca9002	flush stdout/stderr to avoid empty log in docker start block (#19546 )	2021-10-25 10:58:48 -07:00
architkulkarni	414910b7fc	[test] [runtime env] Add release test with Ray Client and local pip files (#19026 )	2021-10-25 11:49:27 -05:00
architkulkarni	f101f7cc02	[runtime_env] Allow specifying runtime env in @ray.remote decorator with Ray Client (#19626 )	2021-10-25 10:32:31 -05:00
Sven Mika	b213565783	[RLlib] Fix failing test cases: Soft-deprecate ModelV2.from_batch (in favor of ModelV2.__call__). (#19693 )	2021-10-25 15:00:00 +02:00
Kai Fricke	6e455e59d8	[tune] Verbosely/gracefully handle empty experiment checkpoints (#19641 )	2021-10-25 13:41:18 +01:00
Kai Fricke	0cfa267fde	[tune] Fix shim error message for scheduler (#19642 )	2021-10-25 11:16:16 +01:00
gjoliver	89fbfc00f8	[RLlib] Some minor cleanups (buffer buffer_size -> capacity and others). (#19623 )	2021-10-25 09:42:39 +02:00
roireshef	9b0352f363	[RLlib] Added LearningRateSchedule and EntropyCoeffSchedule to TF and Torch versions of A3C and PPO (#19276 )	2021-10-25 09:39:35 +02:00
gjoliver	c3c42278e4	[RLlib] clean up all the SampleBatch['is_training'] deprecation warnings (#19652 ) * [RLlib] clean up all the SampleBatch['is_training'] deprecation warnings. * wip	2021-10-25 09:38:56 +02:00
Renos Zabounidis	41dd037ae9	[RLlib; Docs] Correcting documentation with respect to postprocess_trajectory (#19672 ) postprocess_trajectory is referred to incorrectly in the rllib-environments documentation. When defining a custom policy, a user never directly modifies Policy.postprocess_trajectory, they define postprocess_fn, which is in turn called by postprocess_trajectory.	2021-10-25 09:37:58 +02:00
Jiajun Yao	f6a0165286	Add dependabot for data processing (#19682 )	2021-10-24 20:49:43 -07:00
SangBin Cho	aa9eb6499c	[Test] skip pg restart test (#19670 )	2021-10-24 16:53:29 -07:00

10050 commits