hiro/ray - Forgejo: Beyond coding. We Forge.

mirror of https://github.com/vale981/ray synced 2025-03-18 17:16:39 -04:00

Author	SHA1	Message	Date
Sven Mika	e5ead6a4b0	[RLlib; Documentation] Minor fixes "rllib in 60s" and per-feature sigils. (#20248 )	2021-11-13 22:10:47 +01:00
mwtian	df8042c576	[Client] connect to localhost via 127.0.0.1 (#20274 ) Some Ray client users are likely seeing an issue similar to #7084. Inside a container, connecting to localhost: fails but connecting to 127.0.0.1: succeeds. Changing Ray client to use 127.0.0.1 for localhost connection / serving should fix the issue.	2021-11-13 11:12:55 -08:00
architkulkarni	96de740cd2	[runtime env] Enable multinode tests (#20264 )	2021-11-13 11:08:29 -08:00
Amog Kamsetty	65a17da2ec	[Train] Refactor Backends (#20312 ) * wip * finish * comment * fix * install horovod for docs * address comment * fix doc build failure	2021-11-13 11:05:53 -08:00
Amog Kamsetty	4396419a64	[Release] Fix tune_rllib connect test (#20321 ) * [Release] Fix tune_rllib connect test * use canonical app config	2021-11-13 10:11:20 -08:00
xwjiang2010	f13c2a5350	[Tune] Revert "remove pg caching" (#20308 ) This reverts commit `5f14eb3ee4`.	2021-11-13 16:25:22 +00:00
Antoni Baum	1b867520e6	[docs] Add pyarrow as a dependency (#20320 )	2021-11-13 16:00:58 +00:00
mwtian	875b0aea0a	fallback to grpc.experimental.aio when importing grpc.aio (#20287 )	2021-11-13 15:59:57 +09:00
mwtian	cdadc2b7d2	Change owner (#20313 )	2021-11-12 21:23:36 -08:00
Eric Liang	567e955810	Revert "[job submission] Use ray.init format addresses for JobSubmissionClient (#20245 )" (#20314 ) This reverts commit `adc15a0fb0`.	2021-11-12 21:11:24 -08:00
gjoliver	7fe42341ed	[release] Switch many_ppo test to use the canonical rllib app cfg as well. (#20310 )	2021-11-12 20:51:28 -08:00
Jiajun Yao	f8b738d029	[scheduler] Add scheduling debug log (#20302 )	2021-11-12 18:48:05 -08:00
matthewdeng	e77cc926be	[train] minor doc updates (#20271 )	2021-11-12 17:20:23 -08:00
Clark Zinzow	918a215442	[Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation (#20074 )	2021-11-12 15:53:58 -08:00
mwtian	a39fd74674	disable //python/ray/tests:test_autoscaler_drain_node_api in HA GCS build (#20296 )	2021-11-12 15:47:42 -08:00
Tricia Fu	e59c14117f	[Doc] [Serve] Add summary sub header to each page (#20231 )	2021-11-12 14:18:42 -08:00
Nikita Vemuri	adc15a0fb0	[job submission] Use ray.init format addresses for JobSubmissionClient (#20245 )	2021-11-12 13:52:43 -08:00
xwjiang2010	cdf70c2900	[Tune] Remove legacy resources implementations in Runner and Executor. (#19773 )	2021-11-12 12:33:39 -08:00
Edward Oakes	73e570c426	Fix windows build (don't skip test_job_manager.py) (#20294 )	2021-11-12 11:13:15 -08:00
architkulkarni	138ec75246	[runtime env] Revert reference counting for per-actor URIs (#20281 )	2021-11-12 11:09:38 -08:00
Matti Picus	1e80a2a83a	[WINDOWS] unskip tests (#20212 )	2021-11-12 10:11:11 -08:00
Chen Shen	a617cb8813	[Core][actor out-of-order execution 1/n] Move CoreWorkerDirectActorTaskSubmitter into a separate file This is the stack of PRs that supports out-of-order execution for threaded/async actors. Next PR: #20149. At a high level, threaded and async actors already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for detailed discussion. The major changes in this stack are: OutOfOrderActorSubmitQueue ([Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue #20150): we have a per-client task submit queue, which guarantees the sequential order of task submission. In [Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue #20150 we introduce OutOfOrderActorSubmitQueue, which relaxes the guarantee; it sends a task over the network as soon as its dependencies are resolved. Two PRs ([Core][actor out-of-order execution 1/n] Move CoreWorkerDirectActorTaskSubmitter into a separate file #20148 and [Core][actor out-of-order execution 2/n] create abstraction for the queuing logic on the client/actor submission. #20149) precede this PR and refactor the actor submission logic to make the abstraction possible. OutOfOrderActorSchedullingQueue ([Core][actor out-of-order execution 5/n] implement out-of-order scheduling queue #20176): similarly, we also have a per-client task scheduling queue on the actor to ensure tasks are executed according to the submission order (sequence_no). OutOfOrderActorSchedullingQueue relaxes the guarantee by enqueuing tasks as soon as all their dependencies are resolved. One PR ([Core][actor out-of-order execution 4/n] refactor the actor receiver code #20160) precedes this PR and refactors the actor scheduling logic to make the code easier to read; however, this one is optional. Plumbing PR ([Core][actor out-of-order execution 6/n] plumbing work to make it work e2e #20177): this PR enables out-of-order execution by introducing options(execute_out_of_order=True) and creates actor client/server components according to the configuration. One thing the PR hasn't touched upon is the restart/retry guarantees, which might need some discussion. In this PR we separate client from server code for actor task submission. This makes the follow-up change easier.	2021-11-12 09:28:58 -08:00
Siyuan (Ryans) Zhuang	3b62388a9a	[Workflow] Workflow tail recursion optimization (#19928 ) * tail recursion optimization	2021-11-12 09:13:40 -08:00
Simon Mo	b6bd4fd5f3	[Serve] Don't recover from current state checkpoint (#19998 )	2021-11-12 09:02:27 -08:00
xwjiang2010	ce8504b0b2	[CI] Rebalance Tune tests a bit. (#20263 )	2021-11-12 15:30:18 +00:00
xwjiang2010	5f14eb3ee4	[Tune] Remove PG caching. (#19515 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2021-11-12 14:36:04 +00:00
Sven Mika	38c456b6f4	[RLlib; Tune] Fix rllib/train.py script after tune.Experiment c'tor change. (#20283 )	2021-11-12 15:25:50 +01:00
Kai Fricke	246787cdd9	Revert "[RLlib] POC: `PGTrainer` class that works by sub-classing, not `trainer_template.py`. (#20055 )" (#20284 ) This reverts commit `6f85af435f`.	2021-11-12 13:09:43 +00:00
Kai Fricke	d88fdd6e38	[tune] refactor SyncConfig (#20155 )	2021-11-12 09:36:15 +00:00
SangBin Cho	7132f91789	[Core] Reduce the frequency of retry messages (#20175 ) * Reduce the frequency of retry messages * done	2021-11-11 23:52:37 -08:00
Sven Mika	70fe25055a	[RLlib] Issue: Get single step input dict incorrect. (#20217 )	2021-11-12 08:38:51 +01:00
Edward Oakes	ee4e4f4036	[runtime_env] Support specifying the runtime_resources directory for testing (#20257 )	2021-11-11 21:50:42 -08:00
architkulkarni	33f680095d	[Test] [runtime env] Retry wheel urls for up to 2h to give time for Mac wheels to build (#19337 )	2021-11-11 21:48:35 -08:00
Edward Oakes	7c9881b73d	[serve] Fix serve_failure test (#20268 )	2021-11-11 19:19:34 -08:00
Edward Oakes	eb6449b21b	[serve] Remove 5s halt from controller startup (#20262 )	2021-11-11 19:18:43 -08:00
SangBin Cho	e901180a55	Do not import pytest in test util (#20252 )	2021-11-12 12:09:28 +09:00
Qing Wang	7500f7d88a	Remove deprecated Java PG APIs. (#20219 ) These APIs have been deprecated for more than 7 months and 4 versions, so it is time to remove them.	2021-11-12 09:29:48 +08:00
Qing Wang	5d773e75e6	Fix idle worker leak issue if it received a SIGTERM when DrainAndShutdown. (#19877 ) This PR fixes the issue that worker might be leaked if task finished with some errors. See #19639 for more details.	2021-11-12 09:26:46 +08:00
mwtian	be29fa0302	[CI] make using gcc 9 explicit (#20147 )	2021-11-11 16:12:40 -08:00
chenk008	74fa267c72	Enable worker in container CI test (#20174 )	2021-11-11 16:11:06 -08:00
Edward Oakes	5ae5c1ba28	[job submission] Basic CLI prototype (#20204 )	2021-11-11 15:59:13 -08:00
Teofilo Zosa	abf0eb53cc	Fix aiohttp 3.8.0 breaking changes (and unpin from 3.7) (#20261 )	2021-11-11 15:35:20 -08:00
Michael Galarnyk	dbeb2e2f73	Add Ray Serve Blogs to Doc (#19846 ) The Serving ML Models in Production blog link is in line with the latest Ray Summit talk on Ray Serve.	2021-11-11 15:10:36 -08:00
Edward Oakes	59698aa89c	[Serve] add survey link (#20230 )	2021-11-11 15:10:10 -08:00
mwtian	0330852baf	[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111 ) This change adds a Python publisher and subscriber in `gcs_utils.py`, and a GRPC handler on GCS for publishing via GCS. Error info is migrated to use the GCS-based pubsub if the feature flag `RAY_gcs_grpc_based_pubsub=true` is set. Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set after the Redis-less bootstrapping work. Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published. GCS-based pubsub has gaps in handling deadlines, cancelled requests and GCS restarts, so 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.	2021-11-11 14:59:57 -08:00
Simon Mo	fca851eef5	[Serve] Change ReplicaName to use internal prefix (#20067 )	2021-11-11 14:21:34 -08:00
Jiajun Yao	992ab3e098	[Release] Commit sanity check when a url is provided (#20255 )	2021-11-11 13:33:58 -08:00
Jules S. Damji	71a162d8ab	Fixed code snippet to include config parameter and a minor typo (#20193 ) Signed-off-by: Jules S.Damji <jules@anyscale.com> Co-authored-by: Jules S.Damji <jules@anyscale.com>	2021-11-11 18:37:03 +00:00
Dmitri Gekhtman	8971422d8f	[autoscaler] Use drain node api in autoscaler before terminating nodes (#20013 ) * wip * Draft * Use bytest for node id * remove stray helm change * fix autoscaler init arg * don't forget to instantiate new load metrics dict * remove extraneous diff * Timeout, comments, function signature. * typo * another comment * tweak * docstring * shorter timeout * Use a better error code * missing self * Dedent example * Add drain node prometheus metric. * comment * Update tests part 1: test_autoscaler.py * Update tests part 2: test_resource_demand_scheduler * lint * Update tests part 3: test_autoscaling_policy * Unit tests for new Prometheus metric and DrainNode error handling. * comment * removed unused function * Try adding ability to mock out process termination to fake node provider * Add integration test. * fix * fix * lint * Improve log message * fix * Simplify test * Fix doc example * remove unused dict * Mock out process termination in a subclass * Add doc string and comment explaining prune active ips. * Comment: wtf is use_node_id_as_ip * one more comment * more explanation * period * tweak	2021-11-11 08:31:40 -08:00
SangBin Cho	9fd8c6648c	[Test] Fix newly added nightly tests, threaded actor + chaos testing (#20220 ) * Fix nightly tests * done * done	2021-11-11 05:01:19 -08:00

13756 commits