hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Clark Zinzow	0704b825ff	[Datasets] Add spread resource prefix for manual round-robin resource-based task load balancing. (#18776 )	2021-09-20 22:41:11 -07:00
Eric Liang	361a13602c	Actor repr for log prefix should be computed after init, not before (#18749 )	2021-09-20 21:34:53 -07:00
DK.Pino	d329101469	Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." (#18735 ) * fix conflict * cxx lint	2021-09-20 20:18:12 -07:00
Yi Cheng	07babd807c	Revert "Revert "[core] Async submitting actor registerring (#18009 )" (#18719 )" (#18722 )	2021-09-20 19:17:00 -07:00
Sasha Sobol	65c1c8bb9e	Add an integration test for scheduler_avoid_gpu_nodes (#18763 )	2021-09-20 17:20:42 -07:00
Jiao	9bb4a87031	[runtime_env] Add experimental job yaml (#18768 )	2021-09-20 18:00:25 -05:00
Stephanie Wang	eafe6d5c79	Fix ref counting assertion check (#18752 ) * Fix assertion crash * test, lint * todo * x	2021-09-20 15:16:19 -07:00
Kai Fricke	cee18152f1	[tune] Remove deprecated features, promote warnings to errors (#18595 )	2021-09-20 22:54:28 +01:00
Kai Fricke	2e99fb215f	[tune] Cache unstaged placement groups for potential re-use (#18706 )	2021-09-20 20:23:35 +01:00
Ian Rodney	8d6ddcee53	[GCP] Add `conda` to the path when possible. (#18653 )	2021-09-19 23:06:48 -07:00
Eric Liang	2fa9648ef0	Revert "add integration test for gpu scheduling/avoidance (#18729 )" (#18754 ) This reverts commit `57edc0c607`.	2021-09-19 17:05:05 -07:00
Dmitri Gekhtman	ffe533b297	[autoscaler] Log ips and ids when terminating nodes, code structure (#18180 ) * recovery failure uses same termination function * More cleanup * More cleanup * ips * wip * wip * wip * Fix tests * tweak	2021-09-19 18:44:38 -04:00
xwjiang2010	5551cdac19	[Tune] Break from loop after warning msg is logged. (#18720 )	2021-09-18 16:33:44 -07:00
mwtian	32f71765e9	[Client] Allow Client{Object,Actor}Ref to accept a future. (#18677 ) * Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously. * rename callback, fix	2021-09-18 16:32:02 -07:00
Sasha Sobol	57edc0c607	add integration test for gpu scheduling/avoidance (#18729 )	2021-09-18 01:32:18 -07:00
Chen Shen	eab1d28fd3	fix test (#18737 )	2021-09-18 00:57:34 -07:00
Jiao	948508efb8	[Serve] Add checkpoint options and custom storage option (#18657 )	2021-09-18 00:04:29 -07:00
DK.Pino	4ef8fd6942	remove the legacy retry mechanism (#18589 )	2021-09-18 11:11:19 +08:00
Amog Kamsetty	0211101e6f	[SGD] Redo Class API (#18728 ) * wip * wip * add horovod example * add example * lint * fix * address comments * updates * lint * update example * address comment * address comment * update * fix * Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * address comments * add back name mangling * fix tests * Update python/ray/util/sgd/v2/trainer.py * fix * lint * fix * fix docstring * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * update * fix failing test Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2021-09-17 18:17:51 -07:00
Clark Zinzow	aaa097c293	[Datasets] Shuffled data loading support (#18678 )	2021-09-17 16:08:53 -07:00
Simon Mo	f2ea6c4e68	[Serve] Call Callable.__del__ explicit during graceful shutdown (#18446 )	2021-09-17 15:19:57 -07:00
Chris K. W	8858489e2f	[client] let ray client reconnect on grpc failures (#18329 ) * wip * client tests working again * extra prints * start reconnect logic for proxier * local proxy more wip * delay cleanup logic working on proxy * Fix up dataservicer logic * lint + fix proxy data servicer exit logic * hmmm * delay cleanup always in dataservicer * fix last_seen check * cancel channel on error * explicitly request cleanup * cleanup request fixes * fix dataclient proxy * start idempotence logic * change default channel state * add backoff logic * move connection logic back into worker.__init__ * add logic for replay cache case where request was received but response hasn't been fully resolved * new proto entries for data stream caching * start replay_cache logic, increase cleanup delay * hardcode retries * Let data channel attempt reconnects * manually reset queue, remove replay_cache logic * reduce cleanup delay to 5 minutes * fix local tests * Remove async cache logic * retry async requests * simplify backoff logic * Fix ray client proto * Configurable reconnect grace period * Basic logsclient fix? * Configure grace through environment variable * Use stopped event to force faster datapath cleanup * Better connect+reconnect logic * fix reconnect_grace_period default * init fixes for reconnect_grace_period * cleanup * fix _get_client_id_from_context call * add logic for pathological cache cases * less intrusive data channel error message * fix tests * Make stuff less painful to read * add ordered replay cache for dataservicer, replay cache tests * fix ordering import, start_reconnect test * add middleman testing logic * enforce ordering of dataclient requests * retry wheels * grace period through env only, restore test_dataclient_disconnect * minor fixes * force rerun * less intrusive error msgs * address review * replay->response cache * remove unneeded sleep * typing * extra response cache test * fix error msg * remove TODO * add _reconnect_channel * add grace period test * store thread_id and req_id in metadata * Revert "store thread_id and req_id in metadata" This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181. * Revert "Revert "store thread_id and req_id in metadata"" This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19. * fix metadata check * remove comment * removed unused cv * cast back to int * refactor Datapath for readability * Revert refactor This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13. * fix comment * merge fixes * refactor _shutdown * address reviews * log errors in both cases * add comments * address reviews * move reconnect test to medium * Always propogate error to callbacks * readability * formatting * Faster cleanup on uncaught dataservicer errors * delete tmp file * offset commit * rrefactor * propagate data servicer error message * Stricter handling/propagation of errors * remove tmp file * better docs * forward reconnecting metadata * add annotation * fix invalidate + add test * fix docstrings and types * disable retries and caching if reconnect grace period is set to 0 * update comments * address review, increase ack batch size and skip ack's if reconnect isn't enabled * Don't terminate data stream on missing reconnecting metadata	2021-09-18 01:11:00 +03:00
Yi Cheng	cf64ab5b90	Revert "[core] Async submitting actor registerring (#18009 )" (#18719 ) This reverts commit `8ce01ea2cc`.	2021-09-17 13:34:12 -07:00
xwjiang2010	9c8c6c09cb	Revert "[SGD] v2 Class API (#18571 )" (#18715 ) This reverts commit `de050e8187`.	2021-09-17 10:34:36 -07:00
Yi Cheng	8ce01ea2cc	[core] Async submitting actor registerring (#18009 )	2021-09-17 10:03:35 -07:00
Clark Zinzow	1da83c828c	[Datasets] Properly support fs inference on path with space. (#18644 )	2021-09-17 10:02:43 -07:00
architkulkarni	a9cce8a34b	[serve] Add basic calculate_desired_num_replicas function for autoscaling (#18658 )	2021-09-17 00:18:51 -07:00
Simon Mo	3029812b8b	[Serve] Autoscaling metric store take 2 (#18683 )	2021-09-16 22:28:13 -07:00
Eric Liang	c9ca980c83	Check dataset pipeline is not read multiple times by accident (#18682 )	2021-09-16 20:33:24 -07:00
Amog Kamsetty	84e958f330	[ML] Consolidate and upgrade Deep Learning Dependencies (#18574 ) * wip ' * upgrade requirements * add file * fix * fixes * Apply suggestions from code review Try mlagents==0.21.0 for now (works with torch 1.9). * Apply suggestions from code review * wip * wip * fix * fix * upgrade lightning bolts * address comment Co-authored-by: Sven Mika <sven@anyscale.io>	2021-09-16 20:16:40 -07:00
Amog Kamsetty	de050e8187	[SGD] v2 Class API (#18571 ) * wip * wip * add horovod example * add example * lint * fix * address comments * updates * lint * update example * address comment * address comment * update * fix * Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * address comments * add back name mangling * fix tests * Update python/ray/util/sgd/v2/trainer.py * fix * lint * fix * fix docstring * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * update Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2021-09-16 12:33:38 -07:00
Simon Mo	eeaae5aa08	Revert "[Serve] Add InMemoryMetricsStore for Autoscaling (#18458 )" (#18675 ) This reverts commit `a024effac7`.	2021-09-16 11:37:31 -07:00
Simon Mo	a024effac7	[Serve] Add InMemoryMetricsStore for Autoscaling (#18458 )	2021-09-16 11:08:42 -07:00
Simon Mo	317a34c523	[Serve] Use BackendConfig Protobuf (#17835 )	2021-09-16 11:08:23 -07:00
Edward Oakes	e7ea1f9a82	[runtime_env] Remove global logger from working_dir code (#18605 )	2021-09-16 10:37:45 -05:00
Jernej Makovsek	b5c5247ad4	Update example yaml file for running local clusters (#18530 )	2021-09-16 02:24:45 -07:00
xwjiang2010	ea48b1227f	[Tune] Do not crash when resources are insufficient. (#18611 )	2021-09-15 23:00:53 -07:00
Stephanie Wang	be7cb70c30	[core] Fix ref counting during actor construction (#18646 ) * test * fix * cpp * skip windows Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-09-15 22:16:53 -07:00
Chris K. W	7df3441ae9	[client] Fix credential generation when secure=True but no credentials provided (#18636 ) * set self._credentials if not provided * fix credential generation	2021-09-16 00:37:33 +03:00
Antoni Baum	7e95f330d5	[ci] Fix xgboost_ray install from git (#18640 )	2021-09-15 18:07:15 +01:00
Antoni Baum	d50ff16ccf	[ci] Fix HEBO breaking Tune tests (#18629 )	2021-09-15 10:01:29 -07:00
Kai Fricke	0223ae9605	[xgboost] Bump xgboost_ray requirements_upstream.txt version to 0.1.3 (#18632 )	2021-09-15 18:01:15 +01:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Edward Oakes	7d0a2b39e3	[runtime_env] Remove dynamically imported setup_hook (#18601 )	2021-09-15 10:19:55 -05:00
Antoni Baum	eeb67a42cc	pip install xgboost_ray -> xgboost_ray[default] (#18607 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-09-15 14:45:56 +01:00
Sven Mika	8a00154038	[RLlib] Bump tf version in ML docker to tf==2.5.0; add tfp to ML-docker. (#18544 )	2021-09-15 08:46:37 +02:00
SangBin Cho	0684531e22	[Test] Break down placement group tests (#18612 )	2021-09-14 21:55:18 -07:00
Chris K. W	cc1d7b8174	[client] Refactors for Reconnect PR (#18484 ) * add refactors * add worker annotation * Regenerate credentials by default * use self._secure * infer secure if credentials provided * separate _shutdown	2021-09-14 16:13:35 -07:00
Eric Liang	15512c27c2	Revert "Revert "Route core worker ERROR/FATAL logs to driver logs (#1… (#18604 )	2021-09-14 13:32:07 -07:00
SangBin Cho	31e1638fb3	[CLI] Improve ray status for placement groups (#18289 )	2021-09-14 11:29:13 -07:00

5107 commits