hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 21:06:39 -04:00

Author	SHA1	Message	Date
Max Pumperla	f9b71a8bf6	[docs] new structure (#21776) This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way: - [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign. - [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).	2022-01-21 15:42:05 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
Max Pumperla	38e46c9fb3	[docs] Clean up doc structure (first part) (#21667)	2022-01-20 16:19:04 +01:00
xwjiang2010	368da1742b	[tune] Enforce one future at a time for any given trial at any given time. (#20783) Also enforce disabling (instead of allowing user to override this) buffer training when checkpoint_at_end is used.	2021-12-03 08:14:12 -08:00
Will Drevo	fa878e2d4d	Added example to user guide for cloud checkpointing (#20045) Co-authored-by: will <will@anyscale.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-11-15 15:43:06 +00:00
xwjiang2010	cdf70c2900	[Tune] Remove legacy resources implementations in Runner and Executor. (#19773)	2021-11-12 12:33:39 -08:00
Kai Fricke	d88fdd6e38	[tune] refactor SyncConfig (#20155)	2021-11-12 09:36:15 +00:00
Kai Fricke	9c2b8c8501	[tune] Deprecate DurableTrainable (#19880)	2021-11-08 20:56:07 +00:00
Antoni Baum	f2773267c7	[docs] Tune doc fixes (#19791)	2021-10-29 11:45:29 +02:00
Antoni Baum	c7d6f838f6	[tune] Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144)	2021-10-08 18:16:26 +01:00
Antoni Baum	cc3199b814	[docs] Provide information about resource deadlocks, early stopping in Tune docs (#18947) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2021-10-01 13:52:47 +01:00
Richard Liaw	227aa9e89b	[tune] change delimiter for results (#16573) Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-09-28 10:03:00 +01:00
Kai Fricke	9b0d804eed	[tune] Add documentation for reproducible runs (setting seeds) (#18849) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2021-09-24 10:57:31 +01:00
xwjiang2010	5551cdac19	[Tune] Break from loop after warning msg is logged. (#18720)	2021-09-18 16:33:44 -07:00
Kai Fricke	395976c8a1	[tune] Never block for results (#18391) * [tune] Never block for results * Fix tests * Block in tests * Add comment to test	2021-09-09 12:08:00 -07:00
Richard Liaw	0594deafdf	[tune] allow users to configure bootstrap for docker syncer (#17786)	2021-09-05 22:04:31 -07:00
xwjiang2010	01adf030ec	[Tune] Raise Error when there are insufficient resources. (#17957)	2021-09-03 10:49:54 -07:00
xwjiang2010	63f00843f3	[Tune] Inform users of the setup needed for uploading results to cloud. (#18220)	2021-08-31 10:27:50 -07:00
xwjiang2010	0be9f06ab6	[tune] Output insufficient resources warning msg when trials are in pending for extended amount of time. (#17533)	2021-08-13 01:37:56 -07:00
Amog Kamsetty	be238e159d	[Tune] Update docs for `with_parameters` (#17441) * with_parameters_doc * update docstring * address comments	2021-08-05 08:48:34 -07:00
Antoni Baum	0935ec30d0	[tune] Add information about environment variables to `tune.run` docstring (#16980)	2021-07-11 17:20:17 -07:00
Kai Fricke	172d33be02	[tune] Use unbuffered training when checkpoint_at_end is used. (#16504)	2021-06-18 14:19:14 +01:00
Kai Fricke	16381625db	[tune] Reduce default number of maximum pending trials to max(16, cluster_cpus) (#15628)	2021-05-05 15:54:27 +01:00
Kai Fricke	d33b0e4bc3	[tune] Reconcile placement groups every N seconds to avoid bottlenecks when running many short trials (#15011) Closes a release blocking issue	2021-04-01 17:04:44 +02:00
Kai Fricke	84b3c3376b	[tune] document scalability best practices (k8s, scalability thresholds) (#14566) Adds a new page and table to document current scalability thresholds in Ray Tune to the documentation. Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-03-25 09:54:14 +01:00
Kai Fricke	898243d538	[tune] Limit maximum number of pending trials. Add convergence test. (#14835)	2021-03-23 18:19:41 -07:00
Kai Fricke	757866ec01	[tune] enable placement groups per default (#13906) * Refactor placement group factory object to accept placement_group arguments instead of callables * Convert resources to pgf * Enable placement groups per default * Fix tests WIP * Fix stop/resume with placement groups * Fix progress reporter test * Fix trial executor tests * Check resource for trial, not resource object * Move ENV vars into class * Fix tests * Sphinx * Wait for trial start in PBT * Revert merge errors * Support trial reuse with placement groups * Better check for just staged trials * Fix trial queuing * Wait for pg after trial termination * Clean up PGs before tune run * No PG settings in pbt scheduler * Fix buffering tests * Skip test if ray reports erroneous available resources * Disable PG for cluster resource counting test * Debug output for tests * Output in-use resources for placement groups * Don't start new trial on trial start failure * Add docs * Cleanup PGs once futures returned * Fix placement group shutdown * Use updated_queue flag * Apply suggestions from code review * Apply suggestions from code review * Update docs * Reuse placement groups independently from actors * Do not remove placement groups for paused trials * Only continue enqueueing trials if it didn't fail the first time * Rename parameter * Fix pause trial * Code review + try_recover * Update python/ray/tune/utils/placement_groups.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Move placement group lifecycle management * Move total used resources to pg manager * Update FAQ example * Requeue trial if start was unsuccessful * Do not cleanup pgs at start of run * Revert "Do not cleanup pgs at start of run" This reverts commit 933d9c4c * Delayed PG removal * Fix trial requeue test * Trigger pg cleanup on status update * Fix tests * Fix docs * fix-test Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-02-23 18:46:02 +01:00
javi-redondo	b8b2d6410d	[docs] new Ray Cluster documentation (#13839) Co-authored-by: Javier Redondo <javier@anyscale.com> Co-authored-by: AmeerHajAli <ameerh@berkeley.edu>	2021-02-15 00:47:14 -08:00
Kai Fricke	d29fcfb45c	[tune] catch SIGINT signal and trigger experiment checkpoint (#13767) * [tune] catch SIGINT signal and trigger experiment checkpoint * Apply suggestions from code review * Fix user guide docs * Update doc/source/tune/user-guide.rst	2021-02-02 14:52:09 +01:00
Kai Fricke	dc42abb2f5	[tune] placement group support (#13370)	2021-01-18 11:58:57 -08:00
Richard Liaw	86387504ee	[tune] fix small docs typo (#13355) Signed-off-by: Richard Liaw <rliaw@berkeley.edu>	2021-01-16 00:49:17 -08:00
Kai Fricke	518427627b	[tune] buffer trainable results (#13236) * Working prototype * Pass buffer length, fix tests * Don't buffer per default * Dispatch and process save in one go, added tests * Fix tests * Pass adaptive seconds to train_buffered, stop result processing after STOP decision * Fix tests, add release test * Update tests * Added detailed logs for slow operations * Update python/ray/tune/trial_runner.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Revert tests and go back to old tuning loop * nit Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-01-12 18:52:47 +01:00
Edwin Goh	a5ddc27bab	Fix typo in Tune Docs (Checkpointing) (#13348) See issue #13299	2021-01-11 20:27:18 -08:00
Kai Fricke	5f04ade6ef	[tune] add more stoppers and stopper documentation (#12750) * Add new stoppers & docs * Add tests for maximum iteration stopper and trial plateau stopper * Update python/ray/tune/stopper.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/tune/api_docs/stoppers.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/tune/api_docs/stoppers.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Apply suggestions from code review * Update python/ray/tune/stopper.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-12-12 01:47:19 -08:00
Richard Liaw	9ce7ad17fd	[tune] remove some bottlenecks in trialrunner (#12476)	2020-11-30 14:54:25 -08:00
Richard Liaw	e59fe65d3d	[tune] Fix logging for dockersyncer (#12196)	2020-11-23 14:29:41 -08:00
Keqiu Hu	0c1bdaef59	[tune] TensorFlow Distributed Trainable (#11876) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-11-10 14:59:08 -08:00
Kai Fricke	603accf1c2	[tune] logger refactor part 3: Add ExperimentLogger class (#11749)	2020-11-05 08:55:38 -08:00
Frank Gu	73fa94731f	[tune] Add HDFS as Cloud Sync Client (#11524)	2020-10-22 14:12:51 -07:00
Richard Liaw	a4b418d30c	[docs] update cloud docs (#11262) * update-cloud-docs Signed-off-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/config.rst Co-authored-by: Ian Rodney <ian.rodney@gmail.com> * fix Signed-off-by: Richard Liaw <rliaw@berkeley.edu> * fix Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Ian Rodney <ian.rodney@gmail.com>	2020-10-21 16:37:26 -07:00
Richard Liaw	56f858ed1a	[tune][docs/util] gputil check, docs (#11260) Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2020-10-10 00:54:31 -07:00
Kai Fricke	b450cb030a	[tune] reuse actors for function API (#11230) Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>	2020-10-08 16:15:02 -07:00
Kai Fricke	bdf647c4ec	[tune] docker syncer (#11035) * Add DockerSyncer * Add docs * Update python/ray/tune/integration/docker.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Updated docs * fix dir * Added docker integration test * added docker integration test to bazel build * Use sdk.rsync API Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-10-01 11:59:23 -07:00
Kai Fricke	c77cfaa5ad	[tune] use dated experiment dir per default (#11104)	2020-09-30 14:43:59 -07:00
Kai Fricke	e7315b0856	[tune] Callbacks for tune runs (#11001)	2020-09-27 16:50:07 -07:00
Richard Liaw	a563344bc2	[docs] remove ref to google groups -> github discussions (#11019)	2020-09-24 18:09:51 -07:00
Kai Fricke	d9c4dea7cf	[tune] strict metric checking (#10972)	2020-09-24 10:00:48 -07:00
Richard Liaw	b0ca70f628	[tune+core] tune lifecycle and starting ray guide (#10813)	2020-09-21 11:27:50 -07:00
Ameer Haj Ali	6edacb22b8	Fix abstraction violations in command_runner interface (#10715) * Fix abstraction violations in command_runner interface * user guide * lint * breaking abstraction in commands * extra initialization commands * more cleanup * small fixes * fix test_integration_kubernetes.py * lint Co-authored-by: root <root@ip-172-31-28-155.us-west-2.compute.internal> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>	2020-09-14 20:28:38 -07:00
Max Fitton	017737b82b	[Documentation] `local_mode` doc updates and actor / worker explanation from Slack (#10748) * wip * Update local mode docs in all locations * Update doc/source/actors.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/actors.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Change duplicated text to links to a subtitle for local_mode * change a reference to be explicit * Apply suggestions from code review Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-09-14 13:19:38 -07:00

60 commits