* switch to ARM templates for config and VMs
* auto-formatting
* addressed Scott's comment
* added missing imports
* fixed gpu templates
* fixed wheel reference
* added missing reference
* cleanup wording and yamls
* Update doc/source/autoscaling.rst
Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>
Co-authored-by: Ubuntu <marcozo@marcozodev2.zqvgrdyupqrudayw1il1agipig.jx.internal.cloudapp.net>
Co-authored-by: Scott Graham <5720537+gramhagen@users.noreply.github.com>
* adding directory and node_provider entry for azure autoscaler
* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating
* adding todos and switching to auth file for service principal authentication
* adding role / scope to service principal
* resolving issues with app credentials
* adding retry for setting service principal role
* typo and adding retry to nic creation
* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing
* linting
* updating cleanup and fixing bugs
* minor fixes
* first working version :)
* added tag support
* added msi identity intermediate
* enable MSI through user managed identity
* updated schema
* extend yaml schema
remove service principal code
add reuse of user-managed identity
* fix rg_id
* fix logging
* replace manual cluster yaml validation with json schema
- improved error messages
- support for IntelliSense in VS Code (or other IDEs)
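For reference, a minimal sketch of what such a schema fragment might look like (the field names are illustrative, not the actual shipped schema; authored here in YAML for readability, while the real schema is JSON):

    $schema: "http://json-schema.org/draft-07/schema#"
    type: object
    required: [provider]
    properties:
        max_workers:
            type: integer
            minimum: 0
        provider:
            type: object
            required: [type]
            properties:
                type:
                    type: string

Validating a cluster YAML against a schema like this lets the validator report exactly which key failed and why, and IDEs can reuse the same file for autocompletion.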
* run linting
* updating yaml configs and formatting
* typo in example config
* pulling default config from example-full
* resetting the min and initial worker properties
* adding docs for azure autoscaler and fixing status
* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment
* fix for default subscription in azure node provider
* vm dev image build
* minor change
* keeping example-full.yaml in autoscaler/azure, updating azure example config
* linting azure config
* extending retries on azure config
* lint
* support for internal IPs, fixes to the Azure docs, and a new Azure GPU example config (sketched below)
* linting
* Update python/ray/autoscaler/azure/node_provider.py
Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>
* revert_this
* remove_schema
* updating configs and removing ssh keygen, tweak azure node provider terminate
* minor tweaks
Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
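Taken together, the Azure commits above revolve around the cluster YAML. A minimal sketch of the shape such a config might take (the resource names, VM sizes, and the exact spot-priority key are assumptions for illustration, not the shipped example-full.yaml):

    cluster_name: azure-example
    provider:
        type: azure
        location: westus2
        resource_group: ray-cluster
    auth:
        ssh_user: ubuntu
        ssh_private_key: ~/.ssh/id_rsa
    head_node:
        azure_arm_parameters:
            vmSize: Standard_D2s_v3
    worker_nodes:
        azure_arm_parameters:
            vmSize: Standard_NC6   # GPU workers, per the GPU example config
            priority: Spot         # spot instances (key name assumed)

The provider section identifies where the nodes are deployed, and the per-node azure_arm_parameters presumably feed into the ARM templates introduced above.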
* make starting ray a separate page
* concept
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* more fixes
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* add pages with examples of training language models with fairseq and the Ray autoscaler
* better format
* update ray_train.sh
* Move EFS to the autoscaler file
* nits
* add comments to the code & use a new way to implement the checkpoint hook
* small bug fix
* polish the doc
* fix formatting
* yaml
* update docs
* fix the bugs and add preprocess.sh
* fix lint
* Reduce batch size & fix lint
* shorter title
- Surfaces local cluster usage
- Increases visibility of these instructions
- Removes some Docker docs (that are really out of scope for Ray documentation, IMO)
Closes #3517.
Adds a tmux flag that can be used to support background execution of experiments. It cannot be used together with the screen flag. This seems to be a useful feature that has come up with different users.
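A sketch of the resulting usage (flag spelling assumed from the description above):

    ray exec cluster.yaml 'python train.py' --tmux --start

i.e. the same shape as the --screen invocation shown further below, but with the command left running in a detached tmux session.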
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head IP and the list of worker IPs.
There are also a couple of additional utilities added for rsyncing files and port forwarding.
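A sketch of what such a config might look like (the provider type and field names are assumptions based on the description above, with placeholder IPs):

    cluster_name: existing-nodes
    provider:
        type: local
        head_ip: 192.0.2.10
        worker_ips:
            - 192.0.2.11
            - 192.0.2.12
    auth:
        ssh_user: ubuntu
        ssh_private_key: ~/.ssh/id_rsa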
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]
Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop
This single command creates a cluster and runs the given command on it in a screen session. The screen session can later be attached to via ray attach. After the command finishes, the cluster workers are terminated and the head node is stopped.
* Fri Feb 16 13:53:50 PST 2018
* Sat Feb 17 15:32:08 PST 2018
* Sat Feb 17 15:44:59 PST 2018
* fix
* Sun Feb 18 14:46:24 PST 2018
* Sun Feb 18 14:46:37 PST 2018
* Sun Feb 18 14:55:52 PST 2018
* Sun Feb 18 15:14:32 PST 2018
* Wed Feb 21 17:34:17 PST 2018
* Sun Feb 25 17:51:17 PST 2018
* Sun Feb 25 22:18:40 PST 2018
* Wed Feb 28 13:19:05 PST 2018
* Wed Feb 28 13:22:13 PST 2018
* Wed Feb 28 13:33:29 PST 2018
* Wed Feb 28 13:35:33 PST 2018
* add ex
* Fri Mar 2 12:50:17 PST 2018
* Fri Mar 2 12:54:31 PST 2018
* some autoscaling config tweaks
* Sun Jan 14 13:56:55 PST 2018
* Mon Jan 15 14:21:09 PST 2018
* increase backoff
* Mon Jan 15 14:40:47 PST 2018
* check boto version