Commit graph

814 commits

Author SHA1 Message Date
Simon Mo
6b04664645
[Serve] Add Tutorial for Batch Inference (#8490) 2020-05-29 09:55:47 -07:00
SangBin Cho
448011f822
0.8.5 Release change. (#8358) 2020-05-28 09:37:19 -07:00
Bill Chambers
fadd47e44e
[docs] Ray Serve Documentation Overhaul (#8524) 2020-05-27 11:03:28 -05:00
Sven Mika
2746fc0476
[RLlib] Auto-framework, retire use_pytorch in favor of framework=... (#8520) 2020-05-27 16:19:13 +02:00
Sven Mika
0422e9c5a8
[RLlib] Add 2 Transformer learning test cases on StatelessCartPole (PPO and IMPALA). (#8624) 2020-05-27 10:19:47 +02:00
Bill Chambers
b3d686b78f
[docs] Add Overview Section & Gentle Introduction (#8517) 2020-05-26 10:39:34 -05:00
Edward Oakes
860eb6f13a
Update named actor API (#8559) 2020-05-24 20:08:03 -05:00
Eric Liang
9a83908c46
[rllib] Deprecate policy optimizers (#8345) 2020-05-21 10:16:18 -07:00
Ian Rodney
f56b3be916
[Docs] Add Cancelation to main docs. (#8508)
* Update walkthrough.rst

* Adding example

* Better example

* Better example

* Adding Ray Kill Info
2020-05-20 10:31:57 -07:00
Bill Chambers
f8f7efc24f
[Serve] Rename RayServe -> "Ray Serve" in Documentation (#8504) 2020-05-19 19:13:54 -07:00
Simon Mo
c9c84c87f4
[Serve] Add Instructions for GPU (#8495) 2020-05-19 18:33:58 -07:00
Max Fitton
13231ba63b
Rename redis-port to port and add default (#8406) 2020-05-18 13:25:34 -05:00
Richard Liaw
b6c4f45ae0
[tune] Fix links (#8477) 2020-05-18 10:08:29 -07:00
Edward Oakes
9a721ed71a
Link to serve in tune overview (#8487) 2020-05-18 11:29:38 -05:00
Sven Mika
796a834c48
[RLlib] Attention Net integration into ModelV2 and learning RL example. (#8371) 2020-05-18 17:26:40 +02:00
Richard Liaw
87cbf2aedd
[docs][tune] Make search algorithm, scheduler docs better! (#8179) 2020-05-17 12:19:44 -07:00
SangBin Cho
2f01776d09
Fix ray memory example (#8462) 2020-05-17 11:34:11 -05:00
Tao Wang
acffdb2349
[TEST]use cc_test to run core_worker_test, enforce/reuse RedisServiceManagerForTest (#8443) 2020-05-17 18:43:00 +08:00
Edward Oakes
fb23bd6fc0
[serve] Optionally namespace serve clusters (#8447) 2020-05-17 00:14:42 -05:00
Richard Liaw
67c01455fe
[tune] tune.track -> tune.report (#8388) 2020-05-16 12:55:08 -07:00
Stephanie Wang
bd169749e0
Option to retry failed actor tasks (#8330)
* Python

* Consolidate state in the direct actor transport, set the caller starts at

* todo

* Remove unused

* Update and unit tests

* Doc

* Remove unused

* doc

* Remove debug

* Update src/ray/core_worker/transport/direct_actor_transport.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/core_worker/transport/direct_actor_transport.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* lint and fix build

* Update

* Fix build

* Fix tests

* Unit test for max_task_retries=0

* Fix java?

* Fix bad test

* Cross language fix

* fix java

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-05-15 20:15:15 -07:00
Edward Oakes
ef498e8aa5
[serve] Add basic session affinity via shard key (#8449) 2020-05-15 16:18:52 -05:00
Max Fitton
00325eb2b2
Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
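The max_restarts entry above, like the earlier max_task_retries work, uses a counting convention in which -1 means "unlimited". Below is a minimal sketch of that convention. The helpers `may_restart` and `run_with_restarts` are hypothetical names for illustration only, not Ray APIs. The sketch also assumes that 0 disables retries entirely, which is what the max_task_retries=0 unit test suggests.

```python
def may_restart(restarts_so_far: int, max_restarts: int) -> bool:
    """Hypothetical check: is one more restart allowed?

    Assumed convention: -1 means unlimited restarts, 0 means never restart,
    and any positive n allows at most n restarts.
    """
    if max_restarts == -1:
        return True
    if max_restarts < -1:
        raise ValueError("max_restarts must be -1 or a non-negative integer")
    return restarts_so_far < max_restarts


def run_with_restarts(fn, max_restarts: int):
    """Hypothetical driver: call fn() until it succeeds or the budget runs out."""
    restarts = 0
    while True:
        try:
            return fn()
        except Exception:
            # Re-raise the last failure once no restarts remain.
            if not may_restart(restarts, max_restarts):
                raise
            restarts += 1
```

With this convention, a task that fails twice and then succeeds completes under `max_restarts=2` or `-1`, but surfaces its error under `max_restarts=1`.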
Eric Liang
eabb801a40
less important (#8439) 2020-05-13 22:52:38 -07:00
Siyuan (Ryans) Zhuang
ab278071ac
Update serialization doc (#8381)
* update serialization doc
2020-05-12 16:47:00 -07:00
Jason McGhee
24ced808cd
Fix config key in docs for using PyTorch (#8300)
Docs improperly suggest using "torch" when the actual flag is called "use_pytorch"
2020-05-11 12:41:21 -07:00
Eric Liang
f48da50e1c
[rllib] observation function api for multi-agent (#8236) 2020-05-04 22:13:49 -07:00
Rüdiger Busche
e93ec3134a
Use kubectl delete pod in example (#8295)
Co-authored-by: rbusche <rbusche@inserve.de>
2020-05-04 21:39:30 -05:00
Sven Mika
b95e28faea
[RLlib] APEX_DDPG (PyTorch) test case and docs. (#8288)
APEX_DDPG (PyTorch) test case and docs.
2020-05-04 09:36:27 +02:00
Sven Mika
166bb5d690
[RLlib] IMPALA PyTorch (#8287)
This PR adds an IMPALA PyTorch implementation.

- adds compilation tests for LSTM and w/o LSTM.
- adds learning test for CartPole.
2020-05-03 13:44:25 +02:00
Sven Mika
42991d723f
[RLlib] rllib/examples folder restructuring (#8250)
Cleans up the rllib/examples folder by moving all example Envs into rllib/examples/env (so they can be used by other scripts and tests as well).
2020-05-01 22:59:34 +02:00
Edward Oakes
6373c70661
[serve] Refactor BackendConfig (#8202) 2020-04-30 22:31:07 -05:00
Edward Oakes
95d187e556
[serve] Add delete_endpoint call (#8256) 2020-04-30 20:59:07 -05:00
Edward Oakes
43be73e4cf
[serve] Add delete_backend call (#8252) 2020-04-30 13:10:39 -05:00
mehrdadn
254b1ec370
Set up testing and wheels for Windows on GitHub Actions (#8131)
* Move some Java tests into ci.sh

* Move C++ worker tests into ci.sh

* Define run()

* Prepare to move Python tests into ci.sh

* Fix issues in install-dependencies.sh

* Reload environment for GitHub Actions

* Move wheels to ci.sh and fix related issues

* Don't bypass failures in install-ray.sh anymore

* Make CI a little quieter

* Move linting into ci.sh

* Add vitals test right after build

* Fix os.uname() unavailability on Windows

Co-authored-by: Mehrdad <noreply@github.com>
2020-04-29 21:19:02 -07:00
Simon Mo
101255f782
[Serve] RayServe TF, PyTorch, Sklearn Examples (#8156) 2020-04-28 22:24:55 -07:00
Simon Mo
af3d3e778e
[RayServe] Specify installation instruction in doc (#8220) 2020-04-28 14:38:10 -07:00
Richard Liaw
be5235d982
[tune] Clarify Intro Tune Documentation (#8201) 2020-04-27 18:01:00 -07:00
ijrsvt
a77e5a8cbf
[Doc] Fix Docstring for Task Cancellation (#8198) 2020-04-27 17:06:08 -07:00
Robert Zangnan Yu
a77b19e4f2
[docs] Comments on potential srun orders during Slurm Deployment (#8183) 2020-04-27 09:30:16 -07:00
Richard Liaw
87557a00fa
[tune] Refactor search algorithms (#7037)
* start refactoring of search algorithms

* format

* needs tests

* fix

* suggestions

* Fix PBT

* lint

* refactoring

* hyperopt_working

* dragonfly

* hyperopt

* change_half_of_algs

* save

* code-removed

* remove_lots_of_unneccessary

* changes

* formatting

* suggest

* reset

* rm

* tests

* search-change

* exception

* refactor-doc

* search

* py

* moredocs

* Update doc/source/tune-searchalg.rst

* concurrency

* max

* tune

* betterwarning

* bohb

* tests

* test-change

Co-authored-by: ujvl <misraujval@gmail.com>
2020-04-27 08:51:13 -07:00
Richard Liaw
b506f87117
[tune] New Doc edits, add Concepts page (#8083)
Co-Authored-By: Sven Mika <sven@anyscale.io>
2020-04-25 18:25:56 -07:00
Sven Mika
499ad5fbe4
[RLlib] PyTorch version of APPO. (#8120)
- Translate all vtrace functionality to torch and add torch to the framework_iterator loop in all existing vtrace test cases.
- Add learning test cases for APPO torch (both w/ and w/o v-trace).
- Add quick compilation tests for APPO (tf and torch, v-trace and no v-trace).
2020-04-23 09:11:12 +02:00
Edward Oakes
505f3a8714
[serve] Remove serve.link(), rename serve.split() -> serve.set_traffic() (#8072) 2020-04-21 14:26:07 -05:00
Sven Mika
d15609ba2a
[RLlib] PyTorch version of ARS (Augmented Random Search). (#8106)
This PR implements a PyTorch version of RLlib's ARS algorithm using RLlib's functional algo builder API. It also adds a regression test for ARS (torch) on CartPole.
2020-04-21 09:47:52 +02:00
Sven Mika
3812bfedda
[RLlib] PyTorch version of ES (Evolution Strategies). (#8104)
PyTorch version of Evolution Strategies (ES) Algo.
2020-04-20 21:47:28 +02:00
Bill Chambers
77655749fb
[RayServe] RayServe Introduction and Overview (#8038) 2020-04-20 12:05:59 -05:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063)
SAC (both torch and tf versions) are showing issues (crashes) due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs).
This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).
2020-04-19 10:20:23 +02:00
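The SAC entry above attributes the crashes to numerical instability in SquashedGaussian sampling and log-probabilities after extreme NN outputs. The sketch below is an illustration only, not RLlib's code. It computes the 1-D log-probability of a = tanh(x), with x ~ N(mean, exp(log_std)^2), and stays finite for extreme inputs.

Two choices in the sketch are assumptions, not taken from the PR:

- The log_std clamp range is an illustrative choice.
- Rewriting log(1 - tanh(x)^2) through softplus is a common stabilization technique, not necessarily the one the PR used.

```python
import math

# Illustrative clamp range for the raw log-std output (an assumption).
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0


def softplus(v: float) -> float:
    """Stable log(1 + exp(v)) that does not overflow for large |v|."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def squashed_gaussian_logp(pre_tanh_x: float, mean: float, log_std: float) -> float:
    """log p(a) for a = tanh(x), x ~ N(mean, exp(log_std)^2), 1-D case."""
    # Clamp so exp(log_std) neither underflows to 0 nor overflows.
    log_std = min(max(log_std, LOG_STD_MIN), LOG_STD_MAX)
    std = math.exp(log_std)
    z = (pre_tanh_x - mean) / std
    gaussian_logp = -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: subtract log|da/dx| = log(1 - tanh(x)^2).
    # The identity log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x))
    # stays finite even when tanh(x) rounds to exactly +/-1.
    log_det = 2.0 * (math.log(2.0) - pre_tanh_x - softplus(-2.0 * pre_tanh_x))
    return gaussian_logp - log_det
```

For moderate x this agrees with the naive formula. At x = 50, tanh(x) rounds to 1.0 in double precision, so the naive `math.log(1 - math.tanh(x) ** 2)` fails with a domain error. The rewritten form returns a finite value.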
Sumanth Ratna
bdb03a0544
[tune] Update dragonfly installation instructions (#8086)
Closes #8084
2020-04-18 20:25:38 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example (#7792)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* experimental

* as_trainable

* fix

* ok

* format

* create_torch_pbt

* setup_pbt

* ok

* format

* ok

* format

* docs

* ok

* Draft head-is-worker

* Fix missing concurrency between local and remote workers

* Fix tqdm to work with head-is-worker

* Cleanup

* Implement state_dict and load_state_dict

* Reserve resources on the head node for the local worker

* Update the development cluster setup

* Add spot block reservation to the development yaml

* ok

* Draft the fault tolerance fix

* Small fixes to local-remote concurrency

* Cleanup + fix typo

* fixes

* worker_counts

* some formatting and asha

* fix

* okme

* fixactorkill

* unify

* Revert the cluster mounts

* Cut the handler-reporter API

* Fix most tests

* Rm tqdm_handler.py

* Re-add tune test

* Automatically force-shutdown on actor errors on shutdown

* Formatting

* fix_tune_test

* Add timeout error verification

* Rename tqdm to use_tqdm

* fixtests

* ok

* remove_redundant

* deprecated

* deactivated

* ok_try_this

* lint

* nice

* done

* retries

* fixes

* kill

* retry

* init_transformer

* init

* deployit

* improve_example

* trans

* rename

* formats

* format-to-py37

* time_to_test

* more_changes

* ok

* update_args_and_script

* fp16_epoch

* huggingface

* training stats

* distributed

* Apply suggestions from code review

* transformer

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00