Commit graph

3334 commits

Author SHA1 Message Date
zhu-eric
3845c97dd0 [doc] Hyperparameter Tuning Gallery Entry (#5786)
* mod_table

* Example fix for gallery

* lint

* nit

* nit

* fix

* gallery

* remove table for now

* training, object store, tune, actors, advanced

* start tf code

* first cut tf

* yapf

* pytorch

* add torch example

* torch

* parallel

* tune

* tuning

* reviewsready

* finetune

* fix

* move_code

* update conf

* compile

* init hyperparameter

* Start images

* overview

* extra

* fix

* works

* update-ps-example

* param_actor

* fix

* examples

* simple

* simplify_pong

* flake8 and run hyperopt

* add comments

* add comments

* add suggestion

* add suggestion

* suggestions

* add suggestion

* add suggestions

* fixed in wrong area

* last edit

* finish changes

* add line

* hyperparameter
2019-10-08 14:13:17 -07:00
Matthew A. Wright
4aa06918ae Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00
Edward Oakes
42dd0fae96
Fix actor ID collision in local mode (#5863)
* Fixed local mode actor id

* Update python/ray/actor.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Added hyphen to match comments

* Added tests to test_local_mode

* Helloworld

* Better test naming

* lint
2019-10-08 13:07:42 -07:00
Ujval Misra
375852af23 [tune] Check node liveness before result fetch (#5844)
* Check if trial's node is alive before trying to fetch result

* Added function for failed trials to trial_executor interface

* Address comments, add test.
2019-10-08 11:41:01 -07:00
waldroje
054583ffe6 [tune] MedianStopping on result (#5402)
* added class median_stopping_result to schedulers and updated __init__

* Dicts flatten and combine schedulers.

MedianStoppingRule is now combined with MedianStoppingResult; I think
the functionality is essentially the same so there's no need to
duplicate.

Dict flattening was already taken care of in a separate PR, so I've
reverted that.

* lint

* revert

* remove time sharing and simplify state

* fix

* fixtests

* update property names and types to reflect suggestions by ray developers, merged get_median_result and get_best_result into a single method to eliminate duplicate steps, added resource check on PAUSE condition, modified utility function to use updated properties

* updated tests for median_stopping_result in separate file

* remove stray characters from previous merge conflict

* reformatted and cleaned up dependencies from running code format and linting

* update scheduler to coordinate eval interval

* modify median_stopping_result to synchronize result evaluation at regular intervals, driven by least common interval

* add some logging info to median_result

* add new scheduler, SyncMedianStoppingResult, which evaluates and stops trials in a synchronous fashion

* Cleanup median_stopping_rule

- remove eval_interval
- pause trials with insufficient samples if there are other waiting trials
- compute score only for trials that have reached result_time

* Remove extraneous classes

* Fix median stopping rule tests

* Added min_time_slice flag to reduce potential checkpointing cost

* Only compute mean after grace

* Relegate logging to debug mode
2019-10-08 11:40:41 -07:00
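
Note on the entry above: its bullets mention a grace period, scoring only trials that have reached the current result time, and waiting when there are too few samples. The following is a minimal sketch of a generic median stopping rule built on those three ideas. It is not the code from this commit; `should_stop`, `running_mean`, `grace_period`, `min_peers`, and the `(time, value)` history layout are all hypothetical names chosen for illustration, and higher values are assumed to be better.

```python
from statistics import median


def running_mean(results, up_to_time):
    """Mean of all reported values at or before up_to_time (None if empty)."""
    values = [value for time, value in results if time <= up_to_time]
    return sum(values) / len(values) if values else None


def should_stop(trial_id, history, time, grace_period=2, min_peers=2):
    """Return True if trial_id looks worse than the median of its peers.

    history maps a trial id to a time-ordered list of (time, value) pairs.
    All names and defaults here are illustrative assumptions.
    """
    # Never stop a trial during the grace period.
    if time < grace_period:
        return False

    # Score only peers that have already reported a result at `time` or later.
    peer_means = []
    for other_id, results in history.items():
        if other_id == trial_id or not results:
            continue
        if results[-1][0] < time:
            continue
        peer_means.append(running_mean(results, time))

    # With too few comparable peers, keep the trial running.
    if len(peer_means) < min_peers:
        return False

    best = max(
        (value for t, value in history[trial_id] if t <= time), default=None
    )
    if best is None:
        return False
    return best < median(peer_means)


history = {
    "a": [(1, 0.1), (2, 0.2), (3, 0.2)],
    "b": [(1, 0.5), (2, 0.6), (3, 0.7)],
    "c": [(1, 0.4), (2, 0.5), (3, 0.6)],
}
decisions = {tid: should_stop(tid, history, time=3) for tid in history}
```

In this toy history only trial `a` falls below the median of its peers' running means, so it is the only one flagged for stopping.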
Edward Oakes
486abedcdf
Link to kubernetes config files in docs (#5865) 2019-10-08 11:06:25 -07:00
Philipp Moritz
785670bc18
Fix class attributes and methods for actor classes (#5802) 2019-10-07 23:56:07 -07:00
Edward Oakes
08e4e3a153
[core worker] Submit Python actor tasks through core worker (#5750)
* Submit actor tasks through core worker

* Fix java

* add comment

* Remove task builder

* Check negative

* Increase -> Increment

* pass by reference

* fix signal

* Clean up c++ actor handle

* more cleanup

* Clean up headers

* Fix unique_ptr construction

* Fix java

* Move profiling to c++

* dedup

* fix error

* comments

* fix java

* Fix tests

* wait for actor to exit

* Start after constructor

* ignore java build

* fix comment

* always init logging

* Fix logging

* fix logging issue

* shared_ptr for profiler

* DEBUG -> WARNING

* fix killed_ init

* Fix flaky checkpointing tests

* -v flag for tune tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix test exception info

* Fix import

* fix build

* Fix test

* shared_ptr
2019-10-07 15:42:19 -07:00
Eric Liang
04e997fe0d
Fix TF2 / rllib test (#5846) 2019-10-07 14:25:16 -07:00
Simon Mo
9bb3633cd9
[Serve] Implement metric interface (#5852)
* Implement metric interface

* Address comment: made actor_handles a dict

* Fix iteration

* Lint

* Mark lightweight actors as num_cpus=0 to prevent resource starvation

* Be more explicit about the readiness condition

* Make task_runner non-blocking

* Lint
2019-10-07 09:29:26 -07:00
Simon Mo
25dde48607
[Serve] Implement replica scaling (#5850)
* Implement replica scaling

* Lint

* Fix .travis.yml so it won't skip if only serve affected
2019-10-07 01:57:31 -07:00
Erik Cederstrand
5834c56c64 Restore support for Python 3.5 (#5818)
* Advertise that Python >= 3.6 is needed

ray/tune/examples/ax_example.py contains f-strings which limits support of this package to Python 3.6 and up.

* Python 3.5 does not support f-strings

Rewrite by using format()

* Lower required version after 9f88fe9d

* Remove python_requires again by request

* Fix linter warning
2019-10-07 00:10:00 -07:00
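
Note on the entry above: it replaces f-strings with `format()` because f-strings require Python 3.6 or later. A small sketch of that kind of rewrite follows; the function and variable names are hypothetical and are not taken from `ax_example.py`.

```python
def describe_trial(name, score):
    # Python 3.6+ only:
    #     return f"{name} finished with score {score:.2f}"
    # Equivalent form that also runs on Python 3.5:
    return "{} finished with score {:.2f}".format(name, score)


message = describe_trial("trial_1", 0.9137)
```

The format specification after the colon (`:.2f` here) carries over unchanged between the two styles.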
Simon Mo
e8570874b6
[Serve] Implement flask_request and named python request (#5849)
* Implement flask_request and named python request

* Forgot to include missing files

* Address comment

* Add flask to requirements for doc (lint failed)

* Update doc requirement so lint will build

* Install flask in CI

* Fix typo in .travis.yml
2019-10-06 15:12:30 -07:00
Anthony Yu
b99cdf4e39 [tune] PBT + Memnn example (#5723)
* Add example file

* Move into train function

* Somewhat working example of MemNN, still has some failed trials

* Reorganize into a class

* Small fixes

* Iteration decrease and fix hyperparam_mutations

* Some style edits

* Address PR changes without modifying learning rate

* Add configs and hyperparameter mutations

* Add tune test

* Modify import locations

* Some parameter changes for testing

* Update memnn example

* Add tensorboard support and address PR comment

* Final changes

* lint

* generator
2019-10-05 09:22:37 -07:00
Eric Liang
fb33160df8
Fix obs space lo/hi (#5826) 2019-10-04 09:28:06 -07:00
Edward Oakes
17c6835c3f
Just die on signal (#5842) 2019-10-03 18:21:21 -07:00
Edward Oakes
8ca7fab581
Improve manual Kubernetes deployment documentation (#5582)
* Add ray-cluster, modify submit

* Add comments

* Job submission working

* Write docs

* Add link to autoscaling

* Fix wget link in job

* Use namespace file

* match tense

* fix tab

* Improve job documentation

* comments

* Fix link

* Fix links

* comments

* add overview paragraph

* Update imagePullPolicy

* Warning if no cluster running

* better check
2019-10-03 15:47:49 -07:00
Si-Yuan
3a42780cb8
Improved Pickle5 pickling (#5841)
* object copy optimization

* see if we can reuse the Arrow parallel_memcopy

* remove unused function

* restore the original code, since later experiments show that it has little impact on performance.

* lint
2019-10-03 15:14:32 -07:00
Simon Mo
fa1214c44a
[Serve] First iteration of the serve doc (#5834)
* Address comments

* Lint

* Add py3 warning
2019-10-03 15:14:09 -07:00
Philipp Moritz
0dee225ce1
Make it possible to run ray examples as projects (#5816) 2019-10-03 14:52:37 -07:00
Edward Oakes
972dddd776
[autoscaler] Kubernetes autoscaler backend (#5492)
* Add Kubernetes NodeProvider to autoscaler

* Split off SSHCommandRunner

* Add KubernetesCommandRunner

* Cleanup

* More config options

* Check if auth present

* More auth checks

* Better output

* Always bootstrap config

* All working

* Add k8s-rsync comment

* Clean up manual k8s examples

* Fix up submit.yaml

* Automatically configure permissions

* Fix get_node_provider arg

* Fix permissions

* Fill in empty auth

* Remove ray-cluster from this PR

* No hard dep on kubernetes library

* Move permissions into autoscaler config

* lint

* Fix indentation

* namespace validation

* Use cluster name tag

* Remove kubernetes from setup.py

* Comment in example configs

* Same default autoscaling config as aws

* Add Kubernetes quickstart

* lint

* Revert changes to submit.yaml (other PR)

* Install kubernetes in travis

* address comments

* Improve autoscaling doc

* kubectl command in setup

* Force use_internal_ips

* comments

* backend env in docs

* Change namespace config

* comments

* comments

* Fix yaml test
2019-10-03 10:17:00 -07:00
Ujval Misra
9df6eda84f [tune] Add error case for member functions passed as stopping c… (#5823) 2019-10-03 09:49:03 -07:00
Si-Yuan
2fb7d7846f Initial implementation of Cython pickle5 support (#5725) 2019-10-03 09:20:26 -07:00
Philipp Moritz
9a71d6ce3a
Build dashboard only once in the wheel build and make sure caching is working for wheel builds (#5784)
* build dashboard only once

* update

* debug

* caching?

* update

* update
2019-10-02 16:29:11 -07:00
Edward Oakes
4e049232a8
shared_ptr (#5830) 2019-10-02 16:29:04 -07:00
Philipp Moritz
26834bcf94
Add message about tests passing and flaky tests to PR template (#5833) 2019-10-02 15:23:34 -07:00
Edward Oakes
ef1a61ab57
Log output in test_dead_actors.py (#5831) 2019-10-02 14:40:55 -07:00
Stephanie Wang
dc80e6be3d Add screen argument (#5808) 2019-10-01 15:18:19 -07:00
Edward Oakes
963bbe8bbd
Move profiling to c++ (#5771)
* Move profiling to c++

* comments

* Fix tests

* Start after constructor

* fix comment

* always init logging

* Fix logging

* fix logging issue

* shared_ptr for profiler

* DEBUG -> WARNING

* fix killed_ init

* Fix flaky checkpointing tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix import

* fix build

* use boost::asio

* fix double const

* Properly reset async_wait

* remove SIGINT

* Change error message

* increase timeout

* small nits

* Don't trap on SIGINT

* -v for tune

* Fix test
2019-10-01 10:06:25 -07:00
Edward Oakes
443feb75f0 Fix test (#5810) 2019-09-30 19:39:53 -07:00
Richard Liaw
e54c487d18 [hotfix] Docker (#5809)
* configspace

* reorder
2019-09-30 16:39:00 -07:00
Wenjie Wu
ccd88c9e20 [doc] fix typo in ASHA blog url (#5801)
This fixes issue #5800
2019-09-29 17:41:18 -07:00
Eric Liang
81ee887f91
Preserve the original exception type when converting to RayTaskError (#5799) 2019-09-28 17:03:15 -07:00
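
Note on the entry above: one common way to keep the original exception type when wrapping an error is to build a class that inherits from both the wrapper and the original type, so that `except ValueError` still matches. The sketch below illustrates that general technique only. `TaskError` and `wrap_preserving_type` are hypothetical stand-ins, and nothing here is taken from the commit itself.

```python
class TaskError(Exception):
    """Hypothetical stand-in for a task-failure wrapper exception."""

    def __init__(self, cause):
        super().__init__(str(cause))
        self.cause = cause


def wrap_preserving_type(cause):
    """Wrap `cause` so the result is both a TaskError and type(cause)."""
    cause_cls = type(cause)
    if issubclass(TaskError, cause_cls):
        # cause is a plain Exception/BaseException: TaskError already matches.
        return TaskError(cause)
    # Build a one-off class deriving from the wrapper and the original type.
    # This assumes the original type accepts a single message argument.
    wrapped_cls = type(
        "TaskError({})".format(cause_cls.__name__), (TaskError, cause_cls), {}
    )
    return wrapped_cls(cause)


try:
    raise wrap_preserving_type(ValueError("bad input"))
except ValueError as err:  # still caught by the original type
    caught = err
```

Callers that catch the original type and callers that catch the wrapper both see the same object, and the underlying exception stays available as `caught.cause`.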
Eric Liang
493364d3bd
[autoscaler] Add unit tests for stopped node caching, fix flaky tests (#5793) 2019-09-27 22:36:09 -07:00
Edward Oakes
86610a30c9
[flaky test] Fix flaky checkpointing tests (#5791)
* Fix flaky checkpointing tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix import

* fix build
2019-09-27 11:03:07 -07:00
Richard Liaw
baf85c6665
[tune/sgd] Fix Jenkins (#5765) 2019-09-27 09:59:08 -07:00
Eric Liang
b5da32df78 Bump Ray version in documentation to dev5 (#5794) 2019-09-27 00:19:17 -07:00
Richard Liaw
5c549fd84b
[docs] Make slack more prominent (#5792)
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
2019-09-26 15:36:56 -07:00
Philipp Moritz
01d6362472
Serialize StringIO with pickle (#5781) 2019-09-26 12:55:14 -07:00
Philipp Moritz
57a5871ea6
Convert long running stress tests to projects (#5641) 2019-09-26 11:25:09 -07:00
Eric Liang
5ecb02fb80
Release 0.7.5 updates (#5727) 2019-09-26 10:30:37 -07:00
Edward Oakes
8a33891a40
Include object size in full error (#5782) 2019-09-25 17:04:17 -07:00
Robert Nishihara
ddfe9439c8
Add sphinx-gallery requirement to readthedocs. (#5780) 2019-09-25 14:46:56 -07:00
Robert Nishihara
18ce7bda2b
Fix flaky test_actors_and_tasks_with_gpus_version_two test. (#5756) 2019-09-25 11:47:47 -07:00
Edward Oakes
d499601bd7 Fix flaky checkpoint tests (#5778) 2019-09-25 10:55:17 -07:00
Eric Liang
c6919d315d
[rllib] Remove TorchPolicy locks (#5764)
* remove torch lock

* remove lock
2019-09-24 17:52:16 -07:00
Richard Liaw
10f21fa313
[docs] Convert Examples to Gallery (#5414) 2019-09-24 15:46:56 -07:00
Zhijun Fu
ea9376c9ce Fix flaky core worker tests because of race condition in gcs client subscription (#5735) 2019-09-24 22:47:38 +08:00
Kai Yang
c580955840 [Java] Fix some potential bugs about Ray.shutdown() (#5693) 2019-09-24 10:44:17 +08:00
Ujval Misra
a4659a8f8b [tune] Add support for function-based stopping condition (#5754) 2019-09-23 18:39:00 -07:00
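
Note on the entry above: a function-based stopping condition can be pictured as a plain callable that looks at a trial's latest result and returns a boolean. The sketch below is illustrative only. The `(trial_id, result)` signature, the result keys, and the thresholds are assumptions and are not taken from this log.

```python
def stop_when_good_enough(trial_id, result):
    """Hypothetical stopping function: True means the trial should stop."""
    reached_target = result["mean_accuracy"] >= 0.95
    out_of_budget = result["training_iteration"] >= 100
    return reached_target or out_of_budget


# Simulated stream of results for one trial (made-up numbers).
results = [
    {"mean_accuracy": 0.80, "training_iteration": 1},
    {"mean_accuracy": 0.91, "training_iteration": 2},
    {"mean_accuracy": 0.96, "training_iteration": 3},
]
stopped_at = next(
    r["training_iteration"]
    for r in results
    if stop_when_good_enough("trial_1", r)
)
```

A module-level function like this is easy to pass around, whereas the related entry 9df6eda84f in this log adds an error case for member functions passed as a stopping condition.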