Commit graph

318 commits

Author SHA1 Message Date
Philipp Moritz
a3f8fa426b Start integrating new GCS APIs (#1379)
* Start integrating new GCS calls

* fixes

* tests

* cleanup

* cleanup and valgrind fix

* update tests

* fix valgrind

* fix more valgrind

* fixes

* add separate tests for GCS

* fix linting

* update tests

* cleanup

* fix python linting

* more fixes

* fix linting

* add plasma manager callback

* add some documentation

* fix linting

* fix linting

* fixes

* update

* fix linting

* fix

* add spillback count

* fixes

* linting

* fixes

* fix linting

* fix

* fix

* fix
2018-01-31 11:01:12 -08:00
Robert Nishihara
4c6dae5517 Raise an exception in Jenkins tests after a timeout. (#1477) 2018-01-27 20:21:27 -08:00
Robert Nishihara
3195c6aa63 Fix local scheduler crash when driver creates actor and exits. (#1474)
* Make check failures in redis.cc more informative.

* Fix bug by calling task_table_add_task.

* Add test.
2018-01-26 14:29:53 -08:00
Kaahan
7aa979a024 [tune] Added Population Based Training (#1355)
Adds a Population-Based Training (as described in https://arxiv.org/abs/1711.09846) scheduler to Ray.tune. Currently mutates hyperparameters according to either a user-defined list of possible values to mutate to (necessary if hyperparameters can only be certain values ex. sgd_batch_size), or by a factor of 0.8 or 1.2.
2018-01-25 21:38:37 -08:00
Richard Liaw
e5c4d9ea0c
[tune] Fix Trial Logging File name (#1466) 2018-01-25 17:57:40 -08:00
Robert Nishihara
ab5d4a6010 Bring cloudpickle inside the repository. (#1445)
* Bring cloudpickle version 0.5.2 inside the repo.

* Use internal copy of cloudpickle everywhere.

* Fix linting.

* Import ordering.

* Change __init__.py.

* Set pickler in serialization context.

* Don't check ray location.
2018-01-25 11:36:37 -08:00
Eric Liang
173f1d629a
[tune] Ray Tune API cleanup (#1454)
Remove rllib dep: trainable is now a standalone abstract class that can be easily subclassed.

Clean up hyperband: fix debug string and add an example.

Remove YAML api / ScriptRunner: this was never really used.

Move ray.init() out of run_experiments(): This provides greater flexibility and should be less confusing since there isn't an implicit init() done there. Note that this is a breaking API change for tune.
2018-01-24 16:55:17 -08:00
Richard Liaw
a7d544424c [tune] Experiment Management API (#1328)
* init for exposing external interface

* revisions

* http server

* small

* simplify

* ui

* fixes

* test

* nit

* nit

* merge

* untested

* nits

* nit

* init tests

* tests

* more tests

* nit

* fix hyperband

* cleanup

* nits

* good stuff

* cleanup

* comments and need to test

* nit

* notebook

* testing

* test and expose server

* server_tests

* docs

* periods

* fix tests

* committing test

* fi
2018-01-24 13:45:10 -08:00
Eric Liang
1d2a28ab07 [rllib] test all combinations of {obs_space} x {action_space} (#1449) 2018-01-24 11:03:43 -08:00
Robert Nishihara
f32c0c8ec1 Move calls to ray.worker.cleanup into tearDown part of tests for isolation. (#1433) 2018-01-22 22:54:56 -08:00
Devin Petersohn
4aca016bff Adding series and a way to validate our API. (#1435)
* Adding series and a way to validate our API.

* Moving partitions into protected status
2018-01-21 19:20:54 -08:00
Stephanie Wang
74718efa73
Nondeterministic reconstruction for actors (#1344)
* Add failing unit test for nondeterministic reconstruction

* Retry scheduling actor tasks if reassigned to local scheduler

* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction

* Fix bug for updating checkpoint task execution dependencies

* Update comments for deterministic reconstruction

* cleanup

* Add (and skip) failing test case for nondeterministic reconstruction

* Suppress test output
2018-01-21 13:44:13 -08:00
eugenevinitsky
37076a9ff8 Multiagent model using concatenated observations (#1416)
* working multi action distribution and multiagent model

* currently working but the splits arent done in the right place

* added shared models

* added categorical support and mountain car example

* now compatible with generalized advantage estimation

* working multiagent code with discrete and continuous example

* moved reshaper to utils

* code review changes made, ppo action placeholder moved to model catalog, all multiagent code moved out of fcnet

* added examples in

* added PEP8 compliance

* examples are mostly pep8 compliant

* removed all flake errors

* added examples to jenkins tests

* fixed custom options bug

* added lines to let docker file find multiagent tests

* shortened example run length

* corrected nits

* fixed flake errors
2018-01-18 19:51:31 -08:00
Richard Liaw
d4592382a4
[tune][minor] Fixes (#1383) 2018-01-11 18:14:20 -08:00
Philipp Moritz
44792530a9 fix autoscaler test (#1411) 2018-01-10 13:18:34 -08:00
Devin Petersohn
112ef07563 Adding all DataFrame methods with NotImplementedErrors (#1403)
* Adding all DataFrame methods with NotImplementedErrors

* Moving dataframe creation into function call
2018-01-07 12:00:16 -08:00
Eric Liang
b6c42f96be
Auto-scale ray clusters based on GCS load metrics (#1348)
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in a 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
2017-12-31 14:39:57 -08:00
Devin Petersohn
a75a473d7f Add a distributed Dataframe API to Ray (#1330)
* Adding dataframe object and minor APIs

* Adding reduce functionality

* Adding some print and making reduce work on current Ray

* Cleanup

* Added new functionality and docs.

* Adding more functionality.

* New functionality with older cleanup

* Complying with flake8 formatting

* Added tests and addressed reviewer comments

* Complying with flake8.

* Adding pandas to travis and requirements doc

* Fixing flake8 failures

* Fixing flake8 errors from imports

* Fixing import error

* Fixing import errors

* Addressing reviewer comments

* Addressing lint error
2017-12-20 09:31:22 -08:00
Eric Liang
47b1f02d3e [rllib] Pull out multi-gpu optimizer as a generic class (#1313) 2017-12-17 15:59:57 -08:00
Eric Liang
f5ea44338e EC2 cluster setup scripts and initial version of auto-scaler (#1311) 2017-12-15 23:56:39 -08:00
Eric Liang
fbf1806b8a
[tune] Clean up result logging: move out of /tmp, add timestamp (#1297) 2017-12-15 14:19:08 -08:00
Robert Nishihara
f75b51d178 Register Common.error with local scheduler extension module. (#1316)
* Register Common.error with local scheduler extension module.

* Add test.
2017-12-13 11:55:54 -08:00
Peter Schafhalter
20d6b74aa6 [rllib] Added evaluation script to RLLib (#1295) 2017-12-11 11:59:44 -08:00
Robert Nishihara
96463c680c Allow actor methods to return multiple object IDs. (#1296)
* Allow actor methods to return multiple object IDs.

* Add test.

* Fixes

* Remove outdated comment.

* Add comment and assert
2017-12-09 10:37:57 -08:00
Philipp Moritz
26125e1547 Fixing the jenkins tests (#1299)
* trying to fix jenkins tests

* comment out more tests

* remove pytorch stuff

* use non-monotonic clock (monotonic not supported on python 2.7)

* whitespace
2017-12-07 17:03:58 -08:00
Eric Liang
2d543b6e19
[rllib] Refactor DQN to use an Evaluator abstraction (#1276)
This introduces rllib.Evaluator and rllib.Optimizer classes. Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph, and once implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
2017-12-06 17:51:57 -08:00
Robert Nishihara
c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Eric Liang
37831ae0c3 Add a nicer warning message when you pass the wrong thing to ray.wait() (#1239)
* add warnings

* fix python mode

* Small changes and add tests.

* Fix test failure.
2017-11-27 22:57:33 -08:00
Robert Nishihara
2865128df0 Remove counter from run_function_on_all_workers. Also remove utilitie… (#1260)
* Remove counter from run_function_on_all_workers. Also remove utilities for copying directories across machines.

* Fix linting.
2017-11-26 18:29:10 -08:00
Robert Nishihara
0b4961b161 Provide flag for setting redis maxclients. (#1257)
* Add flag for attempting to increase ulimit -n and the redis maxclients.

* Don't bother trying to set ulimit -n.

* Fix linting.

* Add basic test.
2017-11-26 18:25:55 -08:00
Robert Nishihara
7af5292646 Give error if a worker has a version mismatch for Python Ray, or clou… (#1245)
* Give error if a worker has a version mismatch for Python Ray, or cloudpickle.

* Check version when attaching driver to cluster.

* Only do check if the version info is present.

* Bug fix.

* Fix typo.
2017-11-23 23:31:03 -08:00
Robert Nishihara
477a40f76d Prohibit returning actor handles and also update actor documentation. (#1246)
* Prohibit returning actor handles and also update actor documentation.

* Clarify documentation.
2017-11-23 09:37:24 -08:00
shane
9af8dc568a testing with --rm and docker run (#1240)
Add --rm to docker run for Jenkins tests.
2017-11-22 10:20:04 -08:00
Eric Liang
316f9e2bb7 [tune] Support user-defined trainable functions / classes / envs with a shared object registry (#1226) 2017-11-20 17:52:43 -08:00
Eric Liang
9233e496cc Raise exception when getting the task results of workers that died (#1224)
* wip

* with test

* add timeout

* also add test for f

* remove on cleanup

* update

* wip

* fix tests

* mark actor removed in redis

* clang-format

* fix bug when no-inprogress tasks

* try to set task status done

* Add comment.
2017-11-20 15:18:39 -08:00
Eric Liang
28f1e12940 [rllib] [build-fix] ES iterations get unexpectedly long (#1235)
* fix very long es

* Revert prior change.

* Shorten ES jenkins tests.
2017-11-20 14:42:42 -08:00
Robert Nishihara
0eae917766 [rllib] Clean up evolution strategies example. (#1225)
* Remove ES observation statistics.

* Consolidate policy classes.

* Remove random stream.

* Move rollout function out of policy.

* Consolidate policy initialization.

* Replace act implementation with sess.run.

* Remove tf_utils.

* Remove variable scope.

* Remove unused imports.

* Use regular TF session.

* Use MeanStdFilter.

* Minor.

* Clarify naming.

* Update documentation.

* eps -> episodes

* Report noiseless evaluation runs.

* Clean up naming.

* Update documentation.

* Fix some bugs.

* Make it run on atari.

* Don't add action noise during evaluation runs.

* Add ES to checkpoint/restore test.

* Small cleanups and remove redundant calls to get_weights.

* Remove outdated comment.
2017-11-16 21:58:30 -08:00
Richard Liaw
eadb998643
[tune] Make HyperBand Usable (#1215) 2017-11-16 10:31:42 -08:00
Richard Liaw
71f8cd2403
[tune] Fixing up Hyperband (#1207)
* Fixing up Hyperband

* nit

* cleanup

* Timing test Added

* added_exception_back

* fixup_tests

* reverse placement

* fixes_and_tests

* fix

* fix

* fixlint

* cleanup_timing

* lint

* Update hyperband.py
2017-11-12 12:05:32 -08:00
Eric Liang
7c38f964b7 [tune] Add command line support for choosing early stopping schedulers (#1209)
* command line support

* add checkpoint freq

* fix other flags

* fix

* docs

* doc
2017-11-12 12:05:18 -08:00
Richard Liaw
afdc87323f
[rllib] PyTorch Models for A3C (#1187)
* fixing policy

* Compute Action is singular, fixed weird issue with arrays

* remove vestige

* extraneous ipdb

* Can Drop in Pytorch Model

* lint

* introducing models

* fix base policy

* Missed this from last time

* lint

* removedolds

* getting vision working

* LINT

* trying to fix test dependencies

* requiremnets

* try

* tryconda

* yes

* shutup

* flake_passes

* changes

* removing weight initializer for lstm for now

* unused

* adam

* clip

* zero

* properscaling

* weight

* try

* fix up pytorch visionnet

* bias correction

* fix model

* same visionnet

* matching_bad_things

* test

* try locking

* fixing_linear

* naming

* lint

* FORJENKINS

* clouds

* lint

* Lint + removed dependencies

* removed dependencies

* format
2017-11-12 00:20:33 -08:00
Daniel Suo
4f0da6f81c Add basic functionality for Cython functions and actors (#1193)
* Add basic functionality for Cython functions and actors

* Fix up per @pcmoritz comments

* Fixes per @richardliaw comments

* Fixes per @robertnishihara comments

* Forgot double quotes when updating masked_log

* Remove import typing for Python 2 compatibility
2017-11-09 17:49:06 -08:00
Richard Liaw
6197b260b8 Fix Jenkins issue introduced by Variant Generator (#1194)
* try fix

* shorten

* added a flag

* finish

* Fix linting.
2017-11-09 00:56:20 -08:00
Eric Liang
52888e4c6f [tune] Improve the tune Python API and variant generation (#1154)
* new variant gen

* wip

* Sat Oct 21 18:21:34 PDT 2017

* update

* comment

* fix

* update

* update readme

* fix

* Update README.rst

* Update README.rst

* fix repeat

* update

* note on restore
2017-11-06 23:41:17 -08:00
Richard Liaw
6222ec3bd7
[tune] hyperband (#1156)
* trial scheduler interface

* remove

* wip median stopping

* remove

* median stopping rule

* update

* docs

* update

* Revrt

* update

* hyperband untested

* small changes before moving on

* added endpoints

* good changes

* init tests

* smore tests

* unfinished tests

* testing

* testing code

* morbugs

* fixes

* end

* tests and typo

* nit

* try this

* tests

* testing

* lint

* lint

* lint

* comments and docs

* almost screwed up

* lint
2017-11-06 22:30:25 -08:00
Eric Liang
d06beacd84 [tune] Implement median stopping rule (#1170)
* trial scheduler interface

* remove

* wip median stopping

* remove

* median stopping rule

* update

* docs

* update

* Revrt

* update

* comments

* fix tesT
2017-11-03 11:25:02 -07:00
Robert Nishihara
3317d38278 Replace hostnames with numerical IP addresses in redis address. (#1177)
* Replace hostnames with numerical IP addresses in redis address.

* Also do conversion for node_ip_address. Add test.

* Simplifications.
2017-11-01 17:13:22 -07:00
Robert Nishihara
6852e8839e Expose custom serializers through the API. (#1147)
* Expose custom serializers through the API.

* minor renaming

* Add test.

* Remove comment.

* Clean up assertions.
2017-10-29 00:08:55 -07:00
Richard Liaw
797f4fcbf3 Fixing Lint after flake upgrade (#1162)
* Fixing Lint after flake upgrade

* more lint fixes
2017-10-26 21:02:07 -05:00
Eric Liang
cd9dc398ff [rllib] Support discrete observation spaces such as FrozenLake-v0 (#1140)
* add

* remove transform_shape

* fix test

* fix
2017-10-23 23:16:52 -07:00