Commit graph

971 commits

Author SHA1 Message Date
Eric Liang
bee743c152
Remove log suppression code
When running in a screen (or any other time it is hard to scroll up), printing "Suppressing previous error message" is not helpful since the previous error is lost far above past scrollback. Better to just print it repeatedly at the end.
2018-09-11 23:28:45 -07:00
Kaahan
045861c9b0 [tune] Reset Config for Trainables (#2831)
Adds the ability for trainables to reset their configurations during experiments. These changes in particular add the base functions to the trial_executor and trainable interfaces as well as giving the basic implementation on the PopulationBasedTraining scheduler.

Related issue number: #2741
2018-09-11 08:45:04 -07:00
Peter Schafhalter
5da6e78db1 Add available resources to global state (#2501) 2018-09-10 15:46:32 -07:00
Eric Liang
611259b2c7 Re-raise actor initialization errors on method invocation (#2843)
If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.
2018-09-10 10:51:19 -07:00
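The behavior described in 611259b2c7 (save a constructor failure, then re-raise it on later calls) can be sketched in plain Python. This is only an illustration: `ActorProxy`, `call`, `Broken`, and `Healthy` are hypothetical names, not Ray's implementation or API.

```python
class ActorProxy:
    """Hypothetical wrapper that remembers a constructor failure."""

    def __init__(self, cls, *args, **kwargs):
        self._instance = None
        self._init_error = None
        try:
            self._instance = cls(*args, **kwargs)
        except Exception as exc:
            # Keep the original error so later calls can report it.
            self._init_error = exc

    def call(self, method_name, *args, **kwargs):
        if self._init_error is not None:
            # Re-raise on every interaction, chaining the saved cause.
            raise RuntimeError("actor constructor failed") from self._init_error
        return getattr(self._instance, method_name)(*args, **kwargs)


class Broken:
    def __init__(self):
        raise ValueError("bad config")

    def ping(self):
        return "pong"


class Healthy:
    def ping(self):
        return "pong"
```

With this sketch, `ActorProxy(Healthy).call("ping")` returns `"pong"`, while any call on `ActorProxy(Broken)` raises a `RuntimeError` whose `__cause__` is the original `ValueError`.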
Eric Liang
588c573d41 Ray stop needs to kill plasma_store_server not plasma_store (#2850) 2018-09-09 19:23:09 -07:00
eugenevinitsky
9ba751c29a Ars increase (#2844)
* removed cv2

* remove opencv

* increased number of default rollouts ARS

* put cv2 back in this branch

* put cv2 back in this branch

* moved cv2 back where it belongs in preprocessors
2018-09-08 14:09:02 -07:00
Robert Nishihara
bd64c940e9 Push error to driver when monitor raises an exception. (#2834) 2018-09-07 17:42:45 -07:00
Robert Nishihara
3f6ed537a4 Add ray.is_initialized() function. (#2818)
* Add ray.is_initialized() function.

* Add assert.
2018-09-06 21:20:59 -07:00
Eric Liang
e7db54bdb0 Log at INFO level by default (including in autoscaler). (#2824)
Before this change, the autoscaler `up` and related commands did not print any info messages to the console at all. This was a regression from 0.5. @richardliaw @robertnishihara https://github.com/ray-project/ray/issues/2812
2018-09-06 13:31:19 -07:00
Eric Liang
d81605e9e7
[tune] Add a time/timesteps since last restore metric (#2819)
* rsm

* always log to avoid changing schema for csv writer

* add iter since restore

* update

* criteria warn
2018-09-05 17:45:09 -07:00
Eric Liang
995ac24a2c
[rllib] clarify train batch size for PPO (#2793)
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
2018-09-05 12:06:13 -07:00
kary
4c0e2c3f58 [rllib]multi agent judge bug (#2821)
* fix multi agent judge bug

* Update policy_evaluator.py
2018-09-04 21:02:06 -07:00
Richard Liaw
72542c9016 [tune] Fix Pausing and Error Propagation (#2815)
* add new tests

* Try-catch errors from ray get

* longer pbt run

* Update pbt_example.py

* Split trial and result and fix tests
2018-09-04 15:22:11 -07:00
Eric Liang
25ffe57a5c
[rllib] Auto-synchronize filters for all agents (#2791)
This makes sure we always update the local filter, and adds an option to synchronize the remote filters as well. In APEX_DDPG we previously didn't do either. The first is needed for checkpoint correctness, the second might help performance.
2018-09-03 20:01:53 -07:00
Eric Liang
01b030bd57
[rllib] throw an error for continuous action spaces in IMPALA
We currently don't support this since the reference vtrace.py does not, though it could be an interesting extension.
2018-09-03 11:12:55 -07:00
Eric Liang
df4788e501
[rllib/tune] Add test for fractional gpu support in xray mode; add rllib support for fractional gpu (#2768)
* frac gpu

* doc

* Update rllib-training.rst

* yapf

* remove xray
2018-09-03 11:12:23 -07:00
Eric Liang
b37a283053 [rllib] support local mode (#2795) 2018-09-02 23:02:19 -07:00
Robert Nishihara
0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00
Alexey Tumanov
fdc9688226 [xray] push warning to driver for infeasible tasks (#2784)
This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.
2018-09-01 13:21:27 -07:00
Robert Nishihara
eda6ebb87d Convert some unittests to pytest. (#2779)
* Convert multi_node_test.py to pytest.

* Convert array_test.py to pytest.

* Convert failure_test.py to pytest.

* Convert microbenchmarks to pytest.

* Convert component_failures_test.py to pytest and some minor quotes changes.

* Convert tensorflow_test.py to pytest.

* Convert actor_test.py to pytest.

* Fix.

* Fix
2018-08-31 11:24:15 -07:00
wangyiguang
3813ae34b3 [tune] Add AutoMLBoard: Monitoring UI (experimental) (#2574) 2018-08-31 00:26:44 -07:00
Richard Liaw
0347e6418b
[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708) 2018-08-30 16:18:56 -07:00
Robert Nishihara
224d38cbb2 Name Python threads. (#2767) 2018-08-30 11:08:24 -07:00
Robert Nishihara
5021795190 Update documents to replace 0.5.0 with 0.5.2. (#2761)
* Update documents to replace 0.5.0 with 0.5.1.

* Update documentation from 0.5.1 -> 0.5.2.
2018-08-29 21:05:09 -07:00
Robert Nishihara
f4f3478b45 Bump version number to 0.5.2. (#2765) 2018-08-29 13:39:25 -07:00
Praveen Palanisamy
357c0d6156 [tune] Adds option to checkpoint at end of trials (#2754)
* Added checkpoint_at_end option. To fix #2740

* Added ability to checkpoint at the end of trials if the option is set to True

* checkpoint_at_end option added; Consistent with Experience and Trial runner

* checkpoint_at_end option mentioned in the tune usage guide

* Moved the redundant checkpoint criteria check out of the if-elif

* Added note that checkpoint_at_end is enabled only when checkpoint_freq is not 0

* Added test case for checkpoint_at_end

* Made checkpoint_at_end have an effect regardless of checkpoint_freq

* Removed comment from the test case

* Fixed the indentation

* Fixed pep8 E231

* Handled cases when trainable does not have _save implemented

* Constrained test case to a particular exp using the MockAgent

* Revert "Constrained test case to a particular exp using the MockAgent"

This reverts commit e965a9358ec7859b99a3aabb681286d6ba3c3906.

* Revert "Handled cases when trainable does not have _save implemented"

This reverts commit 0f5382f996ff0cbf3d054742db866c33494d173a.

* Simpler test case for checkpoint_at_end

* Preserved bools from losing their actual value

* Revert "Moved the redundant checkpoint criteria check out of the if-elif"

This reverts commit 783005122902240b0ee177e9e206e397356af9c5.

* Fix linting error.
2018-08-29 13:14:17 -07:00
Robert Nishihara
132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
2018-08-29 14:53:40 +08:00
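The idea in 132f133214 (cap how many workers start at once by the hardware concurrency, treat a reported value of 0 as unknown, and never let the cap drop below 1) can be sketched as follows. The function names and the use of `os.cpu_count()` are assumptions for illustration; the actual change lives in the C++ raylet and is not reproduced here.

```python
import os


def startup_concurrency_limit(hardware_concurrency=None):
    """Return a cap on simultaneously starting workers (hypothetical helper)."""
    if hardware_concurrency is None:
        # os.cpu_count() may return None when the count is unknown.
        hardware_concurrency = os.cpu_count() or 0
    # A value of 0 means "unknown"; the cap must be at least 1.
    return max(1, hardware_concurrency)


def workers_to_start_now(pending, already_starting, limit):
    """How many pending workers may be launched without exceeding the cap."""
    return max(0, min(pending, limit - already_starting))
```

For example, with a cap of 8 and 3 workers already starting, at most 5 of 10 pending workers would be launched in this sketch.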
Mitar
3850e3ba64 Added extra logging related arguments to "ray start" (#2664) 2018-08-28 23:00:37 -07:00
Eric Liang
69d1354016
[rllib] Document ARS & rainbow (#2744)
* wip

* rainbow doc too

* e not used

* fix ppo doc

* clean list

* use same title
2018-08-28 18:13:36 -07:00
Robert Nishihara
6e1de19cc2 Bump version to 0.5.1. (#2755) 2018-08-28 16:52:17 -07:00
Robert Nishihara
b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing  heterogeneity-aware late-binding/work-stealing.
2018-08-28 00:03:34 -07:00
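The notion of load introduced in de047daea7 (the sum of resource demand over queued ready tasks) and a capacity-based, load-aware placement choice can be sketched as below. The data layout, the function names, and the "most spare resources" tie-breaking rule are assumptions made for illustration; they are not the raylet's actual policy.

```python
from collections import Counter


def compute_load(queued_ready_tasks):
    """Sum per-resource demand over queued ready tasks."""
    load = Counter()
    for demand in queued_ready_tasks:
        load.update(demand)  # adds each resource quantity
    return dict(load)


def fits(demand, resources):
    """True if every requested resource is covered."""
    return all(resources.get(name, 0) >= qty for name, qty in demand.items())


def pick_node(demand, nodes):
    """Pick a node for a task, or None if the task is infeasible everywhere.

    nodes maps a node name to {"total": ..., "available": ..., "load": ...}.
    """
    # Capacity-based: only nodes whose total capacity can ever run the task.
    candidates = [n for n, info in nodes.items() if fits(demand, info["total"])]
    if not candidates:
        return None

    # Load-aware: prefer the node with the most spare requested resources.
    def spare(name):
        info = nodes[name]
        return sum(
            info["available"].get(r, 0) - info["load"].get(r, 0) for r in demand
        )

    return max(candidates, key=spare)
```

Because load is measured in resource units rather than task counts, one queued task that needs 4 CPUs counts as much as four queued tasks that need 1 CPU each.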
adoda
90ae8f11df The function get_node_ip_address while catch an exception and return … (#2722)
The function get_node_ip_address will catch an exception and return '127.0.0.1' when we forbid the external network. Instead, we can get the IP address from the hostname.

https://github.com/ray-project/ray/issues/2721
2018-08-27 22:24:49 -07:00
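The fallback described in 90ae8f11df can be sketched with the standard `socket` module. The function name, the probe address, and the final `127.0.0.1` fallback are assumptions for this example and may differ from what the commit actually does.

```python
import socket


def node_ip_with_hostname_fallback(probe_address="8.8.8.8:53"):
    """Best-effort lookup of this node's IP address (hypothetical helper)."""
    host, port = probe_address.split(":")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect((host, int(port)))
        return sock.getsockname()[0]
    except OSError:
        # External network unavailable: resolve our own hostname instead.
        try:
            return socket.gethostbyname(socket.gethostname())
        except OSError:
            # Last resort when the hostname cannot be resolved either.
            return "127.0.0.1"
    finally:
        sock.close()
```

On a machine without external connectivity, this sketch returns the address the hostname resolves to rather than going straight to the loopback address.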
Yuhong Guo
0b6e08ebee Separate python logger module-wise (#2703)
## What do these changes do?
1. Separate the log related code to logger.py from services.py.
2. Allow users to modify logging formatter in `ray start`.

## Related issue number
https://github.com/ray-project/ray/pull/2664
2018-08-26 13:46:14 -07:00
Richard Liaw
dbba7f2a53
[autoscaler] Cleanup Logging (#2709)
Moves Autoscaler onto Python `logging` module.
2018-08-25 17:08:45 -07:00
Jones Wong
982cde664f [rllib] Add noisy network and distributional Q-learning to implement Rainbow (#2737)
*  add noisy network

*  distributional q-learning in dev

*  add distributional q-learning

*  validated rainbow module

*  add some comments

*  supply some comments

*  remove redundant argument to pass CI test

*  async replay optimizer does NOT need annealing beta

*  ignore rainbow specific arguments for DDPG and Apex

*  formatted by yapf

* Update dqn_policy_graph.py

* Update dqn_policy_graph.py
2018-08-25 14:17:14 -07:00
eugenevinitsky
6201a6d1c7 [rllib] add augmented random search (#2714)
* added ars

* functioning ars with regression test

* added regression tests for ARs

* fixed default config for ARS

* ARS code runs, now time to test

* ARS working and tested, changed std deviation of meanstd filter to initialize to 1

* ARS working and tested, changed std deviation of meanstd filter to initialize to 1

* pep8 fixes

* removed unused linear model

* address comments

* more fixing comments

* post yapf

* fixed support failure

* Update LICENSE

* Update policies.py

* Update test_supported_spaces.py

* Update policies.py

* Update LICENSE

* Update test_supported_spaces.py

* Update policies.py

* Update policies.py

* Update filter.py
2018-08-24 22:20:02 -07:00
Michael Tu
d16b6f6a32 [tune] Rename 'repeat' to 'num_samples' (#2698)
Deprecates the `repeat` argument and introduces `num_samples`. Also updates docs accordingly.
2018-08-24 15:05:24 -07:00
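A rename like the one in d16b6f6a32 is commonly handled by accepting the old argument for a while and emitting a deprecation warning. The helper below is a generic sketch of that pattern. `resolve_num_samples`, the default of 1, and the rule that `num_samples` wins when both are given are assumptions, not Tune's code.

```python
import warnings


def resolve_num_samples(num_samples=None, repeat=None):
    """Map the deprecated `repeat` argument onto `num_samples`."""
    if repeat is not None:
        warnings.warn(
            "`repeat` is deprecated; use `num_samples` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if num_samples is None:
            num_samples = repeat
    # Assumed default for this sketch: one sample.
    return 1 if num_samples is None else num_samples
```

Callers that still pass `repeat=3` keep working and get 3 samples, but they see a `DeprecationWarning` pointing at their own call site.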
Philipp Moritz
b4c47a5861 Upgrade arrow to include more detailed flushing message (#2706) 2018-08-24 11:44:04 -07:00
Eric Liang
aa014af85b
[rllib] Fix atari reward calculations, add LR annealing, explained var stat for A2C / impala (#2700)
Changes needed to reproduce Atari plots in IMPALA / A2C: https://github.com/ray-project/rl-experiments
2018-08-23 17:49:10 -07:00
old-bear
4be324efc3 [tune] Support infinity value in report result (#2693)
* + Compatibility fix under py2 on ray.tune

* + Revert changes on master branch

* + Use default JsonEncoder in ray.tune.logger

* + Add UT for infinity support
2018-08-22 13:09:14 -07:00
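As background for 4be324efc3: Python's standard `json` module serializes infinite floats as the non-standard tokens `Infinity` and `-Infinity` by default and parses them back. The result keys below are made up for illustration, and this is standard-library behavior, not the code in `ray.tune.logger`.

```python
import json

# Hypothetical reported result containing infinite values.
result = {"mean_loss": float("inf"), "neg_reward": float("-inf")}

# The default encoder (allow_nan=True) emits Infinity / -Infinity.
encoded = json.dumps(result)
decoded = json.loads(encoded)


def is_strict_json_safe(obj):
    """True if obj can be encoded as strictly standard JSON."""
    try:
        json.dumps(obj, allow_nan=False)
        return True
    except ValueError:
        # Raised for inf, -inf, and NaN when allow_nan is False.
        return False
```

The output is readable by Python but not by strict JSON parsers, which is why `is_strict_json_safe(result)` is `False` here.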
joyyoj
38867eea4e [tune] Cross-Framework Compatibility (#2646)
This commit is a first pass at restructuring the Trial execution logic to support running on multiple frameworks.
2018-08-22 10:55:45 -07:00
Eric Liang
fbe6c59f72
[rllib] Misc fixes, A2C (#2679)
A bunch of minor rllib fixes:

* pull in latest baselines atari wrapper changes (and use deepmind wrapper by default)
* move reward clipping to policy evaluator
* add a2c variant of a3c
* reduce vision network fc layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in tune status
2018-08-20 15:28:03 -07:00
Yucong He
880ef1bd21 doc fix (#2696) 2018-08-20 14:11:32 -07:00
Robert Nishihara
89d4a6df93 Start Redis in protected mode when started via ray.init(). (#2697)
This PR makes it so that when Ray is started via ray.init() (as opposed to via ray start) the Redis servers will be started in "protected mode" (which means that clients can only connect by connecting to localhost).

In practice, we actually connect redis clients by passing in the node IP address (not localhost), so I need to create a redis config file on the fly to allow both localhost and the node's actual IP address (it would have been nice to find a way to do this from the Python redis client, but I couldn't find one).
2018-08-20 14:08:01 -07:00
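Generating a Redis config file on the fly, as 89d4a6df93 describes, could look roughly like the sketch below. `port`, `protected-mode`, and `bind` are real Redis configuration directives, but the helper name and the assumption that the generated file uses exactly these lines are illustrative only.

```python
def render_redis_config(node_ip_address, port=6379):
    """Build Redis config text (hypothetical helper, not Ray's code)."""
    lines = [
        "port {}".format(port),
        # Keep protected mode switched on.
        "protected-mode yes",
        # Accept connections on localhost and on the node's own address.
        "bind 127.0.0.1 {}".format(node_ip_address),
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be written to a temporary file and that path passed to `redis-server` as its config file argument.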
old-bear
230ac7aa80 [tune] Compatibility fix under py2 on str condition (#2673)
* * Compatibility fix under py2 on ray.tune

* + Fix compatibility

* + Use package six to achieve str compatibility
2018-08-19 20:43:03 -07:00
Eric Liang
9473da69bd
[autoscaler] Experimental support for local / on-prem clusters (#2678)
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head ip, and the list of worker ips.

There are also a couple additional utils added for rsyncing files and port-forward.
2018-08-19 12:43:04 -07:00
Richard Liaw
62d0698097
[tune] Tune Facelift (#2472)
This PR introduces the following changes:

 * Ray Tune -> Tune 
 * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule
 * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather than through initialization
 * Support `"run": (function | class | str)` with automatic registering of trainable
 * Documentation Changes
2018-08-19 11:00:55 -07:00
Eric Liang
e56eb354eb
[tune] Remove hack to serve pin requests off thread (#2680)
* nopin

* fix
2018-08-18 13:19:52 -07:00
Wang Qing
06a58016d8 [multi-language part 2] Change the command line arguments to start raylet (#2670) 2018-08-16 21:59:44 -07:00