Commit graph

2019 commits

Author SHA1 Message Date
Eric Liang
3267676994 [Experimental] Add experimental distributed SGD API (#2858)
* check in sgd api

* idx

* foreach_worker foreach_model

* add feed_dict

* update

* yapf

* typo

* lint

* plasma op change

* fix plasma op

* still not working

* fix

* fix

* comments

* yapf

* silly flake8

* small test
2018-09-19 21:12:37 -07:00
Praveen Palanisamy
b23fd5de13 [rllib] Adds agent name & env id to default logdir prefix (#2859)
* Added agent name & env id to default logdir prefix

* Revert "Added agent name & env id to default logdir prefix"

This reverts commit 07cfdf80d2537da3c67dd4f553c5f3e43671cc7d.

* Added default logger creator with informative prefix to Agent

* Updated import order & improved str cat

* Update agent.py
2018-09-18 22:22:07 -07:00
Eric Liang
3a3782c39f [rllib] Fix LSTM regression on truncated sequences and add regression test (#2898)
* fix

* add test

* yapf

* yapf

* fix space

* Oops that should be lstm: True

* Update cartpole_lstm.py
2018-09-18 15:09:16 -07:00
Eric Liang
ab8348b1f5 [rllib] Reward clipping should default to off 2018-09-18 15:08:01 -07:00
Hao Chen
715ec1bca5 Modularize NodeManager::ProcessClientMessage (#2895)
Split NodeManager::ProcessClientMessage into a couple of smaller functions, each of which handles one type of message.
2018-09-18 14:18:34 -07:00
Robert Nishihara
ea9d1cc887 Remove dependence on psutil. Add utility functions for getting system memory. (#2892) 2018-09-18 15:03:29 +08:00
Robert Nishihara
61bf6c6123 Fix regression in directing worker output to stdout/stderr. (#2897) 2018-09-17 16:40:45 -07:00
Richard Liaw
899e4585bc Don't include redundant entries in global_state.client_table (#2880) 2018-09-17 12:52:49 -07:00
Hanwei Jin
dc76e51a60 bugfix: cmake copy plasma java lib from lib64 directory in centos (#2885) 2018-09-16 22:32:09 -07:00
Richard Liaw
f372f48bf3 [tune] Tune onto Logging Module (#2882)
Moves Tune onto logging in Python. Ignores examples and tests.
2018-09-16 12:09:36 -07:00
Yuhong Guo
a8248e8628 Fix ObjectManager Crash (#2833)
Fixes issue where object manager sometimes crashes within the `Wait` method: The issue stems from inconsistent behavior of the boost deadline timer's `cancel` method, which is invoked within `WaitComplete` to enforce exactly one `WaitComplete` invocation for each `Wait` request. The `cancel` method sometimes fails to actually prevent the timer's invocation of the provided handler with non-zero error code.
2018-09-16 02:14:13 -04:00
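The exactly-once requirement described in a8248e8628 can be illustrated with a small guard. This is a Python analogy only, not the C++ fix from the PR: `WaitRequest`, `complete`, and the `reason` strings are hypothetical names, and `threading.Timer` stands in for the boost deadline timer. The idea is that the completion path itself checks a flag, so it does not matter whether cancelling the timer actually stopped the handler.

```python
import threading


class WaitRequest:
    """Hypothetical sketch: run on_complete exactly once per wait request."""

    def __init__(self, timeout_s, on_complete):
        self._on_complete = on_complete
        self._lock = threading.Lock()
        self._done = False
        # The timer handler and the "objects ready" path both call complete().
        self._timer = threading.Timer(timeout_s, self.complete, args=("timeout",))
        self._timer.start()

    def complete(self, reason):
        # Do not rely on cancel() alone: the handler may already be running.
        with self._lock:
            if self._done:
                return False
            self._done = True
        self._timer.cancel()
        self._on_complete(reason)
        return True
```

A second call to `complete`, from either path, returns `False` without invoking the callback again.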
Philipp Moritz
47d2f82c6c Fix common cmake dependencies (#2876) 2018-09-15 22:11:12 -07:00
Robert Nishihara
503344149f Run jupyter UI with --ip=0.0.0.0. (#2883) 2018-09-15 21:59:46 -07:00
Richard Liaw
e05baed336 [tune] Better Info String and Tweaks (#2874) 2018-09-15 11:02:13 -07:00
Hao Chen
e96817d074 fix a syntax error of initializing unordered_map (#2871)
The previous syntax is incompatible with older versions of gcc.
2018-09-14 12:07:08 -07:00
Philipp Moritz
2c9a4f6b41 Evaluate debug logging only in debug mode (#2869)
This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).
2018-09-14 11:40:44 -07:00
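The problem described in 2c9a4f6b41 (arguments to a debug log call being evaluated even when nothing is printed) has a direct analogue in Python. The sketch below is an illustration in Python rather than the C++ change from the PR; `expensive_summary` and `log_state` are hypothetical names. Guarding the call site with `Logger.isEnabledFor` means the costly function runs only when debug logging is active.

```python
import logging

logger = logging.getLogger("debug_eval_sketch")
call_count = 0


def expensive_summary():
    """Stand-in for a costly call that should only run when debugging."""
    global call_count
    call_count += 1
    return "state dump"


def log_state():
    # The argument is computed only if DEBUG is enabled for this logger.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("state: %s", expensive_summary())


logger.setLevel(logging.INFO)
log_state()  # expensive_summary() is skipped
logger.setLevel(logging.DEBUG)
log_state()  # expensive_summary() runs
```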
Robert Nishihara
f16d33593b Mark worker as blocked and trigger reconstruction in ray.wait. (#2864)
* Trigger reconstruction in ray.wait and mark worker as blocked.

* Add test.

* Linting.

* Don't run new test with legacy Ray.

* Only call HandleClientUnblocked if it actually blocked in ray.wait.

* Reduce time to ray.wait in the test.
2018-09-13 15:28:17 -07:00
Joerg Schad
a1b8e79c30 Fixed Typo. (#2865) 2018-09-13 13:32:56 +08:00
Hanwei Jin
fbf214e408 update ray cmake build process (#2853)
* use cmake to build the ray project; no need to apply build.sh before cmake; fix some misuse of cmake; improve build performance

* support boost external project, avoid using the system or build.sh boost

* keep compatible with build.sh, remove boost and arrow build from it.

* bugfix: parquet bison version control, plasma_java lib install problem

* bugfix: cmake, do not compile plasma java client if no need

* bugfix: the component failures test timeout mechanism has a problem in the plasma manager failure case

* bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master

* revert some fix

* set arrow python executable, fix format error in component_failures_test.py

* make clean arrow python build directory

* update cmake code style, back to support cmake minimum version 3.4
2018-09-12 11:19:33 -07:00
Daniel Ho
d9eeaaf00a [tune] Fix bug in example where config hyperparameters were ignored (#2860)
A fix to an example for tune (`python/ray/tune/examples/pbt_tune_cifar10_with_keras.py`) where the hyperparameters for the optimizer, learning rate and decay, were not being passed into the optimizer. 

This means that the current optimizer uses default values for the hyperparameters no matter the config.
2018-09-12 09:17:56 -07:00
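The bug pattern fixed in d9eeaaf00a can be sketched without Keras. Everything below is hypothetical and for illustration only: `Optimizer` is a stand-in dataclass with made-up defaults, and the `lr` and `decay` config keys are assumed names, not taken from the example file.

```python
from dataclasses import dataclass


@dataclass
class Optimizer:
    """Stand-in optimizer; the default values are illustrative only."""
    lr: float = 0.01
    decay: float = 0.0


def build_optimizer_buggy(config):
    # Bug pattern: config is accepted but never used, so defaults always win.
    return Optimizer()


def build_optimizer_fixed(config):
    # Fix pattern: forward the tuned values, falling back to defaults if absent.
    defaults = Optimizer()
    return Optimizer(
        lr=config.get("lr", defaults.lr),
        decay=config.get("decay", defaults.decay),
    )
```

With the buggy builder, every trial trains with the same optimizer settings, so a hyperparameter search over them cannot change the outcome.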
old-bear
f3c1194be3 [tune] Add AutoML algorithm of GeneticSearcher (#2699)
Add a new search algorithm (genetic) along with the base framework of the searcher, which handles basic jobs such as logging, recording, and organizing in our project.
Note that this is the initial commit. In the following days, we will add examples, unit tests, and other refinements.
2018-09-12 09:17:04 -07:00
Eric Liang
bee743c152 Remove log suppression code
When running in a screen (or any other time it is hard to scroll up), printing "Suppressing previous error message" is not helpful since the previous error is lost far above past scrollback. Better to just print it repeatedly at the end.
2018-09-11 23:28:45 -07:00
Kaahan
045861c9b0 [tune] Reset Config for Trainables (#2831)
Adds the ability for trainables to reset their configurations during experiments. These changes add the base functions to the trial_executor and trainable interfaces and provide a basic implementation for the PopulationBasedTraining scheduler.

Related issue number: #2741
2018-09-11 08:45:04 -07:00
Peter Schafhalter
5da6e78db1 Add available resources to global state (#2501) 2018-09-10 15:46:32 -07:00
Eric Liang
611259b2c7 Re-raise actor initialization errors on method invocation (#2843)
If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.
2018-09-10 10:51:19 -07:00
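The behavior described in 611259b2c7 (save a constructor failure, then re-raise it on later interactions) can be sketched in plain Python. This is a minimal illustration, not Ray's implementation: `ActorShell`, `call`, and `Broken` are hypothetical names.

```python
class ActorShell:
    """Hypothetical wrapper: a failed constructor is remembered and re-raised."""

    def __init__(self, cls, *args, **kwargs):
        self._instance = None
        self._init_error = None
        try:
            self._instance = cls(*args, **kwargs)
        except Exception as exc:
            # Keep the original error so later calls can report the real cause.
            self._init_error = exc

    def call(self, method_name, *args, **kwargs):
        if self._init_error is not None:
            raise RuntimeError("actor constructor failed") from self._init_error
        return getattr(self._instance, method_name)(*args, **kwargs)


class Broken:
    def __init__(self):
        raise ValueError("bad config")

    def ping(self):
        return "pong"
```

Calling `ActorShell(Broken).call("ping")` raises a `RuntimeError` whose `__cause__` is the original `ValueError`, so the caller sees why the actor is unusable instead of an unrelated failure.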
Hao Chen
8414e413a2 [java] refine and simplify java worker code structure (#2838) 2018-09-10 10:48:17 -07:00
Eric Liang
588c573d41 Ray stop needs to kill plasma_store_server not plasma_store (#2850) 2018-09-09 19:23:09 -07:00
Richard Liaw
af1fdc826e Pin YAPF in Travis lint build (#2848)
Avoid needing to reformat everything all the time.
2018-09-09 15:54:46 -07:00
eugenevinitsky
9ba751c29a Ars increase (#2844)
* removed cv2

* remove opencv

* increased number of default rollouts ARS

* put cv2 back in this branch

* put cv2 back in this branch

* moved cv2 back where it belongs in preprocessors
2018-09-08 14:09:02 -07:00
Robert Nishihara
bd64c940e9 Push error to driver when monitor raises an exception. (#2834) 2018-09-07 17:42:45 -07:00
Zhijun Fu
753ba76141 [Issue 2809][xray] Cleanup on driver detach (#2826)
This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass.

The following should happen when a driver exits (either gracefully or ungracefully).

#2797 should be enabled and pass.
Any actors created by the driver that are still running should be killed.
Any workers running tasks for the driver should be killed.
Any tasks for the driver in any node_manager queues should be removed.
Any future tasks received by a node manager for the driver should be ignored.
The driver death notification should only be received once.
2018-09-07 16:11:32 +08:00
Robert Nishihara
3f6ed537a4 Add ray.is_initialized() function. (#2818)
* Add ray.is_initialized() function.

* Add assert.
2018-09-06 21:20:59 -07:00
Eric Liang
e7db54bdb0 Log at INFO level by default (including in autoscaler). (#2824)
Before this change, the autoscaler `up` and related commands did not print any info messages to the console at all. This was a regression from 0.5. @richardliaw @robertnishihara https://github.com/ray-project/ray/issues/2812
2018-09-06 13:31:19 -07:00
Wang Qing
7e13e1fd49 [Java] Remove non-raylet code in Java. (#2828) 2018-09-06 14:54:13 +08:00
Eric Liang
d81605e9e7 [tune] Add a time/timesteps since last restore metric (#2819)
* rsm

* always log to avoid changing schema for csv writer

* add iter since restore

* update

* criteria warn
2018-09-05 17:45:09 -07:00
Eric Liang
995ac24a2c [rllib] clarify train batch size for PPO (#2793)
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
2018-09-05 12:06:13 -07:00
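The warn-then-fail policy described in 995ac24a2c can be sketched as a simple check. This is a simplified model under stated assumptions, not RLlib's code: the function name, the notion of "collected versus used" samples, and both thresholds (`warn_fraction`, `error_fraction`) are hypothetical.

```python
import warnings


def check_sample_waste(samples_collected, train_batch_size,
                       warn_fraction=0.5, error_fraction=0.9):
    """Warn or fail when a large share of collected samples is discarded.

    warn_fraction and error_fraction are hypothetical thresholds.
    """
    if samples_collected <= 0:
        raise ValueError("samples_collected must be positive")
    wasted = max(samples_collected - train_batch_size, 0)
    fraction = wasted / samples_collected
    if fraction >= error_fraction:
        raise ValueError(
            "%.0f%% of collected samples would be discarded" % (100 * fraction))
    if fraction >= warn_fraction:
        warnings.warn(
            "%.0f%% of collected samples would be discarded" % (100 * fraction))
    return fraction
```

Separating the two thresholds keeps mildly wasteful configurations usable while stopping configurations that throw away nearly everything.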
Wang Qing
c87a9114cd Change the version number of Miniconda3. (#2829)
Change version number of Miniconda.
2018-09-05 12:05:04 -07:00
kary
4c0e2c3f58 [rllib]multi agent judge bug (#2821)
* fix multi agent judge bug

* Update policy_evaluator.py
2018-09-04 21:02:06 -07:00
Richard Liaw
72542c9016 [tune] Fix Pausing and Error Propagation (#2815)
* add new tests

* Try-catch errors from ray get

* longer pbt run

* Update pbt_example.py

* Split trial and result and fix tests
2018-09-04 15:22:11 -07:00
Yuhong Guo
dfb7c2be1e [Java] Add Plasma Free to Java code path (#2802) 2018-09-04 15:28:23 +08:00
Eric Liang
25ffe57a5c [rllib] Auto-synchronize filters for all agents (#2791)
This makes sure we always update the local filter, and adds an option to synchronize the remote filters as well. In APEX_DDPG we previously didn't do either. The first is needed for checkpoint correctness, the second might help performance.
2018-09-03 20:01:53 -07:00
Philipp Moritz
a34a7172b4 Remove gflags (#2813)
Seems like gflags is not needed. This *might* remove writing spurious files into the home directory on the RISE infrastructure.
2018-09-03 16:10:47 -07:00
Eric Liang
01b030bd57 [rllib] throw an error for continuous action spaces in IMPALA
We currently don't support this since the reference vtrace.py does not, though it could be an interesting extension.
2018-09-03 11:12:55 -07:00
Eric Liang
df4788e501 [rllib/tune] Add test for fractional gpu support in xray mode; add rllib support for fractional gpu (#2768)
* frac gpu

* doc

* Update rllib-training.rst

* yapf

* remove xray
2018-09-03 11:12:23 -07:00
Hao Chen
9d655721e5 [java] support creating an actor with parameters (#2817)
Previously `Ray.createActor` only supported creating an actor without any parameters. This PR adds support for creating an actor with parameters. Moreover, besides using a constructor, it is now also possible to create an actor with a factory method. For more usage, please refer to `ActorTest.java`.
2018-09-03 09:53:03 -07:00
Eric Liang
b37a283053 [rllib] support local mode (#2795) 2018-09-02 23:02:19 -07:00
Robert Nishihara
0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00
Robert Nishihara
c71bbbc3af Add test (currently skipped) that drivers release resources when exiting. (#2797)
* Add test (currently skipped) that drivers release resources when exiting.

* Add test for ungraceful driver exit.

* Small fix.

* Small fix
2018-09-02 17:34:48 -07:00
Robert Nishihara
e5fd1d55a1 Ignore failing global scheduler valgrind test. (#2805) 2018-09-02 15:23:32 -07:00
Hao Chen
3b0a2c4197 [Java] improve Java API module (#2783)
The API module (`ray/java/api` dir) includes all public APIs provided by Ray. It should be the only module that normal Ray users need to work with.

The purpose of this PR is to first improve the code quality of the API module. Subsequent PRs will improve other modules later. The changes in this PR include the following aspects:
1) Only keep interfaces in api module, to hide implementation details from users and fix circular dependencies among modules.
2) Document everything in the api module. 
3) Improve naming.
4) Add more tests for API. 
5) Also fix/improve related code in other modules.
6) Remove some unused code.

(Apologies for posting such a large PR. The Java worker code has lacked maintenance for a while, and there are a lot of code quality issues that need to be fixed. We plan to use a couple of large PRs to address them. After that, future changes will come in small PRs.)
2018-09-02 11:51:16 -07:00