995 commits

Author SHA1 Message Date
Robert Nishihara
9868af4c7c Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149)
* Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small.

* Add logging statement and address comments.

* Fix
2018-10-28 20:09:06 -07:00
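The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration and not the code from #3149: the function name `choose_object_store_dir`, its parameters, and the free-space check via `shutil.disk_usage` are all assumptions made for the example.

```python
import logging
import os
import shutil
import tempfile

logger = logging.getLogger(__name__)


def choose_object_store_dir(required_bytes, shm_dir="/dev/shm", fallback_dir=None):
    """Pick a directory for the object store (hypothetical helper).

    Returns shm_dir when it exists and has at least required_bytes free;
    otherwise logs a warning and returns fallback_dir (the system temp
    directory by default).
    """
    if fallback_dir is None:
        fallback_dir = tempfile.gettempdir()
    if os.path.isdir(shm_dir):
        free_bytes = shutil.disk_usage(shm_dir).free
        if free_bytes >= required_bytes:
            return shm_dir
        logger.warning(
            "%s has only %d bytes free but %d were requested; using %s instead.",
            shm_dir, free_bytes, required_bytes, fallback_dir)
    return fallback_dir
```

A disk-backed fallback is likely slower than shared memory, which is presumably why the commit also adds a logging statement.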
Robert Nishihara
08fc9e5bcd Add more description to setup.py. (#3153) 2018-10-28 19:49:52 -07:00
Robert Nishihara
fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Eric Liang
a404401dc6 Update agent.py to fix lint error 2018-10-28 15:28:08 -07:00
Jones Wong
d6bf890648 Solve hang caused by ray.get in collect_metrics (#3096) 2018-10-28 11:52:18 -07:00
Eric Liang
af0c1174cd [sgd] Merge sharded param server based SGD implementation (#3033)
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.

$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
  --devices-per-worker=M --strategy=<simple|ps> \
  --warmup --object-store-memory=10000000000

Images per second total
gpus total              | simple | ps
========================================
1                       | 218
2 (1 worker)            | 388
4 (1 worker)            | 759
4 (2 workers)           | 176    | 623
8 (1 worker)            | 985
8 (2 workers)           | 349    | 1031
16 (2 nodes, 2 workers) | 600    | 1661
16 (2 nodes, 4 workers) | 468    | 1712   <--- OSDI perf was 1817
2018-10-27 21:25:02 -07:00
Eric Liang
6531eed2d0 [rllib] Better error message when action space dim too high (#3119) 2018-10-26 16:55:00 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Eric Liang
055daf17a0 [autoscaler] better message if there are more than 10 key pairs 2018-10-26 12:42:11 -07:00
Philipp Moritz
d3148cc3ab [SGD] Provide better error message if model graphs have different numbers of variables (#3139) 2018-10-25 22:18:10 -07:00
Robert Nishihara
5aa29613db Fix linting errors. (#3127) 2018-10-24 16:30:00 -07:00
Eric Liang
55d161b49f [autoscaler] Also grant roles to worker nodes 2018-10-24 13:57:36 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Robert Nishihara
9d2e864caf Fix Python linting error. (#3113) 2018-10-22 23:41:42 -07:00
Eric Liang
73a092e08c update (#3112) 2018-10-22 22:55:43 -07:00
Richard Liaw
eff7cb4458 [tune] Fix SearchAlg finishing early (#3081)
* Fix trial search alg finishing early

* Fix lint

* fix lint

* nit fix
2018-10-22 12:17:13 -07:00
Eric Liang
221d1663c1 [rllib] switch to python logger (#3098)
* logg

* set rllib logger

* comment

* info

* rlib

* comment

* add format

* fix lint

* add file info

* update

* add ts

* lint

* better docs

* fix value error

* soft log level
2018-10-21 23:43:57 -07:00
Richard Liaw
40c4148d4f Cluster Utilities for Fault Tolerance Tests (#3008) 2018-10-20 22:56:29 -07:00
Eric Liang
59901a88a0 [rllib] Native support for Dict and Tuple spaces; fix Tuple action spaces; add prev a, r to LSTM (#3051) 2018-10-20 15:21:22 -07:00
Peter Schafhalter
fa469783d8 Fix bug when connecting to password-secured cluster (#3083) 2018-10-18 21:43:03 -07:00
Devin Petersohn
8fcdafc6ea Adding Python3.7 wheels support (#2546)
* Adding Python3.7 wheels support

* Adding Mac wheels update

* fix

* numpy version

* choose different numpy versions depending on python version

* fix
2018-10-18 17:58:39 -07:00
Peter Schafhalter
b82fd157a7 Remove Redis protected mode (#3073)
Follow-up to #2925 and #2952. Removes the Redis protected mode implementation from Ray which was replaced by Redis port authentication.
2018-10-17 22:48:14 -07:00
Philipp Moritz
2c52d9dfa0 Fix actor handle id creation when actor handle was pickled (#3074) 2018-10-17 18:00:52 -07:00
Richard Liu
3c0803e7e9 [rllib] use ray.wait to get next worker result in async sample optimizer (#2993) 2018-10-17 17:44:51 -07:00
Peter Schafhalter
a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Address comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00
Eric Liang
a9e454f6fd [rllib] Include config dicts in the sphinx docs (#3064) 2018-10-16 15:55:11 -07:00
Praveen Palanisamy
4d8cfc0bf5 [tune] Fix (some more) misleading comments in tune/results.py (#3068)
## What do these changes do?

Fix the misleading comments in code for:
 - `EPISODES_THIS_ITER`
 - `EPISODES_TOTAL`

I had noted this before and planned to fix it along with some other changes, but it seemed relevant enough to land next to #3058, so I am sending it now.
2018-10-16 11:07:53 -07:00
Eric Liang
6240ccbc6e [rllib] Add more warnings when multi-agent envs might not be set up right (#3061) 2018-10-15 13:42:56 -07:00
Eric Liang
3c891c6ece [rllib] Parallel-data loading and multi-gpu support for IMPALA (#2766) 2018-10-15 11:02:50 -07:00
Marlon
4dc78b735b [tune] Fix misleading comment (#3058) 2018-10-14 22:25:39 -07:00
Eric Liang
866c7a574c [rllib] Don't crash printing out error message (#3054)
* fix er

* update
2018-10-13 19:50:23 -07:00
Eric Liang
473ee4eb3f [rllib] Add unit test and some better error messages for custom policy states (#3032) 2018-10-13 00:03:52 -07:00
Richard Liaw
f9b58d7b02 [tune] Tweaks to Trainable and Verbosity (#2889) 2018-10-11 23:42:13 -07:00
Kristian Hartikainen
2d35a97a76 Bug/log syncer fails with parentheses (#2653)
* Update rsync command

* Escape rsync locations

* Fix the accidental variable move

* Update rsync to use -s flag
2018-10-06 00:34:53 -07:00
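The kind of fix this commit describes (escaping rsync locations and passing `-s`) might look roughly like the sketch below. It is an illustration under assumptions, not the log syncer's actual code: the helper name `build_rsync_command`, the `-avz` flags, and the example paths are hypothetical. `shlex.quote` guards against the local shell interpreting characters such as parentheses, while rsync's `-s` (`--protect-args`) keeps the remote shell from interpreting them.

```python
import shlex


def build_rsync_command(source, target):
    """Build an rsync command string with shell-safe locations.

    Hypothetical helper: each location is quoted so that characters such
    as '(' and ')' are passed through literally by the local shell.
    """
    return "rsync -s -avz {} {}".format(shlex.quote(source), shlex.quote(target))


# Example with a directory name that contains parentheses.
command = build_rsync_command("host:/logs/run(1)/", "/tmp/run(1)/")
```

Building the command as an argument list and skipping the shell entirely would avoid local quoting altogether; the string form is shown only because it makes the escaping visible.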
Richard Liaw
ecd8f39580 [core] Improve logging message when plasma store is started. (#3029)
Improve logging message when plasma store is started.
2018-10-05 15:24:24 -07:00
Richard Liaw
0651d3b629 [tune/core] Use Global State API for resources (#3004) 2018-10-04 17:23:17 -07:00
Robert Nishihara
faa31ae018 Introduce concept of resources required for placing a task. (#2837)
* Introduce concept of resources required for placement.
* Add placement resources to task spec
* Update java worker
* Update taskinfo.java
2018-10-04 10:35:39 -07:00
Si-Yuan
f2dbd3096c Minor improvements and fixes in Python code. (#3022)
This commit fixes some small defects:
1. Remove a comment that should have been removed in #3003
2. Remove `redis_protected_mode`, which is never used in `ray.init()`
3. Fix `object_id_seed`, which was not being passed into `ray._init()`
4. Remove several redundant brackets.
2018-10-03 21:08:20 -07:00
Yuhong Guo
9948e8c11b Move function/actor exporting & loading code to function_manager.py (#3003)
Move function/actor exporting & loading code to function_manager.py to prepare the code change for function descriptor for python.
2018-10-03 16:21:04 -07:00
Robert Nishihara
d73ee36e60 Update links to use latest 0.5.3 wheels instead of 0.5.2. (#3018) 2018-10-03 13:43:40 -07:00
Si-Yuan
cc7e2ecdd5 Change logfile names and also allow plasma store socket to be passed in. (#2862) 2018-10-03 10:03:53 -07:00
Robert Nishihara
3ce8eb2d4c Test dying_worker_get and dying_worker_wait for xray. (#2997)
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.
2018-10-02 00:08:47 -07:00
Eric Liang
2019b4122b [rllib] Remove legacy multiagent support (#2975)
* remove legacy

* remove reshaper
2018-10-01 13:07:11 -07:00
Eric Liang
b45bed4bce [rllib] Propagate model options correctly in ARS / ES, to action dist of PPO (#2974)
* fix

* fix

* fix it

* propagate conf to action dist

* move carla example too

* rr

* Update policies.py

* wip

* lint
2018-10-01 12:49:39 -07:00
Eric Liang
e4bea8d10e [rllib] Default to truncate_episodes and add some more config validators (#2967)
* update

* link it

* warn about truncation

* fix

* Update rllib-training.rst

* deprecate tests failing
2018-09-30 18:37:55 -07:00
Eric Liang
814c35b7d7 [rllib] Simplify sample batch size and num envs config, n_step adjustment (#2995)
* simplify vec batch requirements

* Update rllib-training.rst

* Update rllib-training.rst

* Update rllib-training.rst

* Update rllib-training.rst

* Update rllib-training.rst

* Update rllib-models.rst
2018-09-30 18:36:22 -07:00
old-bear
8aa736572b [tune] Fix hyperband edge case for None entries (#2964) 2018-09-30 09:57:43 -07:00
Eric Liang
65dcafdc3f [rllib] Refactor save() / restore() code of agents and avoid O(n_workers) save size (#2982) 2018-09-30 01:15:13 -07:00
Eric Liang
747253e0f6 [rllib] Don't shuffle samples in PPO when using lstm 2018-09-30 01:13:56 -07:00
Eric Liang
b06c604a51 [rllib] Add some more tuned atari results to documentation (#2991)
* dqn results ++

* add scale

* hour

* fix

* small dqn table

* update

* steps

* upd

* apex

* up

* add apex results

* tip
2018-09-29 23:13:36 -07:00