6870 commits

Author SHA1 Message Date
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Eric Liang
2e04ffe00c Change dict serialization warning to debug (#3230) 2018-11-06 21:23:07 -08:00
eugenevinitsky
344b4ef0ff [rllib] Fix filter sync for ES and ARS (#2918) 2018-11-06 19:09:34 -08:00
Eric Liang
725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Peter Schafhalter
f3efcd2342 Fix password authentication in worker (#3124) 2018-11-06 13:40:03 -08:00
Eric Liang
8356a01dd6 Remove suppressing duplicate error message (missed a couple) 2018-11-05 23:37:14 -08:00
Eric Liang
80f63696ac Cap object store memory to 20GB when size is None (#3243)
* Update services.py

* Update services.py
2018-11-05 18:34:19 -08:00
Eric Liang
813f51769f [rllib] Fix rllib rollouts script and add test (#3211)
## What do these changes do?

Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py

## Related issue number

https://github.com/ray-project/ray/issues/3206
https://github.com/ray-project/ray/issues/3204
2018-11-05 00:33:25 -08:00
xutianming
fb6ac28b44 single sourcing the package version (#3220) 2018-11-04 13:53:55 -08:00
Eric Liang
369cb833fe [rllib] Implement custom metrics (#3144) 2018-11-03 18:48:32 -07:00
Eric Liang
7d69c77a19 [rllib] Decouple ape-x sampling and learning speed 2018-11-03 18:07:39 -07:00
Eric Liang
9a0f0db070 Add ray stack tool for debugging (#3213) 2018-11-03 13:13:02 -07:00
Wang Qing
ca7d4c2cf5 Enable to specify driver id by user. (#3084) 2018-11-02 19:01:50 -07:00
Robert Nishihara
e495ab5e7c Fix some paths /tmp/raylogs -> /tmp/ray. (#3189) 2018-11-02 12:10:53 -07:00
Robert Nishihara
5822aa2388 Rename get_task -> worker_idle in timeline. (#3179)
* Rename get_task -> worker_idle in timeline.

* Fix test.
2018-11-02 12:08:46 -07:00
Eric Liang
2bef9844bf Revert "[autoscaler] Also grant roles to worker nodes" (#3199)
This reverts commit 55d161b49f.
2018-11-01 23:23:06 -07:00
Robert Nishihara
e612e26103 Add use_raylet option for backwards compatibility. (#3176)
* Add use_raylet option for backwards compatibility.

* Update message.
2018-11-01 14:16:04 -07:00
Eric Liang
b2caed9651 [minor] fix a3c pytorch example dim 80 => 84 2018-10-31 22:00:14 -07:00
Eric Liang
cd284bb487 [rllib] Document env compatibility, Ape-X support for multi-agent (#3147) 2018-10-31 21:59:34 -07:00
Richard Liaw
2086a57e61 [tune] Add Fractional GPU example/docs (#3169)
* Add example for fractional GPU support

* Update tune_mnist_keras.py

* Update doc/source/tune-usage.rst
2018-10-31 18:53:16 -07:00
Robert Nishihara
1f29a960f4 Update task_table and object_table API. (#3161)
* Update task_table and object_table API.

* Fix
2018-10-31 12:52:50 -07:00
Dennis Chung
9df2e6e6f4 [tune] Modify stop criteria in hyperopt example (#3102)
Change `training_iteration` to `timesteps_total` because only `timesteps_total` is available in the reporter.
2018-10-30 13:26:40 -07:00
Eric Liang
a221f55b0d [rllib] Add custom value functions, fix up and document multi-agent variable sharing (#3151) 2018-10-29 19:37:27 -07:00
Robert Nishihara
e49839c73f Fix linting. (#3155) 2018-10-28 20:43:29 -07:00
Robert Nishihara
32f0d6b77e Deprecate num_workers argument to ray.init and ray start. (#3114)
* Remove num_workers argument.

* Fix

* Fix
2018-10-28 20:12:49 -07:00
Robert Nishihara
9868af4c7c Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149)
* Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small.

* Add logging statement and address comments.

* Fix
2018-10-28 20:09:06 -07:00
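The fallback described in commit 9868af4c7c above (prefer /dev/shm, use /tmp when /dev/shm is too small) can be sketched roughly as follows. This is only an illustration and not the code from that commit: the function name `choose_object_store_dir`, its parameters, and the injectable `free_bytes` hook are all hypothetical.

```python
import shutil


def choose_object_store_dir(requested_bytes, shm_dir="/dev/shm",
                            fallback_dir="/tmp", free_bytes=None):
    """Pick a directory for an object store of `requested_bytes` bytes.

    Hypothetical helper, not Ray's implementation. `free_bytes` is an
    optional callable (path -> free bytes) so the logic can be exercised
    without touching the real filesystem.
    """
    if free_bytes is None:
        # shutil.disk_usage returns a (total, used, free) named tuple.
        free_bytes = lambda path: shutil.disk_usage(path).free

    try:
        available = free_bytes(shm_dir)
    except OSError:
        # The shared-memory directory does not exist or cannot be read.
        return fallback_dir

    # Use shared memory only when it can hold the whole object store.
    if available >= requested_bytes:
        return shm_dir
    return fallback_dir
```

Calling `choose_object_store_dir(10 * 1024**3)` on a machine whose /dev/shm has less than 10 GiB free would return "/tmp" under these assumptions.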
Robert Nishihara
08fc9e5bcd Add more description to setup.py. (#3153) 2018-10-28 19:49:52 -07:00
Robert Nishihara
fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Eric Liang
a404401dc6 Update agent.py to fix lint error 2018-10-28 15:28:08 -07:00
Jones Wong
d6bf890648 Solve hang caused by ray.get in collect_metrics (#3096) 2018-10-28 11:52:18 -07:00
Eric Liang
af0c1174cd [sgd] Merge sharded param server based SGD implementation (#3033)
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.

$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
  --devices-per-worker=M --strategy=<simple|ps> \
  --warmup --object-store-memory=10000000000

Images per second total
gpus total              | simple | ps
========================================
1                       | 218
2 (1 worker)            | 388
4 (1 worker)            | 759
4 (2 workers)           | 176    | 623
8 (1 worker)            | 985
8 (2 workers)           | 349    | 1031
16 (2 nodes, 2 workers) | 600    | 1661
16 (2 nodes, 4 workers) | 468    | 1712   <--- OSDI perf was 1817
2018-10-27 21:25:02 -07:00
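The throughput table in commit af0c1174cd above can be checked with a little arithmetic. The sketch below uses only the numbers printed in that table; the variable and function names are made up for illustration. It computes how much faster the `ps` strategy is than `simple` in each multi-worker configuration, and how far the best 16-GPU result falls short of the quoted OSDI figure.

```python
# Images per second, copied from the table in the commit message.
# Only rows that list both strategies are included.
RESULTS = {
    "4 GPUs (2 workers)": {"simple": 176, "ps": 623},
    "8 GPUs (2 workers)": {"simple": 349, "ps": 1031},
    "16 GPUs (2 nodes, 2 workers)": {"simple": 600, "ps": 1661},
    "16 GPUs (2 nodes, 4 workers)": {"simple": 468, "ps": 1712},
}
OSDI_REFERENCE = 1817  # from the "OSDI perf was 1817" annotation


def ps_speedup(row):
    """Ratio of `ps` throughput to `simple` throughput for one table row."""
    return row["ps"] / row["simple"]


def shortfall(measured, reference):
    """Fraction by which `measured` falls short of `reference`."""
    return (reference - measured) / reference


speedups = {name: round(ps_speedup(row), 2) for name, row in RESULTS.items()}
gap_percent = round(
    100 * shortfall(RESULTS["16 GPUs (2 nodes, 4 workers)"]["ps"], OSDI_REFERENCE), 1
)

for name, value in speedups.items():
    print(f"{name}: ps is {value}x simple")
print(f"Best 16-GPU ps result is {gap_percent}% below the OSDI figure")
```

With these inputs the shortfall comes out at about 5.8%, so the "within 5%" in the commit message is best read as an approximation.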
Eric Liang
6531eed2d0 [rllib] Better error message when action space dim too high (#3119) 2018-10-26 16:55:00 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Eric Liang
055daf17a0 [autoscaler] better message if there are more than 10 key pairs 2018-10-26 12:42:11 -07:00
Philipp Moritz
d3148cc3ab [SGD] Provide better error message if model graphs have different numbers of variables (#3139) 2018-10-25 22:18:10 -07:00
Robert Nishihara
5aa29613db Fix linting errors. (#3127) 2018-10-24 16:30:00 -07:00
Eric Liang
55d161b49f [autoscaler] Also grant roles to worker nodes 2018-10-24 13:57:36 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Robert Nishihara
9d2e864caf Fix Python linting error. (#3113) 2018-10-22 23:41:42 -07:00
Eric Liang
73a092e08c update (#3112) 2018-10-22 22:55:43 -07:00
Richard Liaw
eff7cb4458 [tune] Fix SearchAlg finishing early (#3081)
* Fix trial search alg finishing early

* Fix lint

* fix lint

* nit fix
2018-10-22 12:17:13 -07:00
Eric Liang
221d1663c1 [rllib] switch to python logger (#3098)
* logg

* set rllib logger

* comment

* info

* rlib

* comment

* add format

* fix lint

* add file info

* update

* add ts

* lint

* better docs

* fix value error

* soft log level
2018-10-21 23:43:57 -07:00
Richard Liaw
40c4148d4f Cluster Utilities for Fault Tolerance Tests (#3008) 2018-10-20 22:56:29 -07:00
Eric Liang
59901a88a0 [rllib] Native support for Dict and Tuple spaces; fix Tuple action spaces; add prev a, r to LSTM (#3051) 2018-10-20 15:21:22 -07:00
Peter Schafhalter
fa469783d8 Fix bug when connecting to password-secured cluster (#3083) 2018-10-18 21:43:03 -07:00
Devin Petersohn
8fcdafc6ea Adding Python3.7 wheels support (#2546)
* Adding Python3.7 wheels support

* Adding Mac wheels update

* fix

* numpy version

* choose different numpy versions depending on python version

* fix
2018-10-18 17:58:39 -07:00
Peter Schafhalter
b82fd157a7 Remove Redis protected mode (#3073)
Follow-up to #2925 and #2952. Removes the Redis protected mode implementation from Ray which was replaced by Redis port authentication.
2018-10-17 22:48:14 -07:00
Philipp Moritz
2c52d9dfa0 Fix actor handle id creation when actor handle was pickled (#3074) 2018-10-17 18:00:52 -07:00
Richard Liu
3c0803e7e9 [rllib] use ray.wait to get next worker result in async sample optimizer (#2993) 2018-10-17 17:44:51 -07:00
Peter Schafhalter
a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Address comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00