Commit graph

1027 commits

Author SHA1 Message Date
Philipp Moritz
8894883153 Force kill web UI in ray stop (#3257) 2018-11-08 00:05:32 -08:00
Eric Liang
9b2794101d
[minor] Change chunk already exists to DEBUG, add flags for rllib multi node testing (#3228) 2018-11-08 00:04:20 -08:00
Stephanie Wang
d950e92f63
Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Richard Liaw
0bab8ed95c
Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
2018-11-07 21:46:02 -08:00
Eric Liang
43df405d07 [rllib] Add some debug logs during agent setup (#3247) 2018-11-07 14:54:28 -08:00
Richard Liaw
cf9e838326
[tune] Raise Error when overstepping (#3235) 2018-11-07 14:27:09 -08:00
Eric Liang
29e3362905
Better errors on process deaths (#3252) 2018-11-07 14:08:16 -08:00
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Eric Liang
2e04ffe00c Change dict serialization warning to debug (#3230) 2018-11-06 21:23:07 -08:00
eugenevinitsky
344b4ef0ff [rllib] Fix filter sync for ES and ARS (#2918) 2018-11-06 19:09:34 -08:00
Eric Liang
725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Peter Schafhalter
f3efcd2342 Fix password authentication in worker (#3124) 2018-11-06 13:40:03 -08:00
Eric Liang
8356a01dd6
Remove suppressing duplicate error message (missed a couple) 2018-11-05 23:37:14 -08:00
Eric Liang
80f63696ac Cap object store memory to 20GB when size is None (#3243)
* Update services.py

* Update services.py
2018-11-05 18:34:19 -08:00
Eric Liang
813f51769f [rllib] Fix rllib rollouts script and add test (#3211)
## What do these changes do?

Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py

## Related issue number

https://github.com/ray-project/ray/issues/3206
https://github.com/ray-project/ray/issues/3204
2018-11-05 00:33:25 -08:00
xutianming
fb6ac28b44 single sourcing the package version (#3220) 2018-11-04 13:53:55 -08:00
Eric Liang
369cb833fe
[rllib] Implement custom metrics (#3144) 2018-11-03 18:48:32 -07:00
Eric Liang
7d69c77a19
[rllib] Decouple ape-x sampling and learning speed 2018-11-03 18:07:39 -07:00
Eric Liang
9a0f0db070 Add ray stack tool for debugging (#3213) 2018-11-03 13:13:02 -07:00
Wang Qing
ca7d4c2cf5 Enable to specify driver id by user. (#3084) 2018-11-02 19:01:50 -07:00
Robert Nishihara
e495ab5e7c Fix some paths /tmp/raylogs -> /tmp/ray. (#3189) 2018-11-02 12:10:53 -07:00
Robert Nishihara
5822aa2388 Rename get_task -> worker_idle in timeline. (#3179)
* Rename get_task -> worker_idle in timeline.

* Fix test.
2018-11-02 12:08:46 -07:00
Eric Liang
2bef9844bf
Revert "[autoscaler] Also grant roles to worker nodes" (#3199)
This reverts commit 55d161b49f.
2018-11-01 23:23:06 -07:00
Robert Nishihara
e612e26103 Add use_raylet option for backwards compatibility. (#3176)
* Add use_raylet option for backwards compatibility.

* Update message.
2018-11-01 14:16:04 -07:00
Eric Liang
b2caed9651
[minor] fix a3c pytorch example dim 80 => 84 2018-10-31 22:00:14 -07:00
Eric Liang
cd284bb487
[rllib] Document env compatibility, Ape-X support for multi-agent (#3147) 2018-10-31 21:59:34 -07:00
Richard Liaw
2086a57e61
[tune] Add Fractional GPU example/docs (#3169)
* Add example for fractional GPU support

* Update tune_mnist_keras.py

* Update doc/source/tune-usage.rst
2018-10-31 18:53:16 -07:00
Robert Nishihara
1f29a960f4 Update task_table and object_table API. (#3161)
* Update task_table and object_table API.

* Fix
2018-10-31 12:52:50 -07:00
Dennis Chung
9df2e6e6f4 [tune] Modify stop criteria in hyperopt example (#3102)
Modify `training_iteraion` to `timesteps_total` because only `timesteps_total` is inside the reporter.
2018-10-30 13:26:40 -07:00
Eric Liang
a221f55b0d
[rllib] Add custom value functions, fix up and document multi-agent variable sharing (#3151) 2018-10-29 19:37:27 -07:00
Robert Nishihara
e49839c73f Fix linting. (#3155) 2018-10-28 20:43:29 -07:00
Robert Nishihara
32f0d6b77e Deprecate num_workers argument to ray.init and ray start. (#3114)
* Remove num_workers argument.

* Fix

* Fix
2018-10-28 20:12:49 -07:00
Robert Nishihara
9868af4c7c Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149)
* Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small.

* Add logging statement and address comments.

* Fix
2018-10-28 20:09:06 -07:00
Robert Nishihara
08fc9e5bcd Add more description to setup.py. (#3153) 2018-10-28 19:49:52 -07:00
Robert Nishihara
fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Eric Liang
a404401dc6
Update agent.py to fix lint error 2018-10-28 15:28:08 -07:00
Jones Wong
d6bf890648 Solve hang caused by ray.get in collect_metrics (#3096) 2018-10-28 11:52:18 -07:00
Eric Liang
af0c1174cd
[sgd] Merge sharded param server based SGD implementation (#3033)
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.

$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
  --devices-per-worker=M --strategy=<simple|ps> \
  --warmup --object-store-memory=10000000000

Images per second total
gpus total              | simple | ps
========================================
1                       | 218
2 (1 worker)            | 388
4 (1 worker)            | 759
4 (2 workers)           | 176    | 623
8 (1 worker)            | 985
8 (2 workers)           | 349    | 1031
16 (2 nodes, 2 workers) | 600    | 1661
16 (2 nodes, 4 workers) | 468    | 1712   <--- OSDI perf was 1817
2018-10-27 21:25:02 -07:00
Eric Liang
6531eed2d0 [rllib] Better error message when action space dim too high (#3119) 2018-10-26 16:55:00 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Eric Liang
055daf17a0
[autoscaler] better message if there are more than 10 key pairs 2018-10-26 12:42:11 -07:00
Philipp Moritz
d3148cc3ab [SGD] Provide better error message if model graphs have different numbers of variables (#3139) 2018-10-25 22:18:10 -07:00
Robert Nishihara
5aa29613db Fix linting errors. (#3127) 2018-10-24 16:30:00 -07:00
Eric Liang
55d161b49f
[autoscaler] Also grant roles to worker nodes 2018-10-24 13:57:36 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Robert Nishihara
9d2e864caf Fix Python linting error. (#3113) 2018-10-22 23:41:42 -07:00
Eric Liang
73a092e08c
update (#3112) 2018-10-22 22:55:43 -07:00
Richard Liaw
eff7cb4458 [tune] Fix SearchAlg finishing early (#3081)
* Fix trial search alg finishing early

* Fix lint

* fix lint

* nit fix
2018-10-22 12:17:13 -07:00
Eric Liang
221d1663c1
[rllib] switch to python logger (#3098)
* logg

* set rllib logger

* comment

* info

* rlib

* comment

* add format

* fix lint

* add file info

* update

* add ts

* lint

* better docs

* fix value error

* soft log level
2018-10-21 23:43:57 -07:00
Richard Liaw
40c4148d4f Cluster Utilities for Fault Tolerance Tests (#3008) 2018-10-20 22:56:29 -07:00