Commit graph

2152 commits

Author SHA1 Message Date
Eric Liang
2e04ffe00c Change dict serialization warning to debug (#3230) 2018-11-06 21:23:07 -08:00
Stephanie Wang
ca585703b2 Refactor ObjectDirectory to reduce and fix callback usage (#3227) 2018-11-06 20:33:10 -08:00
eugenevinitsky
344b4ef0ff [rllib] Fix filter sync for ES and ARS (#2918) 2018-11-06 19:09:34 -08:00
Eric Liang
725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Peter Schafhalter
f3efcd2342 Fix password authentication in worker (#3124) 2018-11-06 13:40:03 -08:00
Eric Liang
8356a01dd6 Remove suppressing duplicate error message (missed a couple) 2018-11-05 23:37:14 -08:00
Eric Liang
80f63696ac Cap object store memory to 20GB when size is None (#3243)
* Update services.py

* Update services.py
2018-11-05 18:34:19 -08:00
Wang Qing
4968cc5d70 Fix a small typo (#3240) 2018-11-05 18:30:53 -08:00
Stephanie Wang
bf88aa5013 Increase timeout before reconstruction is triggered (#3217)
* Increase timeout to 10s

* Skip eviction reconstruction tests

* Add stress test for many actors to one

* Fix test by shortening it.

* lower number of processes in stress test

* Skip slow test
2018-11-05 18:03:50 -08:00
Ion
d8ae9de99c Caching task resource requirements. (#3231)
* caching resource requirements

* small fixes

* avoid copying the resource map
2018-11-05 15:14:09 -08:00
Eric Liang
813f51769f [rllib] Fix rllib rollouts script and add test (#3211)
## What do these changes do?

Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py

## Related issue number

https://github.com/ray-project/ray/issues/3206
https://github.com/ray-project/ray/issues/3204
2018-11-05 00:33:25 -08:00
Philipp Moritz
99bac44375 Update CMake to support Mac OS X 10.14 (#3218) 2018-11-04 16:32:58 -08:00
xutianming
fb6ac28b44 single sourcing the package version (#3220) 2018-11-04 13:53:55 -08:00
Eric Liang
369cb833fe [rllib] Implement custom metrics (#3144) 2018-11-03 18:48:32 -07:00
Eric Liang
7d69c77a19 [rllib] Decouple ape-x sampling and learning speed 2018-11-03 18:07:39 -07:00
Philipp Moritz
0da15b1c1f Fix build system dependency for local_scheduler_client (#3215) 2018-11-03 13:19:02 -07:00
Eric Liang
9a0f0db070 Add ray stack tool for debugging (#3213) 2018-11-03 13:13:02 -07:00
Wang Qing
ca7d4c2cf5 Enable to specify driver id by user. (#3084) 2018-11-02 19:01:50 -07:00
Si-Yuan
5ce7ed7dad Fix 'tempfile' docs (#3180)
* Fix docs.

* Update doc/source/tempfile.rst

Co-Authored-By: suquark <suquark@gmail.com>

* Remove doc for raylet socket.
2018-11-02 16:50:55 -07:00
Eric Liang
8c03683573 Add warning about using latest wheels (#3207) 2018-11-02 15:41:10 -07:00
Robert Nishihara
e495ab5e7c Fix some paths /tmp/raylogs -> /tmp/ray. (#3189) 2018-11-02 12:10:53 -07:00
Robert Nishihara
5822aa2388 Rename get_task -> worker_idle in timeline. (#3179)
* Rename get_task -> worker_idle in timeline.

* Fix test.
2018-11-02 12:08:46 -07:00
Eric Liang
2bef9844bf Revert "[autoscaler] Also grant roles to worker nodes" (#3199)
This reverts commit 55d161b49f.
2018-11-01 23:23:06 -07:00
Robert Nishihara
e612e26103 Add use_raylet option for backwards compatibility. (#3176)
* Add use_raylet option for backwards compatibility.

* Update message.
2018-11-01 14:16:04 -07:00
Robert Nishihara
57d6e98302 Update actor fault tolerance documentation. (#3175) 2018-11-01 11:52:05 -07:00
Robert Nishihara
60f28040ea Document fractional resources. (#3174) 2018-11-01 10:50:56 -07:00
Eric Liang
b2caed9651 [minor] fix a3c pytorch example dim 80 => 84 2018-10-31 22:00:14 -07:00
Eric Liang
cd284bb487 [rllib] Document env compatibility, Ape-X support for multi-agent (#3147) 2018-10-31 21:59:34 -07:00
Richard Liaw
2086a57e61 [tune] Add Fractional GPU example/docs (#3169)
* Add example for fractional GPU support

* Update tune_mnist_keras.py

* Update doc/source/tune-usage.rst
2018-10-31 18:53:16 -07:00
Robert Nishihara
1f29a960f4 Update task_table and object_table API. (#3161)
* Update task_table and object_table API.

* Fix
2018-10-31 12:52:50 -07:00
Dennis Chung
9df2e6e6f4 [tune] Modify stop criteria in hyperopt example (#3102)
Modify `training_iteration` to `timesteps_total` because only `timesteps_total` is inside the reporter.
2018-10-30 13:26:40 -07:00
Stephanie Wang
aacbd007a0 [xray] Implement faster flush policy for lineage cache (#3071)
* Policy that flushes the lineage stash immediately

* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to, add reg test

* test

* Fix bug where waiting task was getting subscribed

* Cleanup

* Update src/ray/raylet/lineage_cache.cc

Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>

* Update src/ray/raylet/lineage_cache.cc

Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>

* cleanup

* cleanup

* Add another test for task with many parents

* fix, unsubscribe to new waiting tasks

* Unsubscribe as soon as the commit notification is handled
2018-10-30 09:59:50 -07:00
Eric Liang
a221f55b0d [rllib] Add custom value functions, fix up and document multi-agent variable sharing (#3151) 2018-10-29 19:37:27 -07:00
Robert Nishihara
e49839c73f Fix linting. (#3155) 2018-10-28 20:43:29 -07:00
Robert Nishihara
32f0d6b77e Deprecate num_workers argument to ray.init and ray start. (#3114)
* Remove num_workers argument.

* Fix

* Fix
2018-10-28 20:12:49 -07:00
Robert Nishihara
9868af4c7c Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149)
* Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small.

* Add logging statement and address comments.

* Fix
2018-10-28 20:09:06 -07:00
Robert Nishihara
08fc9e5bcd Add more description to setup.py. (#3153) 2018-10-28 19:49:52 -07:00
Robert Nishihara
fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Eric Liang
a404401dc6 Update agent.py to fix lint error 2018-10-28 15:28:08 -07:00
Jones Wong
d6bf890648 Solve hang caused by ray.get in collect_metrics (#3096) 2018-10-28 11:52:18 -07:00
Eric Liang
af0c1174cd [sgd] Merge sharded param server based SGD implementation (#3033)
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.

$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
  --devices-per-worker=M --strategy=<simple|ps> \
  --warmup --object-store-memory=10000000000

Images per second total
gpus total              | simple | ps
========================================
1                       | 218
2 (1 worker)            | 388
4 (1 worker)            | 759
4 (2 workers)           | 176    | 623
8 (1 worker)            | 985
8 (2 workers)           | 349    | 1031
16 (2 nodes, 2 workers) | 600    | 1661
16 (2 nodes, 4 workers) | 468    | 1712   <--- OSDI perf was 1817
2018-10-27 21:25:02 -07:00
Yuhong Guo
befbf78048 Delete empty pubsub keys (#3146)
We found a large number of pub-sub keys with no content (this is worse when wait-id is used in the key name).
The logic for deleting empty pub-sub keys from GCS existed in legacy Ray but not in raylet.
2018-10-27 11:58:39 -07:00
Eric Liang
6531eed2d0 [rllib] Better error message when action space dim too high (#3119) 2018-10-26 16:55:00 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Eric Liang
055daf17a0 [autoscaler] better message if there are more than 10 key pairs 2018-10-26 12:42:11 -07:00
bibabolynn
b4614ae69a [java] customize path of ray.conf (#3100)
Users can set a custom path for ray.conf by passing -Dray.config=/path/to/ray.conf
2018-10-26 13:36:34 +08:00
Philipp Moritz
d3148cc3ab [SGD] Provide better error message if model graphs have different numbers of variables (#3139) 2018-10-25 22:18:10 -07:00
Philipp Moritz
d34516f1f8 Update Gemfile Jekyll version (#3140) 2018-10-25 21:43:08 -07:00
Robert Nishihara
5aa29613db Fix linting errors. (#3127) 2018-10-24 16:30:00 -07:00
Eric Liang
55d161b49f [autoscaler] Also grant roles to worker nodes 2018-10-24 13:57:36 -07:00