1661 commits

Author SHA1 Message Date
Eric Liang
79d37ce240
[rllib] Switch to use lz4 instead of snappy (#1847)
* lz4

* comment

* updates
2018-04-07 14:43:45 -07:00
Eric Liang
e6c00b2b5e
[tune] Add util function to broadcast objects (#1845)
* add util

* Fri Apr  6 15:09:20 PDT 2018

* doc

* Fri Apr  6 15:21:42 PDT 2018

* Fri Apr  6 15:28:07 PDT 2018

* Fri Apr  6 15:28:26 PDT 2018

* Update tune-config.rst

* Update tune-config.rst
2018-04-07 11:37:14 -07:00
Richard Liaw
bc8f62c947
[tune] Fix Median Stopping Rule Verbosity (#1833) 2018-04-06 22:58:13 -07:00
Stephanie Wang
bef1d872b4
[xray] Cleanup Raylet processes on exit (#1839)
* Add raylet monitor script to timeout Raylet heartbeats

* Unit test for removing a different client from the client table

* Set node manager heartbeat according to global config

* Doc and fixes

* Add regression test for client table disconnect, refactor client table

* Convert 'Terminate' methods to destructors

* Destroy the Raylet on a SIGTERM

* Clean up workers on a SIGTERM
2018-04-06 17:21:51 -07:00
Melih Elibol
3bf80839cb Remove all runtime errors. (#1840) 2018-04-06 17:20:52 -07:00
Melih Elibol
c7e11e9057 lint fix. (#1842) 2018-04-06 13:28:52 -07:00
Melih Elibol
24a8cede88
Cache object info from store notification. (#1815)
Cache all object info from object added store notification & submit to GCS via object directory.
2018-04-06 02:33:23 -07:00
Stephanie Wang
bf194db4bc [xray] Basic actor support (#1835) 2018-04-06 00:17:14 -07:00
Melih Elibol
313b864e66
disconnect bug fix. (#1837) 2018-04-05 22:10:51 -07:00
Stephanie Wang
cbf3181fd2 [xray] Monitor for Raylet processes (#1831)
* Add raylet monitor script to timeout Raylet heartbeats

* Unit test for removing a different client from the client table

* Set node manager heartbeat according to global config

* Doc and fixes

* Add regression test for client table disconnect, refactor client table

* Fix linting.
2018-04-05 20:45:38 -07:00
Devin Petersohn
0d9a7a3c19 [DataFrame] Update architecture to be more flexible and performant (#1821) 2018-04-05 15:14:33 -07:00
Robert Nishihara
5bde5e75e7 Implement unsafe method for flushing entire object table and task table. (#1824)
* Implement unsafe method for flushing entire object table and task table.

* Add test.

* Fix test.
2018-04-04 18:29:24 -07:00
Richard Liaw
888e70f1be
[tune] HyperOpt Support (v2) (#1763) 2018-04-04 11:08:26 -07:00
Alexey Tumanov
5a9e83761d fix unused-lambda-capture on clang version 9.1 (#1823)
* fix unused-lambda-capture on clang9.1

* unused lambda capture fix continued

* lambda capture: NM

* lambda capture

* Fix linting.
2018-04-04 11:04:10 -07:00
Robert Nishihara
e0193a5501 Print backtrace for RAY_LOG(FATAL) and also add file and line number … (#1805)
* Print backtrace for RAY_LOG(FATAL) and also add file and line number in common case.

* Fix linting.
2018-04-03 10:12:46 -07:00
Robert Nishihara
fbfbb1c079 [xray] Integrate worker.py with raylet. (#1810)
* Integrate worker with raylet.

* Begin allowing worker to attach to cluster.

* Fix linting and documentation.

* Fix linting.

* Comment tests back in.

* Fix type of worker command.

* Remove xray python files and tests.

* Fix from rebase.

* Add test.

* Copy over raylet executable.

* Small cleanup.
2018-04-03 02:38:56 -07:00
Robert Nishihara
0fc989c6c1 Don't use 127.0.0.1 for local ip address. (#1596)
* Don't use 127.0.0.1 for ip address.

* Update test
2018-04-02 00:34:20 -07:00
Robert Nishihara
d3e974a9a4 Increase ulimit -n in autoscaler examples. (#1769) 2018-04-02 00:32:56 -07:00
Robert Nishihara
27a0d58e54 Include resource string in error message for infeasible actors. (#1768) 2018-04-02 00:31:30 -07:00
Robert Nishihara
23b8793f0e Update documentation and autoscaler to find 0.4.0. (#1789) 2018-04-02 00:28:47 -07:00
Robert Nishihara
5c86f34066 Add 0.4 release blog post. (#1794) 2018-04-02 00:23:56 -07:00
Philipp Moritz
71829a2af9 [XRay] Pass in node IP address to Raylet (#1808) 2018-04-02 00:21:19 -07:00
Philipp Moritz
0bda11e009 [XRay] Fix linting (#1809) 2018-04-01 23:11:06 -07:00
Melih Elibol
6e06a9e338 XRay Task Forwarding Milestone (#1785)
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.

Raylet Changes:

Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding.

LineageCache Changes:

LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.

ObjectManager Changes:

Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).

Testing Changes:

Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
2018-03-31 18:02:58 -07:00
Philipp Moritz
40c9b9cd60 Fix the setuptools_scm issue (#1784) 2018-03-31 10:33:40 -07:00
Eric Liang
faaa123046 [rllib] Set num_cpu=None for workers in the default settings (#1793) 2018-03-29 16:33:40 -07:00
Eric Liang
4116c64698
[tune] Remove rllib dep again, and add a test (#1792)
* tune should not depend on rllib

* fix dep test

* Tue Mar 27 16:55:41 PDT 2018

* f401
2018-03-29 15:36:49 -07:00
Stephanie Wang
925e392b2d Add an Append call to the GCS Log that checks for current length (#1788)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log

* AppendAt Redis and gcs Log command

* unit test for AppendAt

* Add a Log for task reconstruction data

* Add check for unique entries in TABLE_APPEND

* Documentation
2018-03-27 13:04:43 -07:00
Robert Nishihara
8d52fe931b Add experimental feature for flushing event logs and logfiles. (#1659)
* Add experimental feature for flushing event logs and logfiles.

* Add documentation.
2018-03-27 11:57:52 -07:00
Robert Nishihara
f69cbd35d4 Bump version to 0.4.0. (#1745) 2018-03-26 22:37:16 -07:00
Robert Nishihara
de3cfa223d Fix monitor.py bottleneck by removing excess Redis queries. (#1786)
* Fix monitor.py bottleneck by removing excess Redis queries.

* Remove unnecessary default value.
2018-03-26 22:30:38 -07:00
Stephanie Wang
51fdbe3867 Convert the ObjectTable implementation to a Log (#1779)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log
2018-03-26 20:36:48 -07:00
Robert Nishihara
1ab0d0ea69 Acquire worker lock when importing actor. (#1783) 2018-03-26 18:31:26 -07:00
Stephanie Wang
0fd4112354 Introduce a log interface for the new GCS (#1771)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint
2018-03-26 16:00:43 -07:00
Eric Liang
7c4afa4b04 [tune] Fix linting error (#1777) 2018-03-25 23:44:14 -07:00
Yan Facai (颜发才)
6b1e592d5c [tune] Added pbt with keras on cifar10 dataset example (#1729)
* [tune] Added pbt with keras on cifar10 dataset example

* ENH: add gpu resources

* CLN: requires 4 GPUs resource

* CLN: use single quotes

* CLN: don't save model by default
2018-03-25 15:57:23 -07:00
Stephanie Wang
0ad1054b8b
Add a GCS table for the xray task flatbuffer (#1775)
* Introduce Task flatbuffer into xray, add to GCS

* Compile and test raylet TaskTable
2018-03-23 13:18:23 -07:00
Eric Liang
72595cca0d [tune] Change tune resource request syntax to be less confusing (#1764)
* update

* update examples

* Wed Mar 21 15:19:56 PDT 2018

* Wed Mar 21 15:21:32 PDT 2018

* Update train_a3c.py

* Update train.py

* fix resources accounting
2018-03-23 06:25:01 -07:00
Robert Nishihara
10dabce4d7 Remove from X import Y convention in RLlib ES. (#1774) 2018-03-23 05:54:31 -07:00
Christian Barra
13b3df9321 Check if the provider is external before getting the config. (#1743) 2018-03-22 22:59:29 -07:00
Stephanie Wang
8704c8618c
Request and cancel notifications in the new GCS API (#1758)
* Add TableRequestNotifications and TableCancelNotifications to Redis modules

* Add RequestNotifications and CancelNotifications to generic GCS Table

* Add tests for subscribing to specific keys

* Remove TODO!

* Return the current value at the key directly from RequestNotifications instead of through publish

* Add unit test for Lookup failure callback

* Modify tests to account for empty subscription response

* Remove ObjectTable notification methods

* Clean up message parsing and doc in redis context

* Use vectors of DataT in all GCS callbacks

* Clean up SubscriptionCallback

* Move Table definitions into tables.cc

* Refactor and document redis modules

* doc

* Fix new GCS build

* Cleanups

* Revert "Fix new GCS build"

This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.

* Use vectors for internal callback interface, user-facing interface takes a reference to a single item

* Fix new GCS build

* Add unit test for Lookup failure callback

* Fix compiler errors

* Cleanup

* Publish the entry ID with the notification

* Check that the ID for a notification matches in client tests
2018-03-22 10:31:07 -07:00
Robert Nishihara
0c835a379f Fix resource bookkeeping for blocked actor methods. (#1766) 2018-03-21 20:48:04 -07:00
Robert Nishihara
c6ad71fc9d Fix bug when connecting another driver in local case. (#1760)
* Allow connecting another driver when using ip address 127.0.0.1.

* Add test.
2018-03-21 11:49:53 -07:00
Stephanie Wang
5c7ef34b05
Define string prefixes for all tables in the new GCS API (#1755)
* Define string prefixes for all tables in the new GCS API

* Extra check for TablePrefix enum

* Remove unused field and add doc for existing fields
2018-03-20 20:27:11 -07:00
Eric Liang
b41bdcefa0
[rllib] Update RLlib to work with new actor scheduling behavior (#1754)
* Mon Mar 19 21:23:01 PDT 2018

* Mon Mar 19 21:23:07 PDT 2018

* Mon Mar 19 21:30:49 PDT 2018

* Mon Mar 19 21:32:05 PDT 2018

* Mon Mar 19 21:35:43 PDT 2018

* fix cpu limits

* Mon Mar 19 22:25:07 PDT 2018
2018-03-20 19:29:52 -07:00
Robert Nishihara
4bccabd910 Redirect output of all processes by default. (#1752)
* Redirect output of all processes by default.

* Add separate flag for redirecting worker output.

* Fix tests.
2018-03-20 18:14:54 -07:00
Robert Nishihara
2922e1c388 Add API for getting total cluster resources. (#1736)
* Add API for getting total cluster resources.

* Add test.
2018-03-20 15:57:00 -07:00
Robert Nishihara
794f547d0a Always send actor creation tasks to the global scheduler. (#1757) 2018-03-20 14:55:20 -07:00
Robert Nishihara
4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara
73bb149c8a Remove unnecessary file. (#1742) 2018-03-19 19:36:18 -07:00