3771 commits

Author SHA1 Message Date
Eric Liang
b6c42f96be Auto-scale ray clusters based on GCS load metrics (#1348)
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

* Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource: for example, if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.

* We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.

* When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.

Note that we'll need to update the wheel in the example yaml file after this PR is merged.
2017-12-31 14:39:57 -08:00
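
The sizing rule described in the b6c42f96be message above can be sketched roughly as follows. This is an illustrative reading of the commit text, not the code from the PR: the function names (`used_workers`, `target_cluster_size`, `nodes_to_remove`) and the dictionary-based inputs are assumptions made for the example; only `target_utilization_fraction`, `max_workers`, `min_workers`, and `idle_timeout_minutes` come from the message itself.

```python
import math


def used_workers(used, total, num_nodes):
    """Approximate number of busy workers, driven by the bottleneck resource.

    `used` and `total` map a resource name to cluster-wide amounts
    (hypothetical input shape). The result may be fractional.
    """
    fractions = [used.get(r, 0) / total[r] for r in total if total[r] > 0]
    return max(fractions, default=0.0) * num_nodes


def target_cluster_size(used, total, num_nodes,
                        target_utilization_fraction, max_workers):
    """Scale used workers by 1 / target_utilization_fraction, round up, cap."""
    busy = used_workers(used, total, num_nodes)
    target = math.ceil(busy / target_utilization_fraction)
    return min(target, max_workers)


def nodes_to_remove(idle_minutes, idle_timeout_minutes, min_workers):
    """Pick idle nodes to terminate without dropping below min_workers.

    `idle_minutes` maps a node id to how long it has been idle
    (hypothetical input shape).
    """
    remaining = len(idle_minutes)
    removed = []
    for node, minutes in idle_minutes.items():
        if minutes > idle_timeout_minutes and remaining - 1 >= min_workers:
            removed.append(node)
            remaining -= 1
    return removed


if __name__ == "__main__":
    # All 8 GPUs busy on an 8-node cluster, CPUs idle: GPUs are the bottleneck.
    used = {"GPU": 8, "CPU": 0}
    total = {"GPU": 8, "CPU": 64}
    print(used_workers(used, total, num_nodes=8))
    print(target_cluster_size(used, total, 8,
                              target_utilization_fraction=0.5,
                              max_workers=20))
```

With a target utilization of 0.5, a fully loaded 8-node cluster asks for twice as many nodes, unless `max_workers` is lower than that.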
Robert Nishihara
e970e24ea5 Update arrow, and pass memcopy_threads into put. (#1374) 2017-12-31 13:32:06 -08:00
Richard Liaw
3304099cc4 [rllib] Evaluators and Optimizers Refactoring (#1339) 2017-12-30 00:24:54 -08:00
Eric Liang
22c7c87e14 [rllib] [tune] Custom preprocessors and models, various fixes (#1372) 2017-12-28 13:19:04 -08:00
Philipp Moritz
3d224c4edf Second Part of Internal API Refactor (#1326) 2017-12-26 16:22:04 -08:00
Richard Liaw
4bb5b6bd5b [rllib] A3C Configurations (#1370)
* initial introduction of a3c configs

* fix sample batch

* flake but need to check save

* save, restore

* fix

* pickles

* entropy

* fix

* moving ppo

* results

* jenkins
2017-12-24 12:25:13 -08:00
Richard Liaw
b217a5ef14 [rllib] Fix Pong-PPO tuned example Config (#1369) 2017-12-23 01:36:33 -08:00
Eric Liang
43e78217f8 Thu Dec 21 23:19:24 PST 2017 (#1367) 2017-12-22 17:29:45 -08:00
Robert Nishihara
22460ff7af Use Anaconda for autoscaling example and add example config for devel… (#1361)
* Use Anaconda for autoscaling example and add example config for development.

* Install Python2 for building the web ui.
2017-12-22 01:59:02 -08:00
Eric Liang
0ae660ce4e [carla] In carla example, save all images and measurements to local disk (#1350)
* revamp saving

* smaller jpgs

* hide verbose

* Tue Dec 19 22:25:01 PST 2017

* make sure temp dirs sort lexicographically

* save total reward too

* zero pad i

* 160x160 dqn

* ever higher res dqn
2017-12-21 15:19:55 -08:00
Philipp Moritz
3a301c3d56 Fix pyarrow version check (#1360) 2017-12-21 13:00:36 -08:00
Devin Petersohn
a75a473d7f Add a distributed Dataframe API to Ray (#1330)
* Adding dataframe object and minor APIs

* Adding reduce functionality

* Adding some print and making reduce work on current Ray

* Cleanup

* Added new functionality and docs.

* Adding more functionality.

* New functionality with older cleanup

* Complying with flake8 formatting

* Added tests and addressed reviewer comments

* Complying with flake8.

* Adding pandas to travis and requirements doc

* Fixing flake8 failures

* Fixing flake8 errors from imports

* Fixing import error

* Fixing import errors

* Addressing reviewer comments

* Addressing lint error
2017-12-20 09:31:22 -08:00
Cathy Wu
772527caa4 [rllib] Support 1-dimensional action spaces (PPO) (#1347)
* Small fix for supporting custom preprocessors

* PEP8

* Remove squeeze from actions
2017-12-19 14:17:06 -08:00
Eric Liang
6724f57b03 [Examples] Add Carla test env (#1343)
* add carla example

* add reward

* set obs

* Sun Dec 17 16:06:00 PST 2017

* add spec

* fix measurement

* add train script

* resize to 80x80

* null

* initial small training run

* robustify env, clean up action space

* clean up vars

* switch to town2 which is faster

* tunify train.py

* add discrete mode

* update

* fix excessive braking

* fix the weather

* rename

* redirect output and from future import

* doc

* update

* fix rebase

* allow dqn gpu growth

* adjust dqn hyperparams

* better ppo parameters
2017-12-19 12:57:58 -08:00
Melih Elibol
24b93b1123 fixes default type for product of empty shape. (#1341) 2017-12-18 17:41:44 -08:00
Eric Liang
47b1f02d3e [rllib] Pull out multi-gpu optimizer as a generic class (#1313) 2017-12-17 15:59:57 -08:00
Cathy Wu
53e736fe01 [rllib] Small fix for supporting custom preprocessors (#1334)
* Small fix for supporting custom preprocessors

* PEP8

* fix test
2017-12-17 04:37:29 -08:00
Eric Liang
bab44837e0 [tune] Tensorboard logger incorrectly reports training iteration as cur timestep value 2017-12-16 23:30:15 -08:00
Eric Liang
d21ea0ca45 Switch EC2 example config to use AWS deep learning AMI + latest Ray wheel (#1331)
* update

* install --user
2017-12-16 17:39:46 -08:00
Eric Liang
f5ea44338e EC2 cluster setup scripts and initial version of auto-scaler (#1311) 2017-12-15 23:56:39 -08:00
Eric Liang
fbf1806b8a [tune] Clean up result logging: move out of /tmp, add timestamp (#1297) 2017-12-15 14:19:08 -08:00
Stephanie Wang
12fdb3f53a Convert actor dummy objects to task execution edges. (#1281)
* Define execution dependencies flatbuffer and add to Redis commands

* Convert TaskSpec to TaskExecutionSpec

* Add execution dependencies to Python bindings

* Submitting actor tasks uses execution dependency API instead of dummy argument

* Fix dependency getters and some cleanup for fetching missing dependencies

* C++ convention

* Make TaskExecutionSpec a C++ class

* Convert local scheduler to use TaskExecutionSpec class

* Convert some pointers to references

* Finish conversion to TaskExecutionSpec class

* fix

* Fix

* Fix memory errors?

* Cast flatbuffers GetSize to size_t

* Fixes

* add more retries in global scheduler unit test

* fix linting and cast fbb.GetSize to size_t

* Style and doc

* Fix linting and simplify from_flatbuf.
2017-12-14 20:47:54 -08:00
Richard Liaw
c5c83a4465 [rllib] PPO and A3C unification (#1253) 2017-12-14 01:08:23 -08:00
Richard Liaw
cabbd27c56 [rllib] Support Nested Configuration Merging (#1268) 2017-12-13 14:39:01 -08:00
Robert Nishihara
f75b51d178 Register Common.error with local scheduler extension module. (#1316)
* Register Common.error with local scheduler extension module.

* Add test.
2017-12-13 11:55:54 -08:00
Richard Liaw
b6a35e0395 [rllib] Introduce pip install rllib (#1310)
* update setup

* more dependencies
2017-12-12 13:58:28 -08:00
Robert Nishihara
b1d89026cd Make ActorMethod fields private to fix tab completion. (#1312) 2017-12-12 10:07:33 -08:00
Peter Schafhalter
20d6b74aa6 [rllib] Added evaluation script to RLLib (#1295) 2017-12-11 11:59:44 -08:00
Robert Nishihara
96c46d35ff Tell Ray how to serialize FunctionSignature objects. (#1308) 2017-12-10 22:40:28 -08:00
Eric Liang
7009538321 Autodetect the number of GPUs when starting Ray. (#1293)
* autodetect

* Wed Dec  6 12:46:52 PST 2017

* Wed Dec  6 12:47:54 PST 2017

* Move GPU autodetection into services.py.

* Fix capitalization of Nvidia.

* Update documentation.
2017-12-09 15:30:16 -08:00
Robert Nishihara
6aae9a12fb Improve version checking at startup. (#1307)
* Check pyarrow version at startup.

* For version check, use absolute path to ray module.
2017-12-09 14:20:56 -08:00
Robert Nishihara
96463c680c Allow actor methods to return multiple object IDs. (#1296)
* Allow actor methods to return multiple object IDs.

* Add test.

* Fixes

* Remove outdated comment.

* Add comment and assert
2017-12-09 10:37:57 -08:00
Zongheng Yang
7e4a28f933 [rllib] Add tuned_examples/pong-ppo.yaml (#1302)
* Add tuned_examples/pong-ppo.yaml: 21 rew in ~3380sec

* Header comments
2017-12-09 01:20:22 -08:00
John Schulman
2606001a36 allow users to disable the webui (#1306)
* allow users to disable the webui

* Remove trailing whitespace.
2017-12-09 00:35:55 -08:00
Robert Nishihara
5adbdfecd0 Raise exception if pyarrow is imported before ray. (#1283)
* Raise exception if pyarrow is imported before ray.

* Pip install pyarrow when building doc so we don't have to mock it.

* Raise ImportError instead of Exception.
2017-12-08 03:34:54 -08:00
Richard Liaw
2e0eb0e4c7 [rllib] Adding dependencies (#1298) 2017-12-08 01:57:19 -08:00
Philipp Moritz
26125e1547 Fixing the jenkins tests (#1299)
* trying to fix jenkins tests

* comment out more tests

* remove pytorch stuff

* use non-monotonic clock (monotonic not supported on python 2.7)

* whitespace
2017-12-07 17:03:58 -08:00
Eric Liang
35f7398666 [rllib] Update RLlib docs and README (#1288)
Updates the rllib docs and README.
2017-12-06 18:17:51 -08:00
Eric Liang
2d543b6e19 [rllib] Refactor DQN to use an Evaluator abstraction (#1276)
This introduces rllib.Evaluator and rllib.Optimizer classes. Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph, and once implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
2017-12-06 17:51:57 -08:00
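
The Evaluator/Optimizer split described in the 2d543b6e19 message above might look something like the sketch below. It is a toy illustration of the idea only: the method names (`sample`, `compute_gradients`, `apply_gradients`, `get_weights`, `set_weights`), the `ToyEvaluator` model, and the `SyncOptimizer` strategy are all assumptions made for this example and are not taken from the PR. It also runs in a single process, with no Ray actors involved.

```python
from abc import ABC, abstractmethod


class Evaluator(ABC):
    """Hypothetical interface: owns the model, can sample data and learn."""

    @abstractmethod
    def sample(self): ...

    @abstractmethod
    def compute_gradients(self, batch): ...

    @abstractmethod
    def apply_gradients(self, grads): ...

    @abstractmethod
    def get_weights(self): ...

    @abstractmethod
    def set_weights(self, weights): ...


class ToyEvaluator(Evaluator):
    """Fits one scalar weight to a fixed target with squared-error loss."""

    def __init__(self, target, lr=0.5):
        self.target, self.lr, self.w = target, lr, 0.0

    def sample(self):
        return [self.target]

    def compute_gradients(self, batch):
        # Gradient of mean((w - y)^2) with respect to w.
        return sum(2 * (self.w - y) for y in batch) / len(batch)

    def apply_gradients(self, grads):
        self.w -= self.lr * grads

    def get_weights(self):
        return self.w

    def set_weights(self, weights):
        self.w = weights


class SyncOptimizer:
    """Hypothetical strategy: average worker gradients, apply them locally."""

    def __init__(self, local, workers):
        self.local, self.workers = local, workers

    def step(self):
        weights = self.local.get_weights()
        grads = []
        for worker in self.workers:
            worker.set_weights(weights)  # broadcast the current weights
            grads.append(worker.compute_gradients(worker.sample()))
        self.local.apply_gradients(sum(grads) / len(grads))


if __name__ == "__main__":
    local = ToyEvaluator(target=0.0)
    optimizer = SyncOptimizer(local, [ToyEvaluator(1.0), ToyEvaluator(3.0)])
    optimizer.step()
    print(local.get_weights())
```

Because `SyncOptimizer` only touches the `Evaluator` interface, a different strategy (for example, one that applies each worker's gradient as soon as it arrives) could be swapped in without changing `ToyEvaluator`.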
Robert Nishihara
c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Richard Liaw
483dee2ff3 [rllib] Generalizing A3C Sampling Classes (#1250) 2017-11-30 00:22:25 -08:00
Robert Nishihara
dd45664ab5 Bump version number to 0.3.0. (#1247) 2017-11-27 23:02:29 -08:00
Eric Liang
37831ae0c3 Add a nicer warning message when you pass the wrong thing to ray.wait() (#1239)
* add warnings

* fix python mode

* Small changes and add tests.

* Fix test failure.
2017-11-27 22:57:33 -08:00
Robert Nishihara
c1496b8111 Check version info in ray start for non-head nodes. (#1264)
* Check version info in ray start for non-head nodes.

* Small fix.

* Fix

* Push error to all drivers when worker has version mismatch.

* Linting

* Linting

* Fix

* Unify methods.

* Fix bug.
2017-11-27 22:03:38 -08:00
Richard Liaw
5e37cb8e16 Small PPO bug (#1265) 2017-11-27 17:52:25 -08:00
Robert Nishihara
f7c4f41df8 Change Python Redis client psubscribe -> subscribe. (#1261) 2017-11-26 23:29:37 -08:00
Robert Nishihara
2865128df0 Remove counter from run_function_on_all_workers. Also remove utilitie… (#1260)
* Remove counter from run_function_on_all_workers. Also remove utilities for copying directories across machines.

* Fix linting.
2017-11-26 18:29:10 -08:00
Robert Nishihara
0b4961b161 Provide flag for setting redis maxclients. (#1257)
* Add flag for attempting to increase ulimit -n and the redis maxclients.

* Don't bother trying to set ulimit -n.

* Fix linting.

* Add basic test.
2017-11-26 18:25:55 -08:00
Eric Liang
7fc2ddbaf7 Revert "[rllib] Use NoFilter instead of MeanStdFilter for PPO. (#1082)" (#1255)
This reverts commit 971becc905.
2017-11-26 16:00:46 -08:00
Robert Nishihara
e583d5a421 Give warnings for unimplemented Python mode methods. (#1256) 2017-11-26 13:11:12 -08:00