1058 commits

Author SHA1 Message Date
Robert Nishihara
1fe49d7676 Simplify testMultipleLocalSchedulers by having it start only one worker. (#789) 2017-07-31 17:44:45 -07:00
Eric Liang
b6a18cb39b [rllib] Also refactor DQN to use shared RLlib models (#730)
* wip

* works with cartpole

* lint

* fix pg

* comment

* action dist rename

* preprocessor

* fix test

* typo

* fix the action[0] nonsense

* revert

* satisfy the lint

* wip

* works with cartpole

* lint

* fix pg

* comment

* action dist rename

* preprocessor

* fix test

* typo

* fix the action[0] nonsense

* revert

* satisfy the lint

* Minor indentation changes.

* fix merge

* add humanoid

* initial dqn refactor

* remove tfutil

* fix calls

* fix tf errors 1

* closer

* runs now

* lint

* tensorboard graph

* fix linting

* more 4 space

* fix

* fix linT

* more lint

* oops

* es parity

* remove example.py

* fix training bug

* add cartpole demo

* try fixing cartpole

* allow model options, configure cartpole

* debug

* simplify

* no dueling

* avoid out of file handles

* Test dqn in jenkins.

* Minor formatting.

* fix issue

* fix another

* Fix problem in which we log to a directory that hasn't been created.
2017-07-26 12:29:00 -07:00
Robert Nishihara
8ad9ced99b Fix task ID hash computation. (#774) 2017-07-26 10:08:38 -07:00
alanamarzoev
0f0acb8ac1 CPU Time Series. (#765)
Add time series of CPU utilization to web UI.
2017-07-26 00:15:50 -07:00
Yeolar
31329d43dd fixtypo: plasma_protocol (#764)
Fix typo in plasma_protocol.
2017-07-22 17:52:27 -07:00
Robert Nishihara
ff996330e8 If a worker dies unexpectedly, then let it exit. (#762) 2017-07-21 06:36:25 +00:00
Robert Nishihara
13000b7503 Start processes using the same version of Python that was used to start Ray. (#760)
* Make local scheduler start workers using the same version of Python that was used to start the local scheduler.

* Use current version of python to start new processes instead of hardcoded python executable.

* Fix linting.
2017-07-21 00:05:10 +00:00
alanamarzoev
c31c20ca9c Code toggling instructions. (#757) 2017-07-20 10:51:33 -07:00
alanamarzoev
853b2913b7 Task duration distribution plot. (#743)
* Task duration distribution plot.

* Fixed bug.

* Changed axis labels.

* Modify task start point.

* Modified task_profiles func to decode in ascii.

* Nvm

* Changed to double quotes and added comments.

* fixed linting

* Fixed linting.

* Fixed bug.
2017-07-19 23:15:17 -07:00
Philipp Moritz
d356dd3ec4 [rllib] Expose algorithm parameters and tune policy gradient parameters for humanoid (#753)
* parameters for humanoid

* fix
2017-07-19 16:45:05 -07:00
Philipp Moritz
ade6d80820 [rllib] use ray.wait to speed up parallel simulations for policy gradients (#754)
* use ray.wait to speed up parallel simulations for policy gradients

* linting
2017-07-19 16:09:15 -07:00
alanamarzoev
2b3190ad13 Chrome trace timeline with sliders. (#731)
* Trace timeline with sliders.

* Trace.

* Switched ujson to json.

* Fixed tests.

* linting fixes

* Fixed bug.

* Cleaned up code.

* Fixes according to comments.

* removed checkpoints.

* Undid accidental delete.

* Fixed linting error.

* Added documentation to notebook.

* Undid accidental deletes.

* Add comments and small formatting fixes.

* Small fix.
2017-07-17 19:59:49 -07:00
Eric Liang
420013774c [rllib] Pull out shared models for evolution strategies and policy gradient. (#719)
* wip

* works with cartpole

* lint

* fix pg

* comment

* action dist rename

* preprocessor

* fix test

* typo

* fix the action[0] nonsense

* revert

* satisfy the lint

* wip

* works with cartpole

* lint

* fix pg

* comment

* action dist rename

* preprocessor

* fix test

* typo

* fix the action[0] nonsense

* revert

* satisfy the lint

* Minor indentation changes.

* fix merge

* add humanoid

* fix linting

* more 4 space

* fix

* fix linT

* oops

* es parity
2017-07-17 08:58:54 +00:00
Crystal
8fc7dc3ed4 Change Python examples in documentation to use 4 space indentation. (#736)
* Ray doc - changed python indentation to 4 spaces in documentation files actors.rst, api.rst, and example-*.rst

* Ray documentation - changed Python to 4 space indentation for files install-*.rst, installation-troubleshooting.rst, internals-overview.rst, serialization.rst, troubleshooting.rst, tutorial.rst, using-ray-*.rst
2017-07-16 22:19:33 -07:00
Eric Liang
86a7909149 make es worker count independent (#740) 2017-07-16 16:23:56 -07:00
Robert Nishihara
80e8426b5e Test example applications and rllib in jenkins tests. (#707)
* Test example applications in Jenkins.

* Fix default upload_dir argument for Algorithm class.

* Fix evolution strategies.

* Comment out policy gradient example which doesn't seem to work.

* Set --env-name for evolution strategies.
2017-07-16 18:51:33 +00:00
Robert Nishihara
4349f1f966 Fix broken links in example documentation. (#732) 2017-07-14 20:31:53 +00:00
Robert Nishihara
e0867c8845 Switch Python indentation from 2 spaces to 4 spaces. (#726)
* 4 space indentation for actor.py.

* 4 space indentation for worker.py.

* 4 space indentation for more files.

* 4 space indentation for some test files.

* Check indentation in Travis.

* 4 space indentation for some rl files.

* Fix failure test.

* Fix multi_node_test.

* 4 space indentation for more files.

* 4 space indentation for remaining files.

* Fixes.
2017-07-13 21:53:57 +00:00
Robert Nishihara
310ba82131 Use miniconda for all travis tests. (#728)
* Use miniconda for all travis tests.

* Fix.

* Fix.
2017-07-13 16:23:04 +00:00
Philipp Moritz
c24c07613c [rllib] unify writing performance metrics and make it queryable (#708)
* write config to s3

* add train file

* write performance to S3

* writing needs to be fixed, replacing result.json at the moment

* update

* add experiment_id

* more logging and example queries

* update

* add info

* fill in other algorithms

* fix linting

* convert readme to rst

* fixes

* simplejson -> json

* make files executable

* edit README.rst

* unify storing logs in S3 and on local filesystem

* use 'info' entry in TrainingResult for algorithm specific info

* don't install smart_open with ray

* fixes

* linting fixes
2017-07-11 01:36:14 +02:00
alanamarzoev
8464d77c76 Change event logs to store one Redis ZSET per worker. (#705)
* Changing to zset

* Fixed bug.

* Fixed another bug.

* Modified task_profiles.

* Removed extra file.

* Modified task_profiles test.

* WIP

* WIP

* Undid changes

* Updated

* WIP

* Made changes according to comments.

* Removed unneeded print.

* Removed ujson usage.

* failing test

* tests passing

* Fixed linting errors and modified style.

* Fixed bug.

* Fixed linting

* Fixed according to comments.

* Redis crashing?

* Fixed linting

* Fixed linting
2017-07-09 01:42:29 +02:00
Eric Liang
cd12ea7e09 [rllib] Pull out the GPU-parallel optimizer from policy gradients to common (#711)
* refactor

* docs

* cleanup

* clean up more

* Update parallel.py

* add imports from future
2017-07-07 22:20:02 +00:00
Robert Nishihara
5b3d0c00f2 Create /tmp/ray directory in services.py. (#715) 2017-07-07 18:41:56 +00:00
Eric Liang
f012e597c2 [rllib] Basic port of baselines/deepq to rllib (#709)
* rllib v0

* fix imports

* lint

* comments

* update docs

* a3c wip

* a3c wip

* report stats

* update doc

* add common logdir attr

* name is too long

* fix small bug

* propagate exception on error

* fetch metrics

* initial port

* fix lint

* add right license

* port to common alg format

* fix lint

* rename dqn

* add imports from future

* fix lint
2017-07-07 18:37:00 +00:00
Robert Nishihara
6c45657280 Reset the SIGCHLD handler after forking a worker to avoid influencing the worker. (#713) 2017-07-07 14:50:37 +00:00
Eric Liang
66734847bb [rllib] Standardize writing output logs and other files to /tmp/ray (#706)
* rllib v0

* fix imports

* lint

* comments

* update docs

* a3c wip

* a3c wip

* report stats

* update doc

* add common logdir attr

* name is too long

* fix small bug

* propagate exception on error

* fetch metrics

* fix small nits
2017-07-03 16:01:47 +00:00
alanamarzoev
2b11a7bca2 Add task ID and object ID search boxes to web UI. (#704)
* Task search box.

* Cleaned up.

* Small reformatting.

* Add object table search box.
2017-07-01 17:48:23 -04:00
alanamarzoev
716469160e Enable dumping profiling information to timeline format viewable by chrome tracing. (#703)
* Chrome tracing timeline.

* Modified decode statement.

* Some cleanups and add test.

* Remove example.

* Fix test.
2017-06-30 12:14:11 -04:00
Eric Liang
2d81edfcdc [rllib] Move a3c implementation from examples/ to python/ray/rllib/ (#698)
* rllib v0

* fix imports

* lint

* comments

* update docs

* a3c wip

* a3c wip

* report stats

* update doc

* name is too long

* fix small bug

* propagate exception on error

* fetch metrics

* fix lint
2017-06-29 15:49:56 +00:00
Robert Nishihara
efce49cfbc Bump version to 0.1.2 in preparation for uploading wheels to PyPI. (#700) 2017-06-27 04:35:42 +00:00
Robert Nishihara
1941e0f7b1 Fix compilation on CentOS. (#699) 2017-06-26 05:54:21 +00:00
Robert Nishihara
0926550661 Remove -mtune and -march compiler flags. (#697) 2017-06-26 05:52:45 +00:00
Eric Liang
a674ec958c [rllib] Move policy gradient and evolution strategies algorithms from examples/ to ray/rllib/ (#694)
* rllib v0

* fix imports

* lint

* comments

* update docs
2017-06-25 22:13:03 +00:00
Robert Nishihara
8bc9c275fa Increase the number of log file names and handle errors better in log monitor. (#693) 2017-06-25 05:20:50 +00:00
Robert Nishihara
ad480f8165 Don't reconstruct all objects in every fetch request in local scheduler. (#686)
* Don't reconstruct all objects in every fetch request in local scheduler.

* Separate out fetch timer and reconstruction timer.

* Fix bug.

* Bug fix.

* Fix naming convention for global variables.

* Address comments.

* Make reconstruct_counter a static variable.

* Fix linting.

* Redo reconstruct handler using a set of objects to fetch.

* Fix linting.

* Replace set with vector.
2017-06-23 21:08:02 +00:00
alanamarzoev
e16df6da9a Updated task_profiles function to avoid future repetitive parsing. (#691)
* Updated task_profiles function to avoid future repetitive parsing.

* Fix indentation.

* Fixed according to comments.

* Included updated test for task_profiles function.

* Simplify test.

* Fix indentation.

* Fix.
2017-06-22 19:21:18 -07:00
Robert Nishihara
2d636d9278 Kill jupyter in ray stop. (#689)
* Kill jupyter in ray stop.

* Terminate jupyter notebook in ray stop.

* Fix linting.
2017-06-21 05:58:34 +00:00
Robert Nishihara
5bb07cb01b Remove old UI code. (#688) 2017-06-21 05:54:21 +00:00
Robert Nishihara
5ebc2f3f2e Do resource bookkeeping for actor methods. (#682)
* Dispatch regular and actor tasks when resources become available.

* Make actor methods do resource bookkeeping and add test.

* Remove unnecessary field.

* Fix linting.

* Fix actor test.

* Maintain set of actors with pending tasks to speed up task dispatch.

* Exit early from task dispatch if there are no resources available.

* Fix linting.

* Fix error.

* Fix bug related to iterator invalidation.

* When an actor is removed, remove it from the set of actors with pending tasks.
2017-06-21 05:52:45 +00:00
alanamarzoev
ed9380d73d Automatically start web UI in ray.init(). (#687)
* Start up webui on ray.init

* Removed .ipynb checkpoint folders.

* Removed print statements in cleanup function.

* Fixed

* Removed extra file.

* Cleaned up ui.

* Don't start browser automatically in ray.init(), also copy the notebook every time so that changes don't persist.

* Update setup.py and installation instructions to install jupyter.

* Don't automatically install jupyter, don't start the UI if jupyter is not installed.

* Improve error message when failing to start UI.
2017-06-20 10:32:55 -07:00
Robert Nishihara
3052ce25a6 Divide up large fetch requests from local scheduler, also print warni… (#683)
* Divide up large fetch requests from local scheduler, also print warning if fetch handler is slow.

* Fix linting.

* Fix typo.
2017-06-19 22:57:51 +00:00
Robert Nishihara
9e4a3e4972 Replace some UT data structures in local scheduler with C++ STL. (#680)
* Replace a local scheduler ut_array with a std::vector.

* Replace vector of sizes in local scheduler with std::pair.

* Remove utarray include.

* Replace utarray with std::vector for reading local scheduler input messages.

* Remove more UT data structures.

* Remove UT includes.

* Fix linting.

* Include stdlib.h to find size_t.

* Remove includes of stdbool.h.

* Replace std::pair with TaskQueueEntry.

* Fix redis tests.

* Reinstate tests.
2017-06-19 21:58:42 +00:00
Philipp Moritz
9bcaaaeaf5 Debugging for policy gradients (#681)
* configuration option for tensorflow debugger

* add model checkpointing

* fix linting

* make it possible to run without checkpointing

* fix

* loading from checkpoint and expose debugger through cli

* todo for filters

* Fix typo.
2017-06-18 17:58:41 -07:00
Robert Nishihara
f12db5f0e2 Divide large plasma requests into smaller chunks, and wait longer before reissuing large requests. (#678)
* Divide large get requests into smaller chunks.

* Divide fetches into smaller chunks.

* Wait longer in worker and manager before reissuing fetch requests if there are many outstanding fetch requests.

* Log warning if a handler in the local scheduler or plasma manager takes more than one second.
2017-06-18 04:42:15 +00:00
alanamarzoev
4d5ac9dad5 Include object size and hash in the table returned by the object_table function in the GlobalStateAPI. (#665)
* added log_table function and a test

* fixed log_files and added task_profiles

* fixed formatting

* fixed linting errors

* fixes

* removed file

* more fixes

* hopefully fixed

* Small changes.

* Fix linting.

* Fix bug in log monitor.

* Small changes.

* Fix bug in travis.

* Including data_size and hash in the ResultTableReply.

* Included data_size and hash info in object_table.

* Fixed bugs in ray_redis_module.cc.

* Removing commented out code.

* Fixes

* Freed hash and data_size strings after using, and checked if they're null along with task_id and is_put.

* Changed it so that data_size is set correctly.

* Removed iostream import.

* Included a check to ensure that the Redis string to long long conversion was successful.

* Included separate data_size and hash null checks.

* Fixed bug.

* Made linting changes.

* Another linting error.

* Slight simplification.
2017-06-16 23:17:11 -07:00
Robert Nishihara
019ba07e9c Correct actor class name and module. (#675)
* Correct actor class name and module.

* Add test.

* Fix linting.
2017-06-17 05:44:42 +00:00
Robert Nishihara
96962cdee0 Log fatal error if plasma manager or local scheduler heartbeats take too long. (#676)
* Log fatal error if plasma manager or local scheduler take too long to send heartbeat.

* Fix linting.

* Use int64_t for milliseconds since unix epoch.
2017-06-16 19:11:01 +00:00
Alexey Tumanov
8317025987 reducing the size of objects created for the global scheduler test (#674) 2017-06-15 10:02:46 -07:00
Philipp Moritz
8798f4e690 fix flaky mac os x plasma store component_failure_test (#673)
Fix flaky mac os x plasma store component_failure_test.
2017-06-15 00:31:50 -07:00
Philipp Moritz
c343df832e use multiple threads for memcpy (#669) 2017-06-14 19:14:24 -07:00