hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Devin Petersohn	4aca016bff	Adding series and a way to validate our API. (#1435 ) * Adding series and a way to validate our API. * Moving partitions into protected status	2018-01-21 19:20:54 -08:00
Stephanie Wang	74718efa73	Nondeterministic reconstruction for actors (#1344 ) * Add failing unit test for nondeterministic reconstruction * Retry scheduling actor tasks if reassigned to local scheduler * Update execution edges asynchronously upon dispatch for nondeterministic reconstruction * Fix bug for updating checkpoint task execution dependencies * Update comments for deterministic reconstruction * cleanup * Add (and skip) failing test case for nondeterministic reconstruction * Suppress test output	2018-01-21 13:44:13 -08:00
eugenevinitsky	37076a9ff8	Multiagent model using concatenated observations (#1416 ) * working multi action distribution and multiagent model * currently working but the splits arent done in the right place * added shared models * added categorical support and mountain car example * now compatible with generalized advantage estimation * working multiagent code with discrete and continuous example * moved reshaper to utils * code review changes made, ppo action placeholder moved to model catalog, all multiagent code moved out of fcnet * added examples in * added PEP8 compliance * examples are mostly pep8 compliant * removed all flake errors * added examples to jenkins tests * fixed custom options bug * added lines to let docker file find multiagent tests * shortened example run length * corrected nits * fixed flake errors	2018-01-18 19:51:31 -08:00
Richard Liaw	d4592382a4	[tune][minor] Fixes (#1383 )	2018-01-11 18:14:20 -08:00
Philipp Moritz	44792530a9	fix autoscaler test (#1411 )	2018-01-10 13:18:34 -08:00
Devin Petersohn	112ef07563	Adding all DataFrame methods with NotImplementedErrors (#1403 ) * Adding all DataFrame methods with NotImplementedErrors * Moving dataframe creation into function call	2018-01-07 12:00:16 -08:00
Eric Liang	b6c42f96be	Auto-scale ray clusters based on GCS load metrics (#1348 ) This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows: Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in a 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers. Note that we'll need to update the wheel in the example yaml file after this PR is merged.	2017-12-31 14:39:57 -08:00
Devin Petersohn	a75a473d7f	Add a distributed Dataframe API to Ray (#1330 ) * Adding dataframe object and minor APIs * Adding reduce functionality * Adding some print and making reduce work on current Ray * Cleanup * Added new functionality and docs. * Adding more functionality. * New functionality with older cleanup * Complying with flake8 formatting * Added tests and addressed reviewer comments * Complying with flake8. * Adding pandas to travis and requirements doc * Fixing flake8 failures * Fixing flake8 errors from imports * Fixing import error * Fixing import errors * Addressing reviewer comments * Addressing lint error	2017-12-20 09:31:22 -08:00
Eric Liang	47b1f02d3e	[rllib] Pull out multi-gpu optimizer as a generic class (#1313 )	2017-12-17 15:59:57 -08:00
Eric Liang	f5ea44338e	EC2 cluster setup scripts and initial version of auto-scaler (#1311 )	2017-12-15 23:56:39 -08:00
Eric Liang	fbf1806b8a	[tune] Clean up result logging: move out of /tmp, add timestamp (#1297 )	2017-12-15 14:19:08 -08:00
Robert Nishihara	f75b51d178	Register Common.error with local scheduler extension module. (#1316 ) * Register Common.error with local scheduler extension module. * Add test.	2017-12-13 11:55:54 -08:00
Peter Schafhalter	20d6b74aa6	[rllib] Added evaluation script to RLLib (#1295 )	2017-12-11 11:59:44 -08:00
Robert Nishihara	96463c680c	Allow actor methods to return multiple object IDs. (#1296 ) * Allow actor methods to return multiple object IDs. * Add test. * Fixes * Remove outdated comment. * Add comment and assert	2017-12-09 10:37:57 -08:00
Philipp Moritz	26125e1547	Fixing the jenkins tests (#1299 ) * trying to fix jenkins tests * comment out more tests * remove pytorch stuff * use non-monotonic clock (monotonic not supported on python 2.7) * whitespace	2017-12-07 17:03:58 -08:00
Eric Liang	2d543b6e19	[rllib] Refactor DQN to use an Evaluator abstraction (#1276 ) This introduces rllib.Evaluator and rllib.Optimizer classes. Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph, and once implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.	2017-12-06 17:51:57 -08:00
Robert Nishihara	c21e189371	Allow scheduling with arbitrary user-defined resource labels. (#1236 ) * Enable scheduling with custom resource labels. * Fix. * Minor fixes and ref counting fix. * Linting * Use .data() instead of .c_str(). * Fix linting. * Fix ResourcesTest.testGPUIDs test by waiting for workers to start up. * Sleep in test so that all tasks are submitted before any completes.	2017-12-01 11:41:40 -08:00
Eric Liang	37831ae0c3	Add a nicer warning message when you pass the wrong thing to ray.wait() (#1239 ) * add warnings * fix python mode * Small changes and add tests. * Fix test failure.	2017-11-27 22:57:33 -08:00
Robert Nishihara	2865128df0	Remove counter from run_function_on_all_workers. Also remove utilitie… (#1260 ) * Remove counter from run_function_on_all_workers. Also remove utilities for copying directories across machines. * Fix linting.	2017-11-26 18:29:10 -08:00
Robert Nishihara	0b4961b161	Provide flag for setting redis maxclients. (#1257 ) * Add flag for attempting to increase ulimit -n and the redis maxclients. * Don't bother trying to set ulimit -n. * Fix linting. * Add basic test.	2017-11-26 18:25:55 -08:00
Robert Nishihara	7af5292646	Give error if a worker has a version mismatch for Python Ray, or clou… (#1245 ) * Give error if a worker has a version mismatch for Python Ray, or cloudpickle. * Check version when attaching driver to cluster. * Only do check if the version info is present. * Bug fix. * Fix typo.	2017-11-23 23:31:03 -08:00
Robert Nishihara	477a40f76d	Prohibit returning actor handles and also update actor documentation. (#1246 ) * Prohibit returning actor handles and also update actor documentation. * Clarify documentation.	2017-11-23 09:37:24 -08:00
shane	9af8dc568a	testing with --rm and docker run (#1240 ) Add --rm to docker run for Jenkins tests.	2017-11-22 10:20:04 -08:00
Eric Liang	316f9e2bb7	[tune] Support user-defined trainable functions / classes / envs with a shared object registry (#1226 )	2017-11-20 17:52:43 -08:00
Eric Liang	9233e496cc	Raise exception when getting the task results of workers that died (#1224 ) * wip * with test * add timeout * also add test for f * remove on cleanup * update * wip * fix tests * mark actor removed in redis * clang-format * fix bug when no-inprogress tasks * try to set task status done * Add comment.	2017-11-20 15:18:39 -08:00
Eric Liang	28f1e12940	[rllib] [build-fix] ES iterations get unexpectedly long (#1235 ) * fix very long es * Revert prior change. * Shorten ES jenkins tests.	2017-11-20 14:42:42 -08:00
Robert Nishihara	0eae917766	[rllib] Clean up evolution strategies example. (#1225 ) * Remove ES observation statistics. * Consolidate policy classes. * Remove random stream. * Move rollout function out of policy. * Consolidate policy initialization. * Replace act implementation with sess.run. * Remove tf_utils. * Remove variable scope. * Remove unused imports. * Use regular TF session. * Use MeanStdFilter. * Minor. * Clarify naming. * Update documentation. * eps -> episodes * Report noiseless evaluation runs. * Clean up naming. * Update documentation. * Fix some bugs. * Make it run on atari. * Don't add action noise during evaluation runs. * Add ES to checkpoint/restore test. * Small cleanups and remove redundant calls to get_weights. * Remove outdated comment.	2017-11-16 21:58:30 -08:00
Richard Liaw	eadb998643	[tune] Make HyperBand Usable (#1215 )	2017-11-16 10:31:42 -08:00
Richard Liaw	71f8cd2403	[tune] Fixing up Hyperband (#1207 ) * Fixing up Hyperband * nit * cleanup * Timing test Added * added_exception_back * fixup_tests * reverse placement * fixes_and_tests * fix * fix * fixlint * cleanup_timing * lint * Update hyperband.py	2017-11-12 12:05:32 -08:00
Eric Liang	7c38f964b7	[tune] Add command line support for choosing early stopping schedulers (#1209 ) * command line support * add checkpoint freq * fix other flags * fix * docs * doc	2017-11-12 12:05:18 -08:00
Richard Liaw	afdc87323f	[rllib] PyTorch Models for A3C (#1187 ) * fixing policy * Compute Action is singular, fixed weird issue with arrays * remove vestige * extraneous ipdb * Can Drop in Pytorch Model * lint * introducing models * fix base policy * Missed this from last time * lint * removedolds * getting vision working * LINT * trying to fix test dependencies * requiremnets * try * tryconda * yes * shutup * flake_passes * changes * removing weight initializer for lstm for now * unused * adam * clip * zero * properscaling * weight * try * fix up pytorch visionnet * bias correction * fix model * same visionnet * matching_bad_things * test * try locking * fixing_linear * naming * lint * FORJENKINS * clouds * lint * Lint + removed dependencies * removed dependencies * format	2017-11-12 00:20:33 -08:00
Daniel Suo	4f0da6f81c	Add basic functionality for Cython functions and actors (#1193 ) * Add basic functionality for Cython functions and actors * Fix up per @pcmoritz comments * Fixes per @richardliaw comments * Fixes per @robertnishihara comments * Forgot double quotes when updating masked_log * Remove import typing for Python 2 compatibility	2017-11-09 17:49:06 -08:00
Richard Liaw	6197b260b8	Fix Jenkins issue introduced by Variant Generator (#1194 ) * try fix * shorten * added a flag * finish * Fix linting.	2017-11-09 00:56:20 -08:00
Eric Liang	52888e4c6f	[tune] Improve the tune Python API and variant generation (#1154 ) * new variant gen * wip * Sat Oct 21 18:21:34 PDT 2017 * update * comment * fix * update * update readme * fix * Update README.rst * Update README.rst * fix repeat * update * note on restore	2017-11-06 23:41:17 -08:00
Richard Liaw	6222ec3bd7	[tune] hyperband (#1156 ) * trial scheduler interface * remove * wip median stopping * remove * median stopping rule * update * docs * update * Revrt * update * hyperband untested * small changes before moving on * added endpoints * good changes * init tests * smore tests * unfinished tests * testing * testing code * morbugs * fixes * end * tests and typo * nit * try this * tests * testing * lint * lint * lint * comments and docs * almost screwed up * lint	2017-11-06 22:30:25 -08:00
Eric Liang	d06beacd84	[tune] Implement median stopping rule (#1170 ) * trial scheduler interface * remove * wip median stopping * remove * median stopping rule * update * docs * update * Revrt * update * comments * fix tesT	2017-11-03 11:25:02 -07:00
Robert Nishihara	3317d38278	Replace hostnames with numerical IP addresses in redis address. (#1177 ) * Replace hostnames with numerical IP addresses in redis address. * Also do conversion for node_ip_address. Add test. * Simplifications.	2017-11-01 17:13:22 -07:00
Robert Nishihara	6852e8839e	Expose custom serializers through the API. (#1147 ) * Expose custom serializers through the API. * minor renaming * Add test. * Remove comment. * Clean up assertions.	2017-10-29 00:08:55 -07:00
Richard Liaw	797f4fcbf3	Fixing Lint after flake upgrade (#1162 ) * Fixing Lint after flake upgrade * more lint fixes	2017-10-26 21:02:07 -05:00
Eric Liang	cd9dc398ff	[rllib] Support discrete observation spaces such as FrozenLake-v0 (#1140 ) * add * remove transform_shape * fix test * fix	2017-10-23 23:16:52 -07:00
Richard Liaw	0c9817fa76	[tune] Tune Pausing (#1136 ) * fix yaml bug * add ext agent * gpus * update * tuning * docs * Sun Oct 15 21:09:25 PDT 2017 * lint * update * Sun Oct 15 22:39:55 PDT 2017 * Sun Oct 15 22:40:17 PDT 2017 * Sun Oct 15 22:43:06 PDT 2017 * Sun Oct 15 22:46:06 PDT 2017 * Sun Oct 15 22:46:21 PDT 2017 * Sun Oct 15 22:48:11 PDT 2017 * Sun Oct 15 22:48:44 PDT 2017 * Sun Oct 15 22:49:23 PDT 2017 * Sun Oct 15 22:50:21 PDT 2017 * Sun Oct 15 22:53:00 PDT 2017 * Sun Oct 15 22:53:34 PDT 2017 * Sun Oct 15 22:54:33 PDT 2017 * Sun Oct 15 22:54:50 PDT 2017 * Sun Oct 15 22:55:20 PDT 2017 * Sun Oct 15 22:56:56 PDT 2017 * Sun Oct 15 22:59:03 PDT 2017 * fix * Update tune_mnist_ray.py * remove script trial * fix * reorder * fix ex * py2 support * upd * comments * comments * cleanup readme * fix trial * annotate * Update rllib.rst * init pausing * Docs, Lint * fix danglings and restore endpoint moved to trialrunner * renaming * nit * start always starts from checkpoint * smalls * nits * lint * last change	2017-10-22 23:04:15 -07:00
Eric Liang	81ca27dc08	[rllib] [minor] Rename agent_id to experiment_tag (#1143 ) * tagstr * doc * rename * fix test	2017-10-22 18:44:18 -07:00
Stephanie Wang	af47737bd5	Prototype distributed actor handles (#1137 ) * Add actor handle ID to the task spec * Local scheduler dispatches actor tasks according to a task counter per handle * Fix python test * Allow passing actor handles into tasks. Not completely working yet. Also this is very messy. * Fixes, should be roughly working now. * Refactor actor handle wrapper * Fix __init__ tests * Terminate actor when the original handle goes out of scope * TODO and a couple test cases * Make tests for unsupported cases * Fix Python mode tests * Linting. * Cache actor definitions that occur before ray.init() is called. * Fix export actor class * Deterministically compute actor handle ID * Fix __getattribute__ * Fix string encoding for python3 * doc * Add comment and assertion.	2017-10-19 23:49:59 -07:00
Eric Liang	5a50e0e1d7	[rllib] Add the ability to run arbitrary Python scripts with ray.tune (#1132 ) * fix yaml bug * add ext agent * gpus * update * tuning * docs * Sun Oct 15 21:09:25 PDT 2017 * lint * update * Sun Oct 15 22:39:55 PDT 2017 * Sun Oct 15 22:40:17 PDT 2017 * Sun Oct 15 22:43:06 PDT 2017 * Sun Oct 15 22:46:06 PDT 2017 * Sun Oct 15 22:46:21 PDT 2017 * Sun Oct 15 22:48:11 PDT 2017 * Sun Oct 15 22:48:44 PDT 2017 * Sun Oct 15 22:49:23 PDT 2017 * Sun Oct 15 22:50:21 PDT 2017 * Sun Oct 15 22:53:00 PDT 2017 * Sun Oct 15 22:53:34 PDT 2017 * Sun Oct 15 22:54:33 PDT 2017 * Sun Oct 15 22:54:50 PDT 2017 * Sun Oct 15 22:55:20 PDT 2017 * Sun Oct 15 22:56:56 PDT 2017 * Sun Oct 15 22:59:03 PDT 2017 * fix * Update tune_mnist_ray.py * remove script trial * fix * reorder * fix ex * py2 support * upd * comments * comments * cleanup readme * fix trial * annotate * Update rllib.rst	2017-10-18 11:49:28 -07:00
Eric Liang	802941994d	[rllib] Use RLlib preprocessors in DQN (fixes PongDeterministic-v4) (#1124 ) * fix pong * rename * update	2017-10-14 20:16:36 -07:00
Stephanie Wang	15486a14a0	Refactor actor task queues (#1118 ) * Refactor add_task_to_actor_queue into queue_actor_task and insert_actor_task_queue * Refactor actor task queue to share the waiting task queue * Fix	2017-10-13 20:52:11 -07:00
Eric Liang	79ea205b3e	[rllib] Initial work on integrating hyperparameter search tool (#1107 ) * clean up train * update * update train script * add tuned examples * add agent catalog * add tune lib * update * fix * testS * remove * train docs * comments * todo * fix resource parsing * fix cr test * add test * try to fix travis test	2017-10-13 16:18:16 -07:00
Stephanie Wang	3764f2f2e1	Actor checkpointing with object lineage reconstruction (#1004 ) * Worker reports error in previous task, actor task counter is incremented after task is successful * Refactor actor task execution - Return new task counter in GetTaskRequest - Update worker state for actor tasks inside of the actor method executor * Manually invoked checkpoint method * Scheduling for actor checkpoint methods * Fix python bugs in checkpointing * Return task success from worker to local scheduler instead of actor counter * Kill local schedulers halfway through actor execution instead of waiting for all tasks to execute once * Remove redundant actor tasks during dispatch, reconstruct missing dependencies for actor tasks * Make executor for temporary actor methods * doc * Set default argument for whether the previous task was a success * Refactor actor method call * Simplify checkpoint task submission * lint * fix philipp's comments * Add missing line * Make actor reconstruction tests run faster * Unimportant whitespace. * Unimportant whitespace. * Update checkpoint method signature * Documentation and handle exceptions during checkpoint save/resume * Rename get_task message field to actor_checkpoint_failed * Fix bug. * Remove debugging check, redirect test output	2017-10-12 09:53:32 -07:00
Robert Nishihara	7a954f4b5f	Use monotonic clock for some python tests. (#1112 )	2017-10-11 19:58:59 -07:00
Robert Nishihara	a52a1e893f	Automatically set CUDA_VISIBLE_DEVICES when worker gets task. (#1044 ) * Automatically set CUDA_VISIBLE_DEVICES when worker gets task. * Add test.	2017-10-06 18:38:08 -07:00

358 commits
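
The message for commit b6c42f96be (#1348) describes the auto-scaling rule in prose. The following is a minimal sketch of that rule as stated in the message, not the code from the commit: the function names and the dictionary-based inputs are hypothetical, and only target_utilization_fraction, min_workers, max_workers and idle_timeout_minutes are taken from the message.

```python
import math


def approx_used_nodes(num_nodes, used, total):
    """Estimate "used" nodes from the bottleneck resource (may be fractional).

    `used` and `total` are hypothetical dicts mapping a resource name to a
    cluster-wide amount, e.g. {"CPU": 64, "GPU": 8}.
    """
    fractions = [used.get(r, 0) / total[r] for r in total if total[r] > 0]
    return num_nodes * max(fractions, default=0.0)


def target_cluster_size(used_nodes, target_utilization_fraction,
                        min_workers, max_workers):
    """Scale by 1 / target_utilization_fraction, round up, then clamp."""
    ideal = math.ceil(used_nodes / target_utilization_fraction)
    return max(min_workers, min(max_workers, ideal))


def nodes_to_terminate(idle_minutes_by_node, idle_timeout_minutes, min_workers):
    """Pick idle nodes to remove without dropping below min_workers."""
    remaining = len(idle_minutes_by_node)
    doomed = []
    for node, idle_minutes in idle_minutes_by_node.items():
        if remaining <= min_workers:
            break
        if idle_minutes > idle_timeout_minutes:
            doomed.append(node)
            remaining -= 1
    return doomed


# The example from the commit message: 8/8 GPUs busy, all CPUs idle, 8 nodes.
used_nodes = approx_used_nodes(8, used={"CPU": 0, "GPU": 8},
                               total={"CPU": 64, "GPU": 8})
size = target_cluster_size(used_nodes, target_utilization_fraction=0.5,
                           min_workers=2, max_workers=12)
```

Here the GPU is the bottleneck, so all 8 nodes count as used even though the CPUs are idle. With an assumed target utilization of 0.5 the ideal size would be 16, which the max_workers constraint caps at 12.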