* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.
* fix
* add example
* Allow passing in --object-store-memory to ray start.
* Allow setting ports for the redis shards.
* Reorder arguments and infer number of shards from ports.
* Move code block into only the head node case.
* Add test.
* spillback policy implementation: global + local scheduler
* modernize global scheduler policy state; factor out random number engine and generator
* Minimal version.
* Fix test.
* Make load balancing test less strenuous.
* Expose calls to get and set the actor frontier
* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded
* Prototype for actor checkpointing
* Filter out duplicate tasks on the local scheduler
* Clean up some of the Python checkpointing code
* More cleanups
* Documentation
* cleanup and fix unit test
* Allow remote checkpoint calls through actor handle
* Check whether object is local before reconstructing
* Enable checkpointing for distributed actor handles, refactor tests
* Fix local scheduler tests
* lint
* Address comments
* lint
* Skip tests that fail on new GCS
* style
* Don't put same object twice when setting the actor frontier
* Address Philipp's comments, cleaner fbs naming
* patch up pbt
* add pbt
* clean up test
* review
* try out a ppo example
* some tweaks to ppo example
* add postprocess hook
* clean up custom explore fn
* improve tune doc
* concepts
* update humanoid
* fix example
* show error file
Adds a Population-Based Training scheduler (as described in https://arxiv.org/abs/1711.09846) to Ray.tune. It currently mutates hyperparameters either by resampling from a user-defined list of possible values (necessary when a hyperparameter can only take certain values, e.g. sgd_batch_size) or by perturbing the current value by a factor of 0.8 or 1.2.
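A minimal sketch of that mutation rule, assuming a hypothetical `mutate_config` helper and a `hyperparam_mutations` dict; neither name is the actual Tune API:

```python
import random

def mutate_config(config, hyperparam_mutations):
    """Apply the PBT mutation rule sketched above (illustrative only)."""
    new_config = dict(config)
    for key, allowed_values in hyperparam_mutations.items():
        if allowed_values is not None:
            # Hyperparameters restricted to certain values
            # (e.g. sgd_batch_size) are resampled from the explicit list.
            new_config[key] = random.choice(allowed_values)
        else:
            # Continuous hyperparameters are perturbed by 0.8x or 1.2x.
            new_config[key] = config[key] * random.choice([0.8, 1.2])
    return new_config

# Example: sgd_batch_size may only take listed values; lr is continuous.
print(mutate_config(
    {"sgd_batch_size": 128, "lr": 1e-3},
    {"sgd_batch_size": [32, 64, 128, 256], "lr": None}))
```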
* Bring cloudpickle version 0.5.2 inside the repo.
* Use internal copy of cloudpickle everywhere.
* Fix linting.
* Import ordering.
* Change __init__.py.
* Set pickler in serialization context.
* Don't check ray location.
Remove the rllib dependency: Trainable is now a standalone abstract class that can be easily subclassed.
Clean up HyperBand: fix the debug string and add an example.
Remove the YAML API / ScriptRunner: this was never really used.
Move ray.init() out of run_experiments(): this provides greater flexibility and should be less confusing, since there is no longer an implicit init() done there. Note that this is a breaking API change for tune.
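A sketch of the resulting flow under these changes; the Trainable method names (`_setup`, `_train`) and result keys here are assumptions based on the description, not a verbatim copy of the Tune API:

```python
import ray
from ray.tune import Trainable, register_trainable, run_experiments

class MyTrainable(Trainable):
    """Standalone Trainable subclass; no rllib dependency required."""

    def _setup(self):
        self.iterations = 0

    def _train(self):
        # One unit of training; returns a result dict for the scheduler.
        self.iterations += 1
        return {"mean_accuracy": self.iterations / 100.0}

register_trainable("my_trainable", MyTrainable)

# Breaking change: ray.init() is no longer called implicitly inside
# run_experiments(), so callers now do it explicitly.
ray.init()
run_experiments({
    "my_experiment": {
        "run": "my_trainable",
        "stop": {"mean_accuracy": 0.5},
    },
})
```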
* Add failing unit test for nondeterministic reconstruction
* Retry scheduling actor tasks if reassigned to local scheduler
* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction
* Fix bug for updating checkpoint task execution dependencies
* Update comments for deterministic reconstruction
* cleanup
* Add (and skip) failing test case for nondeterministic reconstruction
* Suppress test output
* working multi action distribution and multiagent model
* currently working but the splits aren't done in the right place
* added shared models
* added categorical support and mountain car example
* now compatible with generalized advantage estimation
* working multiagent code with discrete and continuous example
* moved reshaper to utils
* code review changes made; PPO action placeholder moved to the model catalog; all multiagent code moved out of fcnet
* added examples in
* added PEP8 compliance
* examples are mostly pep8 compliant
* removed all flake errors
* added examples to jenkins tests
* fixed custom options bug
* added lines to let the Dockerfile find multiagent tests
* shortened example run length
* corrected nits
* fixed flake errors
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
Based on the current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource: e.g., if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
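A rough sketch of the target-size arithmetic described above; all names here are illustrative, not the actual autoscaler internals:

```python
import math

def target_num_workers(used_resources, total_per_node, num_nodes,
                       target_utilization_fraction, min_workers,
                       max_workers):
    """Illustrative version of the scaling rule (names are assumptions)."""
    # "Used workers" is driven by the bottleneck resource: e.g. with
    # 8/8 GPUs busy, the cluster counts as fully used even if all CPUs
    # are idle. The result may be fractional.
    used_fraction = max(
        used_resources[r] / (total_per_node[r] * num_nodes)
        for r in total_per_node)
    used_workers = used_fraction * num_nodes
    # Scale by 1 / target_utilization_fraction, round up, and clamp to
    # the [min_workers, max_workers] range.
    target = math.ceil(used_workers / target_utilization_fraction)
    return min(max(target, min_workers), max_workers)

# 8/8 GPUs used on 8 nodes, CPUs idle, 0.8 target utilization:
# ceil(8 / 0.8) = 10 nodes (subject to max_workers).
print(target_num_workers(
    {"CPU": 0, "GPU": 8}, {"CPU": 4, "GPU": 1}, 8, 0.8, 0, 20))
```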
* Adding dataframe object and minor APIs
* Adding reduce functionality
* Adding some print statements and making reduce work on current Ray
* Cleanup
* Added new functionality and docs.
* Adding more functionality.
* New functionality with older cleanup
* Complying with flake8 formatting
* Added tests and addressed reviewer comments
* Complying with flake8.
* Adding pandas to travis and requirements doc
* Fixing flake8 failures
* Fixing flake8 errors from imports
* Fixing import error
* Fixing import errors
* Addressing reviewer comments
* Addressing lint error
* trying to fix jenkins tests
* comment out more tests
* remove pytorch stuff
* use non-monotonic clock (monotonic not supported on Python 2.7)
* whitespace
This introduces the rllib.Evaluator and rllib.Optimizer classes. Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph; once an Evaluator is implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
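A condensed sketch of that separation; the method names below approximate the interface described and are not a verbatim copy:

```python
class Evaluator:
    """Encapsulates the model graph for one algorithm."""

    def sample(self):
        # Collect a batch of experience from the environment.
        raise NotImplementedError

    def compute_gradients(self, samples):
        raise NotImplementedError

    def apply_gradients(self, grads):
        raise NotImplementedError

    def get_weights(self):
        raise NotImplementedError

    def set_weights(self, weights):
        raise NotImplementedError

class LocalSyncOptimizer:
    """One distributed strategy; any Evaluator can be plugged in."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local = local_evaluator
        self.remotes = remote_evaluators

    def step(self):
        # Broadcast weights, gather samples, and apply gradients locally.
        for ev in self.remotes:
            ev.set_weights(self.local.get_weights())
        for batch in [ev.sample() for ev in self.remotes]:
            self.local.apply_gradients(self.local.compute_gradients(batch))
```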
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Give error if a worker has a version mismatch for Python, Ray, or cloudpickle.
* Check version when attaching driver to cluster.
* Only do check if the version info is present.
* Bug fix.
* Fix typo.