This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
From the current (instantaneous) load information, we compute the approximate number of "used workers". This count follows the bottleneck resource: e.g., if 8/8 GPUs are in use in an 8-node cluster but all of the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop then launches new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
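For illustration, here is a minimal sketch of the target-size computation described above. The function name and signature are hypothetical, not the actual autoscaler API; min_workers is omitted since it only constrains idle-node removal.

```python
import math

def target_num_workers(num_used_workers, target_utilization_fraction,
                       max_workers):
    """Hypothetical sketch: scale the (possibly fractional) used-worker
    count by 1 / target_utilization_fraction, round up, and cap the
    result at max_workers."""
    desired = int(math.ceil(num_used_workers / target_utilization_fraction))
    return min(max_workers, desired)

# Example: 6.5 of 8 workers busy with a 0.8 utilization target
# -> ceil(6.5 / 0.8) = ceil(8.125) = 9 workers requested.
print(target_num_workers(6.5, 0.8, max_workers=20))  # 9
```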
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
* revamp saving
* smaller jpgs
* hide verbose
* Tue Dec 19 22:25:01 PST 2017
* make sure temp dirs sort lexicographically
* save total reward too
* zero-pad i
* 160x160 dqn
* ever higher res dqn
* Adding dataframe object and minor APIs
* Adding reduce functionality
* Adding some print statements and making reduce work on current Ray
* Cleanup
* Added new functionality and docs.
* Adding more functionality.
* New functionality with older cleanup
* Complying with flake8 formatting
* Added tests and addressed reviewer comments
* Complying with flake8.
* Adding pandas to Travis and the requirements doc
* Fixing flake8 failures
* Fixing flake8 errors from imports
* Fixing import error
* Fixing import errors
* Addressing reviewer comments
* Addressing lint error
* Define execution dependencies flatbuffer and add to Redis commands
* Convert TaskSpec to TaskExecutionSpec
* Add execution dependencies to Python bindings
* Submitting actor tasks now uses the execution dependency API instead of a dummy argument
* Fix dependency getters and some cleanup for fetching missing dependencies
* C++ convention
* Make TaskExecutionSpec a C++ class
* Convert local scheduler to use TaskExecutionSpec class
* Convert some pointers to references
* Finish conversion to TaskExecutionSpec class
* fix
* Fix
* Fix memory errors?
* Cast flatbuffers GetSize to size_t
* Fixes
* add more retries in global scheduler unit test
* fix linting and cast fbb.GetSize to size_t
* Style and doc
* Fix linting and simplify from_flatbuf.
* Raise exception if pyarrow is imported before ray.
* Pip install pyarrow when building doc so we don't have to mock it.
* Raise ImportError instead of Exception.
* trying to fix Jenkins tests
* comment out more tests
* remove PyTorch stuff
* use non-monotonic clock (monotonic clocks are not supported on Python 2.7)
* whitespace
This introduces the rllib.Evaluator and rllib.Optimizer classes. An Optimizer encapsulates a particular distributed optimization strategy for RL. An Evaluator encapsulates the model graph; once implemented, any Optimizer can be "plugged in" to any algorithm that implements the Evaluator interface.
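As a rough sketch of the split, the interfaces might look like the following. The method names (sample, compute_gradients, apply_gradients, step) are illustrative assumptions, not necessarily the exact rllib signatures.

```python
# Hypothetical sketch of the Evaluator/Optimizer split; method names
# are assumptions for illustration, not the exact rllib interfaces.

class Evaluator(object):
    """Encapsulates the model graph for one RL algorithm."""

    def sample(self):
        """Return a batch of experience sampled with the current policy."""
        raise NotImplementedError

    def compute_gradients(self, samples):
        """Return gradients of the loss over the given batch."""
        raise NotImplementedError

    def apply_gradients(self, grads):
        """Apply the given gradients to the local model."""
        raise NotImplementedError


class Optimizer(object):
    """Encapsulates a distributed optimization strategy for RL."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local_evaluator = local_evaluator
        self.remote_evaluators = remote_evaluators

    def step(self):
        """Run one round of optimization, e.g. collect gradients from
        the remote evaluators and apply them to the local evaluator."""
        raise NotImplementedError
```

Because the Optimizer only touches Evaluators through this narrow interface, any distributed strategy (synchronous gradient averaging, asynchronous updates, etc.) can be reused across algorithms that implement it.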
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Check version info in ray start for non-head nodes.
* Small fix.
* Fix
* Push error to all drivers when worker has version mismatch.
* Linting
* Linting
* Fix
* Unify methods.
* Fix bug.