hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Author	SHA1	Message	Date
Robert Nishihara	eda6ebb87d	Convert some unittests to pytest. (#2779 ) * Convert multi_node_test.py to pytest. * Convert array_test.py to pytest. * Convert failure_test.py to pytest. * Convert microbenchmarks to pytest. * Convert component_failures_test.py to pytest and some minor quotes changes. * Convert tensorflow_test.py to pytest. * Convert actor_test.py to pytest. * Fix. * Fix	2018-08-31 11:24:15 -07:00
Richard Liaw	0347e6418b	[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708 )	2018-08-30 16:18:56 -07:00
Robert Nishihara	32f7d6fcf5	Add back some tests for xray. (#2772 )	2018-08-30 11:07:23 -07:00
Robert Nishihara	132f133214	Limit number of concurrent workers started by hardware concurrency. (#2753 ) * Limit number of concurrent workers started by hardware concurrency. * Check if std:🧵:hardware_concurrency() returns 0. * Pass in max concurrency from Python. * Fix Java call to startRaylet. * Fix typo * Remove unnecessary cast. * Fix linting. * Cleanups on Java side. * Comment back in actor test. * Require maximum_startup_concurrency to be at least 1. * Fix linting and test. * Improve documentation. * Fix typo.	2018-08-29 14:53:40 +08:00
Robert Nishihara	b7722897b4	Deprecate 'driver_mode' argument. (#2758 ) * Deprecate 'driver_mode' argument. * Fix * Fix	2018-08-28 16:45:49 -07:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
Richard Liaw	dbba7f2a53	[autoscaler] Cleanup Logging (#2709 ) Moves Autoscaler onto Python `logging` module.	2018-08-25 17:08:45 -07:00
Eric Liang	fbe6c59f72	[rllib] Misc fixes, A2C (#2679 ) A bunch of minor rllib fixes: pull in latest baselines atari wrapper changes (and use deepmind wrapper by default) move reward clipping to policy evaluator add a2c variant of a3c reduce vision network fc layer size to 256 units switch to 84x84 images doc tweaks print timesteps in tune status	2018-08-20 15:28:03 -07:00
Robert Nishihara	aaf5456b3d	Add test that tasks sent to actor on dead node raise exceptions. (#2626 ) * Add actor failure test. * Minor change. * Make test harder. * Change numbers a bit. * Skip test for non xray.	2018-08-16 22:48:31 -07:00
Eric Liang	6670880f03	[rllib] Workaround actor creation hang edge case for ape-X (#2661 ) * apex hang * fix * move pyt to end	2018-08-16 18:03:50 -07:00
Yuhong Guo	eeb15771ba	Add `ray.internal.free` (#2542 )	2018-08-14 22:01:23 -07:00
Stephanie Wang	806fdf2f05	[xray] Object manager retries Pull requests (#2630 ) * Move all ObjectManager members to bottom of class def * Better Pull requests - suppress duplicate Pulls - retry the Pull at the next client after a timeout - cancel a Pull if the object no longer appears on any clients * increase object manager Pull timeout * Make the component failure test harder. * note * Notify SubscribeObjectLocations caller of empty list * Address melih's comments * Fix wait... * Make component failure test easier for legacy ray * lint	2018-08-13 19:15:55 -07:00
Stephanie Wang	4a7be6f46d	[xray] Make sure raylet does not crash if remote raylet dies (#2619 ) * Log a warning on remote object manager failures * Mark a task that was failed to be forwarded as pending * Raylet component failure test and make it harder * Turn on component failure test for xray * Remove return status from ReleaseSender * lint	2018-08-09 20:36:30 -07:00
Stephanie Wang	d49b4bef0a	[xray] Basic task reconstruction mechanism (#2526 ) ## What do these changes do? This implements basic task reconstruction in raylet. There are two parts to this PR: 1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary. 2. Task resubmission once a raylet becomes responsible for reconstructing a task. Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this: 1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR. 2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted). Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.	2018-08-09 07:24:37 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289 ) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Stephanie Wang	6ab01a2cad	[xray] Fix bug when counting a task's lineage size (#2600 )	2018-08-08 00:00:17 -07:00
Yuhong Guo	9825da7233	Change training tasks to xray for Jenkins tests (#2567 )	2018-08-06 13:35:26 -07:00
Yuhong Guo	d2ebe4d9a3	Fix frequent failure of Jenkins CI. (#2490 )	2018-08-02 10:28:28 -07:00
Philipp Moritz	d8ba667175	Convert asserts in unittest to pytest (#2529 )	2018-08-01 22:32:10 -07:00
Eric Liang	9ea57c2a93	[rllib] Basic IMPALA implementation (using deepmind's reference vtrace.py) (#2504 ) Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer Add AsyncSamplesOptimizer that implements the IMPALA architecture integrate V-trace with a3c policy graph audit V-trace integration benchmark compare vs A3C and with V-trace on/off PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.	2018-08-01 20:53:53 -07:00
Robert Nishihara	909d7172b1	Introduce constant for ID_SIZE in python code. (#2517 )	2018-07-31 12:40:53 -07:00
Eric Liang	38d00986a5	[rllib] Cleanups: deep merge configs properly; enforce min iter time on APEX (#2500 ) The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.	2018-07-30 13:25:35 -07:00
Philipp Moritz	696a229ece	Fix text verbosity in python 2.7 by running tests with pytest (#2470 )	2018-07-30 11:04:06 -07:00
Robert Nishihara	2be1ccbd8f	Raise application-level exceptions for some failure scenarios. (#2429 ) * Raise application level exception for actor methods that can't be executed and failed tasks. * Retry task forwarding for actor tasks. * Small cleanups * Move constant to ray_config. * Create ForwardTaskOrResubmit method. * Minor * Clean up queued tasks for dead actors. * Some cleanups. * Linting * Notify task_dependency_manager_ about failed tasks. * Manage timer lifetime better. * Use smart pointers to deallocate the timer. * Fix * add comment	2018-07-27 19:53:30 -04:00
Shuo	29451cca82	Add test: running a driver for twice. (#2464 )	2018-07-27 00:57:52 -07:00
Eric Liang	68660453e4	[rllib] Better support and add two-trainer example for multiagent (#2443 ) This adds a simple DQN+PPO example for multi-agent. We don't do anything fancy here, just syncing weights between two separate trainers. This potentially is wasting some compute, but is very simple to set up. It might be nice to share experience collection between the top-level trainers in the future.	2018-07-22 05:09:25 -07:00
Hao Chen	05f485e274	Allow Ray API to be used from multiple threads (#2422 )	2018-07-20 15:39:01 -07:00
Eric Liang	807f309b3a	[test] Fix broken rllib test (#2446 ) This fixes the broken build.	2018-07-20 13:47:41 -07:00
Eric Liang	8e75d150f7	[rllib] Apex crash when compress_observations: False (#2426 ) We shouldn't try to decompress uncompressed data. Also, fix resource requests for ddpg + GPU.	2018-07-19 15:58:09 -07:00
Richard Liaw	8e8c733696	[tune] Fix Categorical Space + Add Keras Example (#2401 ) Previously did not properly resolve categorical variables for HyperOpt.	2018-07-17 23:52:52 +02:00
Eric Liang	0cecf6b79c	[rllib] Cleanup RNN support and make it work with multi-GPU optimizer (#2394 ) Cleanup: TFPolicyGraph now automatically adds loss input entries for state_in_*, so that graph sub-classes don't need to worry about it. Multi-GPU support: Allow setting up model tower replicas with existing state input tensors Truncate the per-device minibatch slices so that they are always a multiple of max_seq_len.	2018-07-17 06:55:46 +02:00
Robert Nishihara	515da7721a	Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374 ) * Change ray.worker.cleanup -> ray.shutdown and improve API documentation. * Deprecate ray.worker.cleanup() gracefully. * Fix linting	2018-07-12 12:00:00 -07:00
Eric Liang	b316afeb43	[rllib] Add debug info back to PPO and fix optimizer compatibility (#2366 )	2018-07-12 19:22:46 +02:00
Richard Liaw	0048e77093	[rllib] RLlib CLI (#2375 )	2018-07-12 19:12:04 +02:00
Robert Nishihara	54487b1d7f	Pin the number of CPUs in failing actor test. (#2368 ) * Pin the number of CPUs in failing actor test. * Pin number of CPUs in multi_node_test.py. * Fix linting.	2018-07-11 18:34:19 -07:00
Richard Liaw	4d7da9f668	[rllib] Remove "Common", cleanup some code (#2348 )	2018-07-08 13:03:53 -07:00
Robert Nishihara	e3534c46df	[xray] Re-enable some stress tests and convert stress_tests to pytest. (#2285 ) * Fix one of the stress tests, fix ray.global_state.client_table when called early on. * Re-enable testWait. * Convert stress_tests.py to pytest. * Fix	2018-07-06 23:21:00 -07:00
Robert Nishihara	b90e551b41	[xray] Implement timeline and profiling API. (#2306 ) * Add profile table and store profiling information there. * Code for dumping timeline. * Improve color scheme. * Push timeline events on driver only for raylet. * Improvements to profiling and timeline visualization * Some linting * Small fix. * Linting * Propagate node IP address through profiling events. * Fix test. * object_id.hex() should return byte string in python 2. * Include gcs.fbs in node_manager.fbs. * Remove flatbuffer definition duplication. * Decode to unicode in Python 3 and bytes in Python 2. * Minor * Submit profile events in a batch. Revert some CMake changes. * Fix * Workaround test failure. * Fix linting * Linting * Don't return anything from chrome_tracing_dump when filename is provided. * Remove some redundancy from profile table. * Linting * Move TODOs out of docstring. * Minor	2018-07-04 23:23:48 -07:00
Richard Liaw	f0ed1c1674	[rllib] Add more regression tests and autogenerate (#2324 )	2018-07-02 08:20:53 -07:00
Eric Liang	8aa56c12e6	[rllib] Document "v2" APIs (#2316 ) * re * wip * wip * a3c working * torch support * pg works * lint * rm v2 * consumer id * clean up pg * clean up more * fix python 2.7 * tf session management * docs * dqn wip * fix compile * dqn * apex runs * up * impotrs * ddpg * quotes * fix tests * fix last r * fix tests * lint * pass checkpoint restore * kwar * nits * policy graph * fix yapf * com * class * pyt * vectorization * update * test cpe * unit test * fix ddpg2 * changes * wip * args * faster test * common * fix * add alg option * batch mode and policy serving * multi serving test * todo * wip * serving test * doc async env * num envs * comments * thread * remove init hook * update * fix ppo * comments1 * fix * updates * add jenkins tests * fix * fix pytorch * fix * fixes * fix a3c policy * fix squeeze * fix trunc on apex * fix squeezing for real * update * remove horizon test for now * multiagent wip * update * fix race condition * fix ma * t * doc * st * wip * example * wip * working * cartpole * wip * batch wip * fix bug * make other_batches None default * working * debug * nit * warn * comments * fix ppo * fix obs filter * update * wip * tf * update * fix * cleanup * cleanup * spacing * model * fix * dqn * fix ddpg * doc * keep names * update * fix * com * docs * clarify model outputs * Update torch_policy_graph.py * fix obs filter * pass thru worker index * fix * rename * vlad torch comments * fix log action * debug name * fix lstm * remove unused ddpg net * remove conv net * revert lstm * wip * wip * cast * wip * works * fix a3c * works * lstm util test * doc * clean up * update * fix lstm check * move to end * fix sphinx * fix cmd * remove bad doc * envs * vec * doc prep * models * rl * alg * up * clarify * copy * async sa * fix * comments * fix a3c conf * tune lstm * fix reshape * fix * back to 16 * tuned a3c update * update * tuned * optional * merge * wip * fix up * move pg class * rename env * wip * update * tip * alg * readme * fix catalog * readme * doc * context * remove prep * comma * add env * link to paper * paper * update * rnn * update * wip * clean up ev creation * fix * fix * fix * fix lint * up * no comma * ma * Update run_multi_node_tests.sh * fix * sphinx is stupid * sphinx is stupid * clarify torch graph * no horizon * fix config * sb * Update test_optimizers.py	2018-07-01 00:05:08 -07:00
Philipp Moritz	762bdf646e	[xray] Put GCS data into the redis data shard (#2298 )	2018-06-30 15:42:10 -10:00
Richard Liaw	3cc27d2840	[rllib][asv] Support ASV for RLlib (#2304 )	2018-06-28 17:20:09 -07:00
Adam Gleave	89460b8d11	autoscaler: count head node, don't kill below target (fixes #2317 ) (#2320 ) Specifically, subtracts 1 from the target number of workers, taking into account that the head node has some computational resources. Do not kill an idle node if it would drop us below the target number of nodes (in which case we just immediately relaunch).	2018-06-28 15:33:51 -07:00
Eric Liang	b197c0c404	[rllib] General RNN support (#2299 ) * wip * cls * re * wip * wip * a3c working * torch support * pg works * lint * rm v2 * consumer id * clean up pg * clean up more * fix python 2.7 * tf session management * docs * dqn wip * fix compile * dqn * apex runs * up * impotrs * ddpg * quotes * fix tests * fix last r * fix tests * lint * pass checkpoint restore * kwar * nits * policy graph * fix yapf * com * class * pyt * vectorization * update * test cpe * unit test * fix ddpg2 * changes * wip * args * faster test * common * fix * add alg option * batch mode and policy serving * multi serving test * todo * wip * serving test * doc async env * num envs * comments * thread * remove init hook * update * fix ppo * comments1 * fix * updates * add jenkins tests * fix * fix pytorch * fix * fixes * fix a3c policy * fix squeeze * fix trunc on apex * fix squeezing for real * update * remove horizon test for now * multiagent wip * update * fix race condition * fix ma * t * doc * st * wip * example * wip * working * cartpole * wip * batch wip * fix bug * make other_batches None default * working * debug * nit * warn * comments * fix ppo * fix obs filter * update * wip * tf * update * fix * cleanup * cleanup * spacing * model * fix * dqn * fix ddpg * doc * keep names * update * fix * com * docs * clarify model outputs * Update torch_policy_graph.py * fix obs filter * pass thru worker index * fix * rename * vlad torch comments * fix log action * debug name * fix lstm * remove unused ddpg net * remove conv net * revert lstm * wip * wip * cast * wip * works * fix a3c * works * lstm util test * doc * clean up * update * fix lstm check * move to end * fix sphinx * fix cmd * remove bad doc * clarify * copy * async sa * fix * comments * fix a3c conf * tune lstm * fix reshape * fix * back to 16 * tuned a3c update * update * tuned * optional * fix catalog * remove prep	2018-06-27 22:51:04 -07:00
Eric Liang	1251abf0d1	[rllib] Modularize Torch and TF policy graphs (#2294 ) * wip * cls * re * wip * wip * a3c working * torch support * pg works * lint * rm v2 * consumer id * clean up pg * clean up more * fix python 2.7 * tf session management * docs * dqn wip * fix compile * dqn * apex runs * up * impotrs * ddpg * quotes * fix tests * fix last r * fix tests * lint * pass checkpoint restore * kwar * nits * policy graph * fix yapf * com * class * pyt * vectorization * update * test cpe * unit test * fix ddpg2 * changes * wip * args * faster test * common * fix * add alg option * batch mode and policy serving * multi serving test * todo * wip * serving test * doc async env * num envs * comments * thread * remove init hook * update * fix ppo * comments1 * fix * updates * add jenkins tests * fix * fix pytorch * fix * fixes * fix a3c policy * fix squeeze * fix trunc on apex * fix squeezing for real * update * remove horizon test for now * multiagent wip * update * fix race condition * fix ma * t * doc * st * wip * example * wip * working * cartpole * wip * batch wip * fix bug * make other_batches None default * working * debug * nit * warn * comments * fix ppo * fix obs filter * update * wip * tf * update * fix * cleanup * cleanup * spacing * model * fix * dqn * fix ddpg * doc * keep names * update * fix * com * docs * clarify model outputs * Update torch_policy_graph.py * fix obs filter * pass thru worker index * fix * rename * vlad torch comments * fix log action * debug name * fix lstm * remove unused ddpg net * remove conv net * revert lstm * cast * clean up * fix lstm check * move to end * fix sphinx * fix cmd * remove bad doc * clarify * copy * async sa * fix	2018-06-26 13:17:15 -07:00
Eric Liang	a9a26b7560	[rllib] Part 2 of multiagent support (#2286 ) * wip * cls * re * wip * wip * a3c working * torch support * pg works * lint * rm v2 * consumer id * clean up pg * clean up more * fix python 2.7 * tf session management * docs * dqn wip * fix compile * dqn * apex runs * up * impotrs * ddpg * quotes * fix tests * fix last r * fix tests * lint * pass checkpoint restore * kwar * nits * policy graph * fix yapf * com * class * pyt * vectorization * update * test cpe * unit test * fix ddpg2 * changes * wip * args * faster test * common * fix * add alg option * batch mode and policy serving * multi serving test * todo * wip * serving test * doc async env * num envs * comments * thread * remove init hook * update * fix ppo * comments1 * fix * updates * add jenkins tests * fix * fix pytorch * fix * fixes * fix a3c policy * fix squeeze * fix trunc on apex * fix squeezing for real * update * remove horizon test for now * multiagent wip * update * fix race condition * fix ma * t * doc * st * wip * example * wip * working * cartpole * wip * batch wip * fix bug * make other_batches None default * working * debug * nit * warn * comments * fix ppo * fix obs filter * update * fix obs filter * pass thru worker index * fix * fix log action * debug name * fix sphinx	2018-06-25 22:33:57 -07:00
Robert Nishihara	800f7cc77d	Make actor handles work in Python mode. (#2283 ) * Make actor handles work in local mode. * Add test for actor handles in local mode.	2018-06-20 23:02:41 -07:00
Robert Nishihara	ff2217251f	[xray] Add error table and push error messages to driver through node manager. (#2256 ) * Fix documentation indentation. * Add error table to GCS and push error messages through node manager. * Add type to error data. * Linting * Fix failure_test bug. * Linting. * Enable one more test. * Attempt to fix doc building. * Restructuring * Fixes * More fixes. * Move current_time_ms function into util.h.	2018-06-20 21:29:28 -07:00
Robert Nishihara	18ee044f03	Re-enable some actor tests. (#2276 )	2018-06-20 14:42:35 -07:00
Zongheng Yang	8190ff1fd0	Experimental: enable automatic GCS flushing with configurable policy. (#2266 ) * build_credis.sh: use an up-to-date credis commit. * build_credis.sh: leveldb is updated, so update build cmds for it * WIP: make monitor.py issue flush; switch gcs client to use credis * Experimental: enable automatic GCS flushing with configurable policy. * Fix linux compilation error * Fix leveldb build * Use optimized build for credis * Address comments * Attempt to fix tests	2018-06-20 14:40:57 -07:00

1 2 3 4 5 ...

435 commits