This tests the case in which a worker is blocked in a call to ray.get or ray.wait and then dies. Later, the object that the worker was waiting for becomes available. We need to make sure that we do not try to send a message to the dead worker and crash as a result. Related to #2790. A reproduction sketch follows the commit list below.
* Trigger reconstruction in ray.wait and mark worker as blocked.
* Add test.
* Linting.
* Don't run new test with legacy Ray.
* Only call HandleClientUnblocked if it actually blocked in ray.wait.
* Reduce time to ray.wait in the test.
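A minimal reproduction sketch of the scenario described above, assuming the standard Ray Python API; the step that kills the worker is left as a comment because it is test-harness specific:

```python
import time
import ray

ray.init()

@ray.remote
def slow_value():
    # The object produced here only becomes available after a delay.
    time.sleep(5)
    return 1

@ray.remote
def blocked_wait(object_ids):
    # This worker blocks in ray.wait on an object that is not yet ready.
    # If the worker is killed while blocked, the backend must not crash
    # when the object later becomes available.
    ray.wait(object_ids)

x_id = slow_value.remote()
# Pass the ID inside a list so it is not resolved to a value before the call.
blocked_wait.remote([x_id])
# ... kill the worker running blocked_wait while it is blocked (test-specific) ...
ray.get(x_id)  # The object becomes available; nothing should crash.
```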
* Use cmake to build the Ray project, so build.sh no longer needs to be applied before cmake; fix some misuse of cmake and improve build performance.
* Support Boost as an external project, avoiding the system Boost or the one built by build.sh.
* Stay compatible with build.sh; remove the Boost and Arrow builds from it.
* Bugfix: the Parquet Bison version check and the plasma_java library install problem.
* Bugfix: cmake, do not compile the Plasma Java client when it is not needed.
* Bugfix: the component failures test's timeout mechanism was broken for the plasma manager failure case.
* Bugfix: Arrow uses lib64 on CentOS, and Travis's check-git-clang-format-output.sh does not support branches other than master.
* Revert some fixes.
* Set the Arrow Python executable; fix a format error in component_failures_test.py.
* Clean the Arrow Python build directory.
* Update cmake code style; restore support for cmake minimum version 3.4.
Add a new search algorithm (genetic) along with the base framework of the searcher, which performs basic jobs such as logging, recording, and organizing results in our project.
Note that this is the initial commit. In the following days, we will add examples, unit tests, and other refinements.
This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass.
The following should happen when a driver exits (either gracefully or ungracefully); a test-style sketch of the actor cleanup follows the list.
* #2797 should be enabled and pass.
* Any actors created by the driver that are still running should be killed.
* Any workers running tasks for the driver should be killed.
* Any tasks for the driver in any node_manager queues should be removed.
* Any future tasks received by a node manager for the driver should be ignored.
* The driver death notification should only be received once.
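A test-style sketch of the actor cleanup behavior, run as a separate driver process so that its exit triggers the cleanup; the Redis address and script contents are illustrative only:

```python
import subprocess
import sys

# A short-lived driver: it creates an actor and then exits. Per the list
# above, the actor (and any of the driver's queued or running tasks)
# should be cleaned up once this process terminates.
driver_script = """
import ray
ray.init(redis_address="127.0.0.1:6379")  # illustrative address

@ray.remote
class Dummy:
    def ping(self):
        return "alive"

a = Dummy.remote()
ray.get(a.ping.remote())
# Driver exits here, gracefully.
"""

subprocess.check_call([sys.executable, "-c", driver_script])
# A second driver connected to the same cluster should now observe that
# the actor is gone and that no tasks for the dead driver remain queued.
```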
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
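A minimal sketch of the check being added; the thresholds and names here are assumptions for illustration, not the exact RLlib implementation:

```python
import logging

logger = logging.getLogger(__name__)

def check_sample_waste(num_collected, num_used,
                       warn_frac=0.5, error_frac=0.9):
    """Warn (or raise) when most collected samples are discarded as stragglers."""
    wasted_frac = (num_collected - num_used) / max(num_collected, 1)
    if wasted_frac > error_frac:
        raise ValueError(
            "{:.0%} of the collected samples were discarded as stragglers; "
            "check the PPO batch size configuration.".format(wasted_frac))
    if wasted_frac > warn_frac:
        logger.warning(
            "{:.0%} of the collected samples were discarded as "
            "stragglers.".format(wasted_frac))
```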
Basically a re-implementation of #2281, with the modifications from #2298 (a fix of #2334), redone here because of rebasing issues.
[+] Implement sharding for gcs tables.
[+] Keep the ClientTable and ErrorTable managed by the primary_shard. The TaskTable is managed by the primary_shard for now, until a good hashing scheme for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and the bool `sharding` argument from RedisContext's Connect into AsyncGcsClient. This may make the interface cleaner.
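A rough Python model of the sharding scheme described above, assuming keys are routed by hashing; the actual implementation is in the C++ GCS client and may differ in detail:

```python
import hashlib

class ShardedGcsClient:
    """Toy model: route table operations to one of several Redis shards."""

    def __init__(self, primary_shard, other_shards):
        # ClientTable, ErrorTable (and, for now, TaskTable) stay on the
        # primary shard; everything else is spread over the other shards.
        self.primary_shard = primary_shard
        self.other_shards = other_shards

    def shard_for(self, table_name, key):
        if table_name in ("ClientTable", "ErrorTable", "TaskTable"):
            return self.primary_shard
        digest = hashlib.sha1(key.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.other_shards)
        return self.other_shards[index]
```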
* Convert multi_node_test.py to pytest (the general conversion pattern is sketched after this list).
* Convert array_test.py to pytest.
* Convert failure_test.py to pytest.
* Convert microbenchmarks to pytest.
* Convert component_failures_test.py to pytest and some minor quotes changes.
* Convert tensorflow_test.py to pytest.
* Convert actor_test.py to pytest.
* Fix.
* Fix.
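For reference, the conversion pattern applied across these files looks roughly like this (a hypothetical before/after, where `f` stands in for the code under test):

```python
def f():
    # Stand-in for the code under test.
    return 1

# Before: unittest style.
import unittest

class ActorTest(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(f(), 1)

# After: pytest style -- plain test functions and bare asserts.
def test_basic():
    assert f() == 1
```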
* Limit number of concurrent workers started by hardware concurrency.
* Check if std::thread::hardware_concurrency() returns 0 (see the sketch after this list).
* Pass in max concurrency from Python.
* Fix Java call to startRaylet.
* Fix typo
* Remove unnecessary cast.
* Fix linting.
* Cleanups on Java side.
* Comment back in actor test.
* Require maximum_startup_concurrency to be at least 1.
* Fix linting and test.
* Improve documentation.
* Fix typo.
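The Python-level equivalent of the fallback logic described above (the actual change is in the C++ raylet, where std::thread::hardware_concurrency() may return 0; os.cpu_count() similarly may return None):

```python
import os

def maximum_startup_concurrency(requested):
    """Cap the number of workers started at once by hardware concurrency,
    and require the result to be at least 1."""
    hardware = os.cpu_count() or 1  # fall back to 1 if undetermined
    return max(1, min(requested, hardware))
```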
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as the sum of all resource demand induced by queued ready tasks on the local raylet; this provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load (a small sketch follows this list)
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
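A small sketch of the load measure from the third bullet; the task and queue representations here are illustrative, not the raylet's actual data structures:

```python
def compute_local_load(ready_queue):
    """Sum resource demand over queued ready tasks on the local raylet.

    Each task declares a resource demand such as {"CPU": 1, "GPU": 0.5};
    the load reported in heartbeats is the element-wise sum of these
    demands, rather than a bare task count.
    """
    load = {}
    for task in ready_queue:
        for resource, amount in task["resource_demand"].items():
            load[resource] = load.get(resource, 0.0) + amount
    return load
```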
A bunch of minor RLlib fixes:
* pull in the latest baselines Atari wrapper changes (and use the DeepMind wrapper by default)
* move reward clipping to the policy evaluator
* add an A2C variant of A3C
* reduce the vision network's fc layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in the tune status
* Move all ObjectManager members to bottom of class def
* Better Pull requests (a sketch of the retry/cancel policy follows this list)
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients
* increase object manager Pull timeout
* Make the component failure test harder.
* note
* Notify SubscribeObjectLocations caller of empty list
* Address Melih's comments
* Fix wait...
* Make component failure test easier for legacy ray
* lint
* Log a warning on remote object manager failures
* Mark a task that failed to be forwarded as pending
* Raylet component failure test and make it harder
* Turn on component failure test for xray
* Remove return status from ReleaseSender
* lint
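A simplified sketch of the Pull retry/cancel policy from the second bullet above; the real logic lives in the C++ ObjectManager, and the timer and RPC hooks here are placeholders:

```python
import random

class PullManager:
    """Toy model of the object manager's Pull bookkeeping."""

    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.active_pulls = {}  # object_id -> client currently being asked

    def pull(self, object_id, client_locations):
        if not client_locations:
            # Cancel: the object no longer appears on any client.
            self.active_pulls.pop(object_id, None)
            return
        if object_id in self.active_pulls:
            # Suppress duplicate Pulls for the same object.
            return
        # Ask one client, and retry at another client after a timeout.
        client = random.choice(client_locations)
        self.active_pulls[object_id] = client
        self._send_pull_request(object_id, client)
        self._schedule_retry(object_id, self.timeout_s)

    def _send_pull_request(self, object_id, client):
        pass  # Placeholder for the actual transfer request.

    def _schedule_retry(self, object_id, delay_s):
        pass  # Placeholder for a timer that re-invokes pull() with fresh locations.
```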
## What do these changes do?
This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE` (a minimal sketch follows this list). This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
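For point 1, a minimal sketch of expiring lease entries with Redis `PEXPIRE`, using redis-py purely for illustration; the key format and lease length are hypothetical:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def try_acquire_task_lease(task_id, node_id, lease_ms=1000):
    """Record that node_id intends to re-execute task_id, and let the entry
    expire automatically so that other nodes can retry after lease_ms."""
    key = "TASK_LEASE:{}".format(task_id)
    # SET with nx=True only takes the lease if nobody else currently holds it.
    if r.set(key, node_id, nx=True):
        r.pexpire(key, lease_ms)
        return True
    return False
```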
Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.
Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
* Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
* Add an AsyncSamplesOptimizer that implements the IMPALA architecture
* Integrate V-trace with the A3C policy graph (the V-trace target is given below)
* Audit the V-trace integration
* Benchmark against A3C, with V-trace on and off

Benchmarked PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
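For reference, the V-trace n-step target being integrated is the one from the IMPALA paper (Espeholt et al., 2018):

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big(\prod_{i=s}^{t-1} c_i\Big)\, \delta_t V, \qquad \delta_t V = \rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),$$

with truncated importance weights $\rho_t = \min\!\big(\bar{\rho},\, \pi(a_t \mid x_t)/\mu(a_t \mid x_t)\big)$ and $c_i = \min\!\big(\bar{c},\, \pi(a_i \mid x_i)/\mu(a_i \mid x_i)\big)$, where $\mu$ is the behavior policy used by the sampling workers and $\pi$ is the learner's current policy.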
The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.
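A sketch of the kind of recursive config merge referred to here; the function name and keys are hypothetical, not tune's actual internals:

```python
def deep_merge(base, overrides):
    """Recursively merge override keys into a base config dict.

    Merging (rather than replacing) nested dicts means that overriding a
    single subkey, e.g. {"env_config": {"frameskip": 4}}, does not wipe
    out the other keys needed when computing an agent's resource request.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```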
* Raise an application-level exception for actor methods that can't be executed and for failed tasks (a usage sketch follows this list).
* Retry task forwarding for actor tasks.
* Small cleanups
* Move constant to ray_config.
* Create ForwardTaskOrResubmit method.
* Minor
* Clean up queued tasks for dead actors.
* Some cleanups.
* Linting
* Notify task_dependency_manager_ about failed tasks.
* Manage timer lifetime better.
* Use smart pointers to deallocate the timer.
* Fix
* Add comment.
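A usage-level sketch of the first bullet; the exception class shown (ray.exceptions.RayActorError) is how this surfaces in current Ray and is used here purely for illustration:

```python
import ray

ray.init()

@ray.remote
class Worker:
    def do_work(self):
        return 42

w = Worker.remote()
ref = w.do_work.remote()

# If the actor has died (for example, its node failed), the method cannot
# be executed; per this change, ray.get raises an application-level
# exception instead of hanging forever.
try:
    print(ray.get(ref))
except ray.exceptions.RayActorError as err:
    print("actor method could not be executed:", err)
```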