hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 04:46:38 -04:00

Author	SHA1	Message	Date
Eric Liang	813f51769f	[rllib] Fix rllib rollouts script and add test (#3211 ) ## What do these changes do? Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py ## Related issue number https://github.com/ray-project/ray/issues/3206 https://github.com/ray-project/ray/issues/3204	2018-11-05 00:33:25 -08:00
Eric Liang	369cb833fe	[rllib] Implement custom metrics (#3144 )	2018-11-03 18:48:32 -07:00
Eric Liang	9a0f0db070	Add `ray stack` tool for debugging (#3213 )	2018-11-03 13:13:02 -07:00
Wang Qing	ca7d4c2cf5	Enable to specify driver id by user. (#3084 )	2018-11-02 19:01:50 -07:00
Robert Nishihara	5822aa2388	Rename get_task -> worker_idle in timeline. (#3179 ) * Rename get_task -> worker_idle in timeline. * Fix test.	2018-11-02 12:08:46 -07:00
Robert Nishihara	1f29a960f4	Update task_table and object_table API. (#3161 ) * Update task_table and object_table API. * Fix	2018-10-31 12:52:50 -07:00
Robert Nishihara	32f0d6b77e	Deprecate num_workers argument to ray.init and ray start. (#3114 ) * Remove num_workers argument. * Fix * Fix	2018-10-28 20:12:49 -07:00
Robert Nishihara	9868af4c7c	Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149 ) * Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. * Add logging statement and address comments. * Fix	2018-10-28 20:09:06 -07:00
Robert Nishihara	fd854ff090	Allow the node manager port and object manager port to be set through… (#3130 ) * Allow the node manager port and object manager port to be set through ray start. * Linting * Fix Java test * Address comments.	2018-10-28 17:28:41 -07:00
Eric Liang	af0c1174cd	[sgd] Merge sharded param server based SGD implementation (#3033 ) This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly. $ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \ --devices-per-worker=M --strategy=<simple\|ps> \ --warmup --object-store-memory=10000000000 Images per second total gpus total \| simple \| ps ======================================== 1 \| 218 2 (1 worker) \| 388 4 (1 worker) \| 759 4 (2 workers) \| 176 \| 623 8 (1 worker) \| 985 8 (2 workers) \| 349 \| 1031 16 (2 nodes, 2 workers) \| 600 \| 1661 16 (2 nodes, 4 workers) \| 468 \| 1712 <--- OSDI perf was 1817	2018-10-27 21:25:02 -07:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121 ) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Robert Nishihara	5aa29613db	Fix linting errors. (#3127 )	2018-10-24 16:30:00 -07:00
Robert Nishihara	9c1826ed69	Use XRay backend by default. (#3020 ) * Use XRay backend by default. * Remove irrelevant valgrind tests. * Fix * Move tests around. * Fix * Fix test * Fix test. * String/unicode fix. * Fix test * Fix unicode issue. * Minor changes * Fix bug in test_global_state.py. * Fix test. * Linting * Try arrow change and other object manager changes. * Use newer plasma client API * Small updates * Revert plasma client api change. * Update * Update arrow and allow SendObjectHeaders to fail. * Update arrow * Update python/ray/experimental/state.py Co-Authored-By: robertnishihara <robertnishihara@gmail.com> * Address comments.	2018-10-23 12:46:39 -07:00
Robert Nishihara	22dd7e0428	Add test for wait reconstruction. (#3110 )	2018-10-22 23:16:54 -07:00
Richard Liaw	40c4148d4f	Cluster Utilities for Fault Tolerance Tests (#3008 )	2018-10-20 22:56:29 -07:00
Eric Liang	59901a88a0	[rllib] Native support for Dict and Tuple spaces; fix Tuple action spaces; add prev a, r to LSTM (#3051 )	2018-10-20 15:21:22 -07:00
Philipp Moritz	2c52d9dfa0	Fix actor handle id creation when actor handle was pickled (#3074 )	2018-10-17 18:00:52 -07:00
Eric Liang	3c891c6ece	[rllib] Parallel-data loading and multi-gpu support for IMPALA (#2766 )	2018-10-15 11:02:50 -07:00
Robert Nishihara	faa31ae018	Introduce concept of resources required for placing a task. (#2837 ) * Introduce concept of resources required for placement. * Add placement resources to task spec * Update java worker * Update taskinfo.java	2018-10-04 10:35:39 -07:00
Richard Liaw	01bb073569	Suppress errors when worker or driver intentionally disconnects. (#2935 )	2018-10-04 00:06:34 -07:00
Si-Yuan	cc7e2ecdd5	Change logfile names and also allow plasma store socket to be passed in. (#2862 )	2018-10-03 10:03:53 -07:00
Robert Nishihara	3ce8eb2d4c	Test dying_worker_get and dying_worker_wait for xray. (#2997 ) This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.	2018-10-02 00:08:47 -07:00
Eric Liang	e4bea8d10e	[rllib] Default to truncate_episodes and add some more config validators (#2967 ) * update * link it * warn about truncation * fix * Update rllib-training.rst * deprecate tests failing	2018-09-30 18:37:55 -07:00
Robert Nishihara	ed6289771a	Convert runtest.py to use pytest. (#2966 ) * Convert runtest.py to use pytest. * Linting. * Fix * Fix * Fix * Fix	2018-09-30 07:59:44 -07:00
Eric Liang	747253e0f6	[rllib] Don't shuffle samples in PPO when using lstm	2018-09-30 01:13:56 -07:00
Eric Liang	3267676994	[Experimental] Add experimental distributed SGD API (#2858 ) * check in sgd api * idx * foreach_worker foreach_model * add feed_dict * update * yapf * typo * lint * plasma op change * fix plasma op * still not working * fix * fix * comments * yapf * silly flake8 * small test	2018-09-19 21:12:37 -07:00
Eric Liang	3a3782c39f	[rllib] Fix LSTM regression on truncated sequences and add regression test (#2898 ) * fix * add test * yapf * yapf * fix space * Oops that should be lstm: True * Update cartpole_lstm.py	2018-09-18 15:09:16 -07:00
Robert Nishihara	ea9d1cc887	Remove dependence on psutil. Add utility functions for getting system memory. (#2892 )	2018-09-18 15:03:29 +08:00
Hanwei Jin	dc76e51a60	bugfix: cmake copy plasma java lib from lib64 directory in centos (#2885 )	2018-09-16 22:32:09 -07:00
Robert Nishihara	f16d33593b	Mark worker as blocked and trigger reconstruction in ray.wait. (#2864 ) * Trigger reconstruction in ray.wait and mark worker as blocked. * Add test. * Linting. * Don't run new test with legacy Ray. * Only call HandleClientUnblocked if it actually blocked in ray.wait. * Reduce time to ray.wait in the test.	2018-09-13 15:28:17 -07:00
Hanwei Jin	fbf214e408	update ray cmake build process (#2853 ) * use cmake to build ray project, no need to apply build.sh before cmake, fix some abuse of cmake, improve the build performance * support boost external project, avoid using the system or build.sh boost * keep compatible with build.sh, remove boost and arrow build from it. * bugfix: parquet bison version control, plasma_java lib install problem * bugfix: cmake, do not compile plasma java client if no need * bugfix: component failures test timeout mechanism has problem for plasma manager failed case * bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master * revert some fix * set arrow python executable, fix format error in component_failures_test.py * make clean arrow python build directory * update cmake code style, back to support cmake minimum version 3.4	2018-09-12 11:19:33 -07:00
old-bear	f3c1194be3	[tune] Add AutoML algorithm of GeneticSearcher (#2699 ) Add new search algorithm (genetic) along with the base framework of the searcher (which performs some basic jobs such as logging, recording and organizing in our project). Note that this is the initial commit. In the following days, we will add example, UT, and other refinements.	2018-09-12 09:17:04 -07:00
Eric Liang	611259b2c7	Re-raise actor initialization errors on method invocation (#2843 ) If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.	2018-09-10 10:51:19 -07:00
Robert Nishihara	bd64c940e9	Push error to driver when monitor raises an exception. (#2834 )	2018-09-07 17:42:45 -07:00
Zhijun Fu	753ba76141	[Issue 2809][xray] Cleanup on driver detach (#2826 ) This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass. The following should happen when a driver exits (either gracefully or ungracefully). #2797 should be enabled and pass. Any actors created by the driver that are still running should be killed. Any workers running tasks for the driver should be killed. Any tasks for the driver in any node_manager queues should be removed. Any future tasks received by a node manager for the driver should be ignored. The driver death notification should only be received once.	2018-09-07 16:11:32 +08:00
Robert Nishihara	3f6ed537a4	Add ray.is_initialized() function. (#2818 ) * Add ray.is_initialized() function. * Add assert.	2018-09-06 21:20:59 -07:00
Eric Liang	995ac24a2c	[rllib] clarify train batch size for PPO (#2793 ) It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.	2018-09-05 12:06:13 -07:00
Eric Liang	df4788e501	[rllib/tune] Add test for fractional gpu support in xray mode; add rllib support for fractional gpu (#2768 ) * frac gpu * doc * Update rllib-training.rst * yapf * remove xray	2018-09-03 11:12:23 -07:00
Eric Liang	b37a283053	[rllib] support local mode (#2795 )	2018-09-02 23:02:19 -07:00
Robert Nishihara	0ac855e061	Push errors to all drivers when node is marked dead. (#2808 ) * Push errors to all drivers when node is marked dead. * Fix	2018-09-02 20:04:58 -07:00
Robert Nishihara	c71bbbc3af	Add test (currently skipped) that drivers release resources when exiting. (#2797 ) * Add test (currently skipped) that drivers release resources when exiting. * Add test for ungraceful driver exit. * Small fix. * Small fix	2018-09-02 17:34:48 -07:00
Robert Nishihara	1c50082498	Re-enable sharded monitor test for xray, convert to pytest. (#2804 )	2018-09-01 19:53:40 -07:00
Alexey Tumanov	fdc9688226	[xray] push warning to driver for infeasible tasks (#2784 ) This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.	2018-09-01 13:21:27 -07:00
Yucong He	5b45f0bdff	[xray] Implementing Gcs sharding (#2409 ) Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.). [+] Implement sharding for gcs tables. [+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented. [+] Move AsyncGcsClient's initialization into Connect function. [-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.	2018-08-31 15:54:30 -07:00
Robert Nishihara	eda6ebb87d	Convert some unittests to pytest. (#2779 ) * Convert multi_node_test.py to pytest. * Convert array_test.py to pytest. * Convert failure_test.py to pytest. * Convert microbenchmarks to pytest. * Convert component_failures_test.py to pytest and some minor quotes changes. * Convert tensorflow_test.py to pytest. * Convert actor_test.py to pytest. * Fix. * Fix	2018-08-31 11:24:15 -07:00
Richard Liaw	0347e6418b	[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708 )	2018-08-30 16:18:56 -07:00
Robert Nishihara	32f7d6fcf5	Add back some tests for xray. (#2772 )	2018-08-30 11:07:23 -07:00
Robert Nishihara	132f133214	Limit number of concurrent workers started by hardware concurrency. (#2753 ) * Limit number of concurrent workers started by hardware concurrency. * Check if std::thread::hardware_concurrency() returns 0. * Pass in max concurrency from Python. * Fix Java call to startRaylet. * Fix typo * Remove unnecessary cast. * Fix linting. * Cleanups on Java side. * Comment back in actor test. * Require maximum_startup_concurrency to be at least 1. * Fix linting and test. * Improve documentation. * Fix typo.	2018-08-29 14:53:40 +08:00
Robert Nishihara	b7722897b4	Deprecate 'driver_mode' argument. (#2758 ) * Deprecate 'driver_mode' argument. * Fix * Fix	2018-08-28 16:45:49 -07:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
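
Commit de047daea7 defines load as the sum of all resource demand induced by the ready tasks queued on the local raylet. A minimal Python sketch of that definition follows. It is an illustration only, not the raylet's actual C++ code: the names `queue_load` and `should_spill`, the dictionary representation of resources, and the specific spill rule (spill only when the remote node's available capacity minus its reported load still covers the task) are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical representation: resource name -> quantity, e.g. {"CPU": 2, "GPU": 1}.
Resources = Dict[str, float]


def queue_load(ready_tasks: Iterable[Resources]) -> Resources:
    """Sum the per-resource demand of all queued ready tasks.

    This mirrors the commit's definition of load; the data layout is assumed.
    """
    load = defaultdict(float)
    for demand in ready_tasks:
        for name, amount in demand.items():
            load[name] += amount
    return dict(load)


def should_spill(demand: Resources,
                 remote_available: Resources,
                 remote_load: Resources) -> bool:
    """Assumed spill rule, evaluated when a heartbeat arrives from a remote node.

    Spill only if the remote node could still fit the task after serving
    the demand it has already queued. The real policy may differ.
    """
    return all(
        remote_available.get(name, 0.0) - remote_load.get(name, 0.0) >= amount
        for name, amount in demand.items()
    )


if __name__ == "__main__":
    local_queue = [{"CPU": 1}, {"CPU": 2, "GPU": 1}]
    print(queue_load(local_queue))
    print(should_spill({"CPU": 1}, {"CPU": 4}, {"CPU": 2}))
```

Because load is measured per resource rather than as a task count, a queue of GPU-heavy tasks and a queue of CPU-only tasks are distinguishable, which is the heterogeneity-aware property the commit message describes.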

479 commits