This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors are restarted without checking whether resources are available. This can cause problems if the cluster loses resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
- During recovery, TrialSchedulers and SearchAlgorithms are notified of this (via `trial_runner.stop_trial`) so that they do not block waiting for a trial that is no longer running.
Remaining questions:
- Should `last_result` be consistent during restore?
  Yes, but not for trials that have not yet been checkpointed.
- Waiting for some PRs to merge first (#3239)
Closes #2851.
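Below is a minimal sketch of the recovery policy described above, not Tune's actual implementation: `has_resources`, `restore_trial`, and `set_status` are hypothetical stand-ins for TrialRunner/Trial internals, while `trial_runner.stop_trial` is the notification hook mentioned above.

```python
# Hedged sketch: the helper names below are hypothetical stand-ins.
def recover_trials_from_failed_node(trial_runner, failed_trials):
    for trial in failed_trials:
        # Notify the TrialScheduler / SearchAlgorithm so they stop
        # waiting on a trial that is no longer running.
        trial_runner.stop_trial(trial)
        if trial_runner.has_resources(trial.resources):  # hypothetical check
            # Best effort: resume the trial on another node from its
            # last checkpoint.
            trial_runner.restore_trial(trial)  # hypothetical helper
        else:
            # Cluster is saturated: queue the trial instead of blocking.
            trial.set_status("PENDING")  # hypothetical setter
```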
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
* Make scheduling queues RemoveTasks return task states as well.
* Add test
* Don't unsubscribe for infeasible tasks when spilling over.
* Linting
* Address comments.
IMPALA support for multi-agent was broken, since IMPALA requires batches of a fixed length, whereas multi-agent envs can produce variable-length batches.
Fix this by adding zero-padding as needed (similar to the RNN case).
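A minimal sketch of the padding idea, assuming a dict-of-arrays batch layout; `zero_pad_batch` and `max_batch_len` are illustrative names, not RLlib's exact SampleBatch API:

```python
import numpy as np

def zero_pad_batch(batch, max_batch_len):
    """Zero-pad every column of a batch up to a fixed length (hedged sketch)."""
    padded = {}
    for key, col in batch.items():
        col = np.asarray(col)
        pad_len = max_batch_len - col.shape[0]
        assert pad_len >= 0, "batch longer than max_batch_len"
        # Pad only the leading (sample/time) dimension with zeros;
        # feature dimensions are left untouched.
        pad_width = [(0, pad_len)] + [(0, 0)] * (col.ndim - 1)
        padded[key] = np.pad(col, pad_width, mode="constant")
    return padded
```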
* example
* add env
* test pg
* change to test
* add atexit test
* Update rllib-env.rst
* comment
* revert unnecessary file
* fix title when actor is idle
* Update python/ray/actor.py
Co-Authored-By: ericl <ekhliang@gmail.com>
## What do these changes do?
This PR exposes the command-line option for passing a config parameter. This is important so that certain tests (e.g., fault-tolerance tests that remove nodes) can run quickly.
Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.
#3239 depends on this.
TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add test to verify this works?
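For illustration, a hedged sketch of how such a config parameter might be used from a fault-tolerance test; the `_internal_config` argument name and the `num_heartbeats_timeout` key are assumptions standing in for whatever option this PR exposes, not confirmed by it:

```python
import json
import ray

# Assumed names: `_internal_config` and `num_heartbeats_timeout` are
# illustrative placeholders for the exposed config parameter.
ray.init(
    _internal_config=json.dumps({
        # Declare a node dead after fewer missed heartbeats so that
        # node-removal tests finish quickly.
        "num_heartbeats_timeout": 10,
    })
)
```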
* Increase timeout to 10s
* Skip eviction reconstruction tests
* Add stress test for many actors to one
* Fix test by shortening it.
* lower number of processes in stress test
* Skip slow test
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.
```
$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
    --devices-per-worker=M --strategy=<simple|ps> \
    --warmup --object-store-memory=10000000000
```

Images per second total:

| gpus total | simple | ps |
| --- | --- | --- |
| 1 | 218 | |
| 2 (1 worker) | 388 | |
| 4 (1 worker) | 759 | |
| 4 (2 workers) | 176 | 623 |
| 8 (1 worker) | 985 | |
| 8 (2 workers) | 349 | 1031 |
| 16 (2 nodes, 2 workers) | 600 | 1661 |
| 16 (2 nodes, 4 workers) | 468 | 1712 (OSDI perf was 1817) |
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Later, the object that the worker was waiting for becomes available. We need to make sure the backend does not try to send a message to the dead worker and then die. Related to #2790.
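A hedged sketch of that scenario as a driver script, not Ray's actual regression test; the `PidBox` actor is a hypothetical device for locating the blocked worker's process, and the object ref is wrapped in a list so that it is passed to the task without being resolved first:

```python
import os
import signal
import time

import ray

@ray.remote
class PidBox:
    """Hypothetical helper actor that stores the blocked worker's PID."""
    def __init__(self):
        self.pid = None

    def set(self, pid):
        self.pid = pid

    def get(self):
        return self.pid

@ray.remote
def slow_value():
    time.sleep(5)  # the object becomes available only after the worker dies
    return 1

@ray.remote
def block_on(obj_list, box):
    ray.get(box.set.remote(os.getpid()))  # report our PID before blocking
    ray.wait(obj_list)  # blocks here: the inner ref is not ready yet

ray.init()
box = PidBox.remote()
obj = slow_value.remote()
block_on.remote([obj], box)

# Wait for the blocked worker to report its PID, then kill it.
pid = None
while pid is None:
    time.sleep(0.1)
    pid = ray.get(box.get.remote())
os.kill(pid, signal.SIGKILL)

# The object now becomes available; the backend must not crash while
# trying to notify the dead worker that its ray.wait was satisfied.
assert ray.get(obj) == 1
```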
* Trigger reconstruction in ray.wait and mark worker as blocked.
* Add test.
* Linting.
* Don't run new test with legacy Ray.
* Only call HandleClientUnblocked if it actually blocked in ray.wait.
* Reduce time to ray.wait in the test.
* use cmake to build the ray project; no need to apply build.sh before cmake; fix some misuse of cmake; improve build performance
* support boost as an external project, avoiding the system or build.sh boost
* keep compatibility with build.sh; remove the boost and arrow builds from it
* bugfix: parquet bison version control, plasma_java lib install problem
* bugfix: cmake should not compile the plasma java client when it is not needed
* bugfix: the component failures test timeout mechanism has a problem in the plasma manager failure case
* bugfix: arrow uses lib64 on centos; travis check-git-clang-format-output.sh does not support branches other than master
* revert some fixes
* set the arrow python executable; fix a format error in component_failures_test.py
* clean the arrow python build directory
* update cmake code style; go back to supporting cmake minimum version 3.4