hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-08 19:41:38 -05:00

Author	SHA1	Message	Date
Robert Nishihara	20b8b1d891	Add script for running stress tests. (#3378) * Add script for running stress tests. * Add an actor tree test where actors die with some probability * Improve test. * Small fix * Update tests. * Minor change	2018-11-27 04:28:02 -08:00
Eric Liang	e3c088fa1e	[rllib] PPO doesn't work with fractional num gpus (#3396) * frac ppo * gpu test	2018-11-27 01:14:10 -08:00
Robert Nishihara	3856533065	Fix incompatibility with most recent version of Redis. (#3379) * Fix incompatibility with most recent version of Redis. * Fix * Fixes.	2018-11-24 16:36:38 -08:00
Eric Liang	55fca828ce	[rllib] Fix use_lstm option when using custom model with dict space (#3368) ## What do these changes do? This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation. ## Related issue number Closes https://github.com/ray-project/ray/issues/3367	2018-11-23 22:51:08 -08:00
Stephanie Wang	6b3236349c	Fix memory leak in lineage cache (#3366) * Move children_ map inside Lineage * Update lineage_cache.cc * Test and fixes * Remove unused	2018-11-21 16:18:39 -08:00
Richard Liaw	784a6399b0	[tune] Node Fault Tolerance (#3238) This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.	2018-11-21 12:38:16 -08:00
Stephanie Wang	3e33f6f71b	Fix failure handling for actor death (#3359) * Broadcast actor death, clean up dummy objects * Reduce logging and clean up state when failing a task * lint * Make actor failure test nicer, reduce node timeout	2018-11-21 12:26:22 -08:00
Philipp Moritz	d3697ce4e1	Ready queue refactor to make Dispatching tasks more efficient (#3324) * put queues outside * working version, still needs to be optimized * implement round robin * proper round robin * fix spillback * update * fix * cleanup * more cleanups * fix * fix * add documentation * explanation for hash combiner * speed it up * cleanup and linting * linting * comments * Update scheduling_queue.h * temp commit * fixes * update * fix * cleanup * cleanup * lint * more prints * more prints * increase sleep * documentation * sleep * fix * fix * sleep longer * update * fix * fix * fix * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * fixes * use ordered set * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator * fix * fix test * linting * lint * update * add documentation * linting	2018-11-20 13:14:12 -08:00
Ujval Misra	b0bfd104f2	Batch heartbeats from node manager together in the monitor. (#3011)	2018-11-20 09:52:27 -08:00
Robert Nishihara	5cbc597494	Suppress duplicate pre-emptive object pushes. (#3276) * Suppress duplicate pre-emptive object pushes. * Add test. * Fix linting * Remove timer and inline recent_pushes_ into local_objects_. * Improve test. * Fix * Fix linting * Enable retrying pull from same object manager. Randomize object manager. * Speed up test * Linting * Add test. * Minor * Lengthen pull timeout and reissue pull every time a new object becomes available. * Increase pull timeout in test. * Wait for nodes to start in object manager test. * Wait longer for nodes to start up in test. * Small fixes. * _submit -> _remote * Change assert to warning.	2018-11-16 23:02:45 -08:00
Robert Nishihara	60b22d9a72	Don't unsubscribe dependencies for infeasible tasks. (#3338) * Make scheduling queues RemoveTasks return task states as well. * Add test * Don't unsubscribe for infeasible tasks when spilling over. * Linting * Address comments.	2018-11-16 11:33:00 -08:00
Robert Nishihara	d10cb570ab	Rename _submit -> _remote. (#3321)	2018-11-15 15:30:18 -08:00
Philipp Moritz	1be1455d86	Fix redis crash when duplicate messages are appended to log. (#3316)	2018-11-15 15:09:39 -08:00
Eric Liang	706dc1d473	[rllib] Add test for multi-agent support and fix IMPALA multi-agent (#3289) IMPALA support for multiagent was broken since IMPALA has a requirement that batch sizes be of a certain length. However multi-agent envs can create variable-length batches. Fix this by adding zero-padding as needed (similar to the RNN case).	2018-11-14 14:14:07 -08:00
Eric Liang	1660c9d627	Kill actor child processes on shutdown (#3297) * example * add env * test pg * change to test * add atexit test * Update rllib-env.rst * comment * revert unnecessary file * fix title when actor is idle * Update python/ray/actor.py Co-Authored-By: ericl <ekhliang@gmail.com>	2018-11-13 19:16:42 -08:00
Eric Liang	65c27c70cf	[rllib] Clean up agent resource configurations (#3296) Closes #3284	2018-11-13 18:00:03 -08:00
Richard Liaw	97f423781b	Clean up Ray processes after cluster util exits (#3278)	2018-11-13 13:18:12 -08:00
Eric Liang	bd0dbde149	[rllib] Rename ServingEnv => ExternalEnv (#3302)	2018-11-12 16:31:27 -08:00
Eric Liang	53489d2f85	[sgd] Document and add simple MNIST example (#3236)	2018-11-10 21:52:20 -08:00
Stephanie Wang	d950e92f63	Allow multiple threads to call ray.get and ray.wait (#3244) * Handle multiple threads calling ray.get * Multithreaded ray.wait * Pass in current task ID in java backend * Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get * Fix test * Some cleanups * Improve error message * Add assertion * Cleanup, throw error in HandleTaskUnblocked if task not actually blocked * lint * Fix python worker reset * Fix references to reconstruct_objects * Linting * java lint * Fix java * Fix iterator	2018-11-07 22:39:28 -08:00
Richard Liaw	0bab8ed95c	Expose internal config parameters for starting Ray (#3246) ## What do these changes do? This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly. Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible. #3239 depends on this. TODO: - [x] Add documentation to method arguments before merging. - [x] Add test to verify this works? ## Related issue number	2018-11-07 21:46:02 -08:00
Robert Nishihara	1dd5d92789	Enable timeline visualizations of object transfers. (#3255) * Plot object transfers. * Linting	2018-11-07 12:45:59 -08:00
Eric Liang	725df3a485	Set the process title in workers and actors (#3219)	2018-11-06 14:59:22 -08:00
Stephanie Wang	bf88aa5013	Increase timeout before reconstruction is triggered (#3217) * Increase timeout to 10s * Skip eviction reconstruction tests * Add stress test for many actors to one * Fix test by shortening it. * lower number of processes in stress test * Skip slow test	2018-11-05 18:03:50 -08:00
Eric Liang	813f51769f	[rllib] Fix rllib rollouts script and add test (#3211) ## What do these changes do? Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py ## Related issue number https://github.com/ray-project/ray/issues/3206 https://github.com/ray-project/ray/issues/3204	2018-11-05 00:33:25 -08:00
Eric Liang	369cb833fe	[rllib] Implement custom metrics (#3144)	2018-11-03 18:48:32 -07:00
Eric Liang	9a0f0db070	Add `ray stack` tool for debugging (#3213)	2018-11-03 13:13:02 -07:00
Wang Qing	ca7d4c2cf5	Enable to specify driver id by user. (#3084)	2018-11-02 19:01:50 -07:00
Robert Nishihara	5822aa2388	Rename get_task -> worker_idle in timeline. (#3179) * Rename get_task -> worker_idle in timeline. * Fix test.	2018-11-02 12:08:46 -07:00
Robert Nishihara	1f29a960f4	Update task_table and object_table API. (#3161) * Update task_table and object_table API. * Fix	2018-10-31 12:52:50 -07:00
Robert Nishihara	32f0d6b77e	Deprecate num_workers argument to ray.init and ray start. (#3114) * Remove num_workers argument. * Fix * Fix	2018-10-28 20:12:49 -07:00
Robert Nishihara	9868af4c7c	Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. (#3149) * Use /tmp instead of /dev/shm for object store on Linux if /dev/shm is too small. * Add logging statement and address comments. * Fix	2018-10-28 20:09:06 -07:00
Robert Nishihara	fd854ff090	Allow the node manager port and object manager port to be set through… (#3130) * Allow the node manager port and object manager port to be set through ray start. * Linting * Fix Java test * Address comments.	2018-10-28 17:28:41 -07:00
Eric Liang	af0c1174cd	[sgd] Merge sharded param server based SGD implementation (#3033) This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly. $ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N --devices-per-worker=M --strategy=<simple|ps> --warmup --object-store-memory=10000000000 Images per second total (gpus total | simple | ps): 1 | 218; 2 (1 worker) | 388; 4 (1 worker) | 759; 4 (2 workers) | 176 | 623; 8 (1 worker) | 985; 8 (2 workers) | 349 | 1031; 16 (2 nodes, 2 workers) | 600 | 1661; 16 (2 nodes, 4 workers) | 468 | 1712 <--- OSDI perf was 1817	2018-10-27 21:25:02 -07:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Robert Nishihara	5aa29613db	Fix linting errors. (#3127)	2018-10-24 16:30:00 -07:00
Robert Nishihara	9c1826ed69	Use XRay backend by default. (#3020) * Use XRay backend by default. * Remove irrelevant valgrind tests. * Fix * Move tests around. * Fix * Fix test * Fix test. * String/unicode fix. * Fix test * Fix unicode issue. * Minor changes * Fix bug in test_global_state.py. * Fix test. * Linting * Try arrow change and other object manager changes. * Use newer plasma client API * Small updates * Revert plasma client api change. * Update * Update arrow and allow SendObjectHeaders to fail. * Update arrow * Update python/ray/experimental/state.py Co-Authored-By: robertnishihara <robertnishihara@gmail.com> * Address comments.	2018-10-23 12:46:39 -07:00
Robert Nishihara	22dd7e0428	Add test for wait reconstruction. (#3110)	2018-10-22 23:16:54 -07:00
Richard Liaw	40c4148d4f	Cluster Utilities for Fault Tolerance Tests (#3008)	2018-10-20 22:56:29 -07:00
Eric Liang	59901a88a0	[rllib] Native support for Dict and Tuple spaces; fix Tuple action spaces; add prev a, r to LSTM (#3051)	2018-10-20 15:21:22 -07:00
Philipp Moritz	2c52d9dfa0	Fix actor handle id creation when actor handle was pickled (#3074)	2018-10-17 18:00:52 -07:00
Eric Liang	3c891c6ece	[rllib] Parallel-data loading and multi-gpu support for IMPALA (#2766)	2018-10-15 11:02:50 -07:00
Robert Nishihara	faa31ae018	Introduce concept of resources required for placing a task. (#2837) * Introduce concept of resources required for placement. * Add placement resources to task spec * Update java worker * Update taskinfo.java	2018-10-04 10:35:39 -07:00
Richard Liaw	01bb073569	Suppress errors when worker or driver intentionally disconnects. (#2935)	2018-10-04 00:06:34 -07:00
Si-Yuan	cc7e2ecdd5	Change logfile names and also allow plasma store socket to be passed in. (#2862)	2018-10-03 10:03:53 -07:00
Robert Nishihara	3ce8eb2d4c	Test dying_worker_get and dying_worker_wait for xray. (#2997) This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.	2018-10-02 00:08:47 -07:00
Eric Liang	e4bea8d10e	[rllib] Default to truncate_episodes and add some more config validators (#2967) * update * link it * warn about truncation * fix * Update rllib-training.rst * deprecate tests failing	2018-09-30 18:37:55 -07:00
Robert Nishihara	ed6289771a	Convert runtest.py to use pytest. (#2966) * Convert runtest.py to use pytest. * Linting. * Fix * Fix * Fix * Fix	2018-09-30 07:59:44 -07:00
Eric Liang	747253e0f6	[rllib] Don't shuffle samples in PPO when using lstm	2018-09-30 01:13:56 -07:00
Eric Liang	3267676994	[Experimental] Add experimental distributed SGD API (#2858) * check in sgd api * idx * foreach_worker foreach_model * add feed_dict * update * yapf * typo * lint * plasma op change * fix plasma op * still not working * fix * fix * comments * yapf * silly flake8 * small test	2018-09-19 21:12:37 -07:00

503 commits