hiro/ray

Mirror of https://github.com/vale981/ray, synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Kristian Hartikainen	be6567e6fd	Tweak/exec attach info (#3447) * Add custom cluster name to exec info * Update submit info to match exec info	2018-12-03 21:39:43 -08:00
Eric Liang	d8205976e8	[rllib] Auto clip actions to Box space range; deprecate squash_to_range (#3426) * fix clip * tweak wording * remove squash entirely * Update rllib-models.rst * fix argument order * Apply suggestions from code review Co-Authored-By: ericl <ekhliang@gmail.com>	2018-12-03 19:55:25 -08:00
Eric Liang	7abfbfd2f7	[rllib] Better error message for unsupported non-atari image observation sizes (#3444)	2018-12-03 01:24:36 -08:00
Stephanie Wang	4abafd7e62	Fix bug in ray.wait (#3445) ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely: 1. Objects A and B are put in the cluster. 2. Client calls ray.wait([A, B], num_returns=1). 3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each. 4. Callback for A fires. The wait completes and the request is removed. 5. Callback for B fires. The wait request no longer exists and raylet crashes.	2018-12-01 19:40:33 -08:00
Eric Liang	13c8ce4d84	Update README.rst with 0.6.0 version number. (#3453)	2018-12-01 19:16:45 -08:00
Philipp Moritz	c5b5cdae33	Upgrade Arrow to include Plasma TensorFlow Op release fix (#3448) This includes a fix so the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061) and the possibility to store arrow data structures in plasma (https://github.com/apache/arrow/pull/2832). https://github.com/ray-project/ray/issues/3404	2018-12-01 16:15:09 -08:00
Hao Chen	abd37df41e	Add stress test for Java worker (#3424)	2018-12-01 16:11:09 -08:00
Robert Nishihara	0603e0b73a	Bump version from 0.5.3 to 0.6.0. (#3420)	2018-12-01 11:39:36 -08:00
Devin Petersohn	57512616e1	Update readme to contain logo (#3443) * Adding logo to readme * Updating link * Add badge * Addressing comments * Moving logo * Change align * Move image	2018-11-30 18:28:35 -08:00
GiliR4t1qbit	454d3aa07d	[docs] Snippet did not have a code-block tag above it (#3442)	2018-11-30 16:39:40 -08:00
Stephanie Wang	447604a9fe	Use actor ID for the dummy object (#3437)	2018-11-29 22:31:04 -08:00
Eric Liang	07d8cbf414	[rllib] Support batch norm layers (#3369) * batch norm * lint * fix dqn/ddpg update ops * bn model * Update tf_policy_graph.py * Update multi_gpu_impl.py * Apply suggestions from code review Co-Authored-By: ericl <ekhliang@gmail.com>	2018-11-29 13:33:39 -08:00
Devin Petersohn	4d2010a852	Ship Modin with Ray. (#3109)	2018-11-29 20:05:24 +01:00
Stephanie Wang	48a5935224	Fault tolerance for actor creation (#3422) * Add regression test * Request actor creation if no actor location found * Comments * Address comments * Increase test timeout * Trigger test	2018-11-29 10:48:35 -08:00
Chunyang Wen	fd7e494344	Remove: duplicate feed_dict constructing (#3431)	2018-11-29 10:21:46 -08:00
Kristian Hartikainen	7e319dbf0c	Automatically indent tune logger params (#3399)	2018-11-29 00:15:50 -08:00
Eric Liang	c46ea2ff4b	Click 0.7 changes the naming convention for commands; fix this	2018-11-28 14:59:58 -08:00
Tianming Xu	139fbf7884	Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403)	2018-11-27 23:51:18 -08:00
Robert Nishihara	82863b5251	[autoscaler] Update autoscaler to use heartbeat batches. (#3409)	2018-11-27 23:46:27 -08:00
Eric Liang	f0df97db6f	[rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384)	2018-11-27 23:35:19 -08:00
Eric Liang	c2108ca64f	Don't put entire actor registry in debug string since it's too long (#3395)	2018-11-27 16:48:12 -08:00
Eric Liang	0d56fc10cc	Move setproctitle to ray[debug] package (#3415)	2018-11-27 09:50:59 -08:00
Robert Nishihara	20b8b1d891	Add script for running stress tests. (#3378) * Add script for running stress tests. * Add an actor tree test where actors die with some probability * Improve test. * Small fix * Update tests. * Minor change	2018-11-27 04:28:02 -08:00
Eric Liang	e3c088fa1e	[rllib] PPO doesn't work with fractional num gpus (#3396) * frac ppo * gpu test	2018-11-27 01:14:10 -08:00
Eric Liang	aa94d3dd50	[autoscaler] Allow more than 5s from node creation to first heartbeat (#3385)	2018-11-26 17:25:05 -08:00
Robert Nishihara	0f0099fb90	UI changes, fix the task timeline and add the object transfer timeline to UI. (#3397) * Saving * Fix cmake and remove object/task search boxes. * Add comment	2018-11-25 10:16:49 -08:00
Eric Liang	b85e7b43f3	[rllib] Refactor the sampler (#3387) * refactor * fix test * add perf test * Update sampler.py	2018-11-24 18:16:54 -08:00
Robert Nishihara	3856533065	Fix incompatibility with most recent version of Redis. (#3379) * Fix incompatibility with most recent version of Redis. * Fix * Fixes.	2018-11-24 16:36:38 -08:00
Eric Liang	18a8dbfcfb	[rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386) Closes #2965	2018-11-24 00:56:50 -08:00
Eric Liang	55fca828ce	[rllib] Fix use_lstm option when using custom model with dict space (#3368) ## What do these changes do? This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation. ## Related issue number Closes https://github.com/ray-project/ray/issues/3367	2018-11-23 22:51:08 -08:00
Eric Liang	8b76bab25c	[rllib] docs for td3 (#3381) * td3 doc * Update rllib-env.rst	2018-11-22 13:36:47 -08:00
Eric Liang	41b6b50d09	fix py3 (#3382)	2018-11-22 11:43:52 -08:00
GiliR4t1qbit	b9ae5edf74	When getting a role/profile, catch only exception that indicates the role/profile already exists, allow others to be raised (#3383)	2018-11-22 09:42:58 -08:00
Jones Wong	24bfe8ab76	Enable Twin Delayed DDPG for RLlib DDPG agent (#3353)	2018-11-21 20:03:20 -08:00
Stephanie Wang	6b3236349c	Fix memory leak in lineage cache (#3366) * Move children_ map inside Lineage * Update lineage_cache.cc * Test and fixes * Remove unused	2018-11-21 16:18:39 -08:00
Richard Liaw	784a6399b0	[tune] Node Fault Tolerance (#3238) This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.	2018-11-21 12:38:16 -08:00
Stephanie Wang	3e33f6f71b	Fix failure handling for actor death (#3359) * Broadcast actor death, clean up dummy objects * Reduce logging and clean up state when failing a task * lint * Make actor failure test nicer, reduce node timeout	2018-11-21 12:26:22 -08:00
Philipp Moritz	1a926c9b7c	Fix $MACOSX_DEPLOYMENT_TARGET (#3337)	2018-11-21 10:56:17 -08:00
Eric Liang	686cf20951	Remove uses of std::list::size (#3358) * worker pool and client conn * Fix linting * unordered set * move	2018-11-20 14:47:55 -08:00
Richard Liaw	c24d87b4d1	[autoscaler] Submit command (#3312)	2018-11-20 14:03:34 -08:00
Philipp Moritz	d3697ce4e1	Ready queue refactor to make Dispatching tasks more efficient (#3324) * put queues outside * working version, still needs to be optimized * implement round robin * proper round robin * fix spillback * update * fix * cleanup * more cleanups * fix * fix * add documentation * explanation for hash combiner * speed it up * cleanup and linting * linting * comments * Update scheduling_queue.h * temp commit * fixes * update * fix * cleanup * cleanup * lint * more prints * more prints * increase sleep * documentation * sleep * fix * fix * sleep longer * update * fix * fix * fix * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * fixes * use ordered set * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator * fix * fix test * linting * lint * update * add documentation * linting	2018-11-20 13:14:12 -08:00
Ujval Misra	b0bfd104f2	Batch heartbeats from node manager together in the monitor. (#3011)	2018-11-20 09:52:27 -08:00
Eric Liang	abdc3b592e	[rllib] Update multi-gpu impala numbers (#3327)	2018-11-19 20:55:27 -08:00
Eric Liang	5972c29d28	[rllib] Set ape-x local exploration to 0, also load explorations before training steps (#3349) ## What do these changes do? This should fix high explorations being used after restore / for rollouts. ## Related issue number (dev list issue)	2018-11-19 20:36:25 -08:00
Eric Liang	afc48d7b77	Don't setpgid() on actors (#3347)	2018-11-19 17:35:26 -08:00
Robert Nishihara	f2b5500642	Add ordered_set container. (#3352) * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator	2018-11-19 17:01:18 -08:00
Eric Liang	d4dbd27e0d	Don't retry IPC connect an absurd number of times (#3355)	2018-11-19 16:23:59 -08:00
Eric Liang	e4bb5d8d16	Fix logging when ray cluster utils is used	2018-11-18 21:49:27 -08:00
Eric Liang	61e3bbbfee	Update stale example links	2018-11-17 15:40:38 -08:00
Robert Nishihara	5cbc597494	Suppress duplicate pre-emptive object pushes. (#3276) * Suppress duplicate pre-emptive object pushes. * Add test. * Fix linting * Remove timer and inline recent_pushes_ into local_objects_. * Improve test. * Fix * Fix linting * Enable retrying pull from same object manager. Randomize object manager. * Speed up test * Linting * Add test. * Minor * Lengthen pull timeout and reissue pull every time a new object becomes available. * Increase pull timeout in test. * Wait for nodes to start in object manager test. * Wait longer for nodes to start up in test. * Small fixes. * _submit -> _remote * Change assert to warning.	2018-11-16 23:02:45 -08:00
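
The callback ordering described in the ray.wait fix (4abafd7e62) can be replayed with a small toy model. This is only a sketch of the five steps in that commit message, not the raylet's actual code: `WaitManager`, `on_object_located`, `replay_scenario`, and the `tolerate_stale_callbacks` flag are hypothetical names invented here. It also assumes that ignoring a callback for an already completed request is an acceptable way to avoid the crash; the commit message does not say how the fix was implemented.

```python
class WaitManager:
    """Toy model of wait requests that are completed by location callbacks.

    Hypothetical illustration only; this is not the raylet implementation.
    """

    def __init__(self, tolerate_stale_callbacks):
        self.tolerate_stale_callbacks = tolerate_stale_callbacks
        self.requests = {}   # wait_id -> {"found": [...], "num_returns": int}
        self.completed = {}  # wait_id -> object IDs returned to the client

    def wait(self, wait_id, object_ids, num_returns):
        # Step 2: the client asks to wait on several objects.
        self.requests[wait_id] = {"found": [], "num_returns": num_returns}

    def on_object_located(self, wait_id, object_id):
        request = self.requests.get(wait_id)
        if request is None:
            # Step 5: the request was already completed and removed.
            if self.tolerate_stale_callbacks:
                return  # assumed remedy: drop the stale callback
            raise RuntimeError(
                "callback for %s arrived after wait %s completed"
                % (object_id, wait_id)
            )
        request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            # Step 4: enough objects are ready, so the request is removed.
            self.completed[wait_id] = list(request["found"])
            del self.requests[wait_id]


def replay_scenario(tolerate_stale_callbacks):
    manager = WaitManager(tolerate_stale_callbacks)
    manager.wait("w1", ["A", "B"], num_returns=1)
    # Step 3: locations are cached for both objects, so two callbacks
    # are posted before either one has run.
    posted = [("w1", "A"), ("w1", "B")]
    for wait_id, object_id in posted:
        manager.on_object_located(wait_id, object_id)
    return manager.completed["w1"]
```

With `tolerate_stale_callbacks=False` the second callback raises, mirroring the crash in step 5. With `True` the wait returns only the first object and the late callback is dropped.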

2244 commits