hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 05:16:49 -04:00

Author	SHA1	Message	Date
Yuhong Guo	b9e1977fae	Fix failure of test_free_objects_multi_node (#3481 ) It is possible that `test_free_objects_multi_node` would fail sometimes. If we run this test 20 times, we may find at least one failure. The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and may not clean up all the worker objects held by the plasma client. In this PR, I change the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.	2018-12-06 15:55:49 -05:00
Eric Liang	412aaa5195	[tune] Deprecate ambiguous function values (use tune.function / tune.sample_from instead) (#3457 ) * wip * exclude	2018-12-06 11:35:20 -08:00
Eric Liang	d864f299d7	[rllib] fixes from dogfooding multi-agent (#3456 ) auto wrap multi-agent dict and tuple spaces by keeping a policy -> preprocessor in the sampler add some Q-learning debug stats report min, max of custom metrics better errors	2018-12-05 23:31:45 -08:00
shane	7a79b7f62c	increase container memory and shm to 20G (#3475 ) * increase container memory and shm to 20G * variables are POWERFUL	2018-12-05 14:59:07 -08:00
Si-Yuan	2e6f9bedf2	Add the extra fallback for serialization (#3468 ) * Add the extra fallback for serialization. * Better comments & warnings. quotes. * Update test/runtest.py Co-Authored-By: suquark <suquark@gmail.com> * Update test/runtest.py Co-Authored-By: suquark <suquark@gmail.com> * linting * Don't hijack too many errors. * simplify the test * Update runtest.py * simplify	2018-12-05 13:09:08 -08:00
Philipp Moritz	06f6431765	Make test_actor_multiple_gpus_from_multiple_tasks less stressful in travis	2018-12-04 17:44:33 -08:00
Eric Liang	93a9d32288	[docs] Switch docs to use rllib train instead of train.py	2018-12-04 17:36:06 -08:00
Richard Liaw	9d0bd50e78	[tune] Component notification on node failure + Tests (#3414 ) Changes include: - Notify Components on Requeue - Slight refactoring of Node Failure handling - Better tests	2018-12-04 14:47:31 -08:00
Eric Liang	ce355d13d4	[rllib] Allow envs to be auto-registered; add on_train_result callback with curriculum example (#3451 ) * train step and docs * debug * doc * doc * fix examples * fix code * integration test * fix * ... * space * instance * Update .travis.yml * fix test	2018-12-03 23:15:43 -08:00
Kristian Hartikainen	be6567e6fd	Tweak/exec attach info (#3447 ) * Add custom cluster name to exec info * Update submit info to match exec info	2018-12-03 21:39:43 -08:00
Eric Liang	d8205976e8	[rllib] Auto clip actions to Box space range; deprecate squash_to_range (#3426 ) * fix clip * tweak wording * remove squash entirely * Update rllib-models.rst * fix argument order * Apply suggestions from code review Co-Authored-By: ericl <ekhliang@gmail.com>	2018-12-03 19:55:25 -08:00
Eric Liang	7abfbfd2f7	[rllib] Better error message for unsupported non-atari image observation sizes (#3444 )	2018-12-03 01:24:36 -08:00
Stephanie Wang	4abafd7e62	Fix bug in ray.wait (#3445 ) ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely: 1. Objects A and B are put in the cluster. 2. Client calls ray.wait([A, B], num_returns=1). 3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each. 4. Callback for A fires. The wait completes and the request is removed. 5. Callback for B fires. The wait request no longer exists and raylet crashes.	2018-12-01 19:40:33 -08:00
Eric Liang	13c8ce4d84	Update README.rst with 0.6.0 version number. (#3453 )	2018-12-01 19:16:45 -08:00
Philipp Moritz	c5b5cdae33	Upgrade Arrow to include Plasma TensorFlow Op release fix (#3448 ) This includes a fix so the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061) and the possibility to store arrow data structures in plasma (https://github.com/apache/arrow/pull/2832). https://github.com/ray-project/ray/issues/3404	2018-12-01 16:15:09 -08:00
Hao Chen	abd37df41e	Add stress test for Java worker (#3424 )	2018-12-01 16:11:09 -08:00
Robert Nishihara	0603e0b73a	Bump version from 0.5.3 to 0.6.0. (#3420 )	2018-12-01 11:39:36 -08:00
Devin Petersohn	57512616e1	Update readme to contain logo (#3443 ) * Adding logo to readme * Updating link * Add badge * Addressing comments * Moving logo * Change align * Move image	2018-11-30 18:28:35 -08:00
GiliR4t1qbit	454d3aa07d	[docs] Snippet did not have a code-block tag above it (#3442 )	2018-11-30 16:39:40 -08:00
Stephanie Wang	447604a9fe	Use actor ID for the dummy object (#3437 )	2018-11-29 22:31:04 -08:00
Eric Liang	07d8cbf414	[rllib] Support batch norm layers (#3369 ) * batch norm * lint * fix dqn/ddpg update ops * bn model * Update tf_policy_graph.py * Update multi_gpu_impl.py * Apply suggestions from code review Co-Authored-By: ericl <ekhliang@gmail.com>	2018-11-29 13:33:39 -08:00
Devin Petersohn	4d2010a852	Ship Modin with Ray. (#3109 )	2018-11-29 20:05:24 +01:00
Stephanie Wang	48a5935224	Fault tolerance for actor creation (#3422 ) * Add regression test * Request actor creation if no actor location found * Comments * Address comments * Increase test timeout * Trigger test	2018-11-29 10:48:35 -08:00
Chunyang Wen	fd7e494344	Remove: duplicate feed_dict constructing (#3431 )	2018-11-29 10:21:46 -08:00
Kristian Hartikainen	7e319dbf0c	Automatically indent tune logger params (#3399 )	2018-11-29 00:15:50 -08:00
Eric Liang	c46ea2ff4b	Click 0.7 changes the naming convention for commands; fix this	2018-11-28 14:59:58 -08:00
Tianming Xu	139fbf7884	Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403 )	2018-11-27 23:51:18 -08:00
Robert Nishihara	82863b5251	[autoscaler] Update autoscaler to use heartbeat batches. (#3409 )	2018-11-27 23:46:27 -08:00
Eric Liang	f0df97db6f	[rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384 )	2018-11-27 23:35:19 -08:00
Eric Liang	c2108ca64f	Don't put entire actor registry in debug string since it's too long (#3395 )	2018-11-27 16:48:12 -08:00
Eric Liang	0d56fc10cc	Move setproctitle to ray[debug] package (#3415 )	2018-11-27 09:50:59 -08:00
Robert Nishihara	20b8b1d891	Add script for running stress tests. (#3378 ) * Add script for running stress tests. * Add an actor tree test where actors die with some probability * Improve test. * Small fix * Update tests. * Minor change	2018-11-27 04:28:02 -08:00
Eric Liang	e3c088fa1e	[rllib] PPO doesn't work with fractional num gpus (#3396 ) * frac ppo * gpu test	2018-11-27 01:14:10 -08:00
Eric Liang	aa94d3dd50	[autoscaler] Allow more than 5s from node creation to first heartbeat (#3385 )	2018-11-26 17:25:05 -08:00
Robert Nishihara	0f0099fb90	UI changes, fix the task timeline and add the object transfer timeline to UI. (#3397 ) * Saving * Fix cmake and remove object/task search boxes. * Add comment	2018-11-25 10:16:49 -08:00
Eric Liang	b85e7b43f3	[rllib] Refactor the sampler (#3387 ) * refactor * fix test * add perf test * Update sampler.py	2018-11-24 18:16:54 -08:00
Robert Nishihara	3856533065	Fix incompatibility with most recent version of Redis. (#3379 ) * Fix incompatibility with most recent version of Redis. * Fix * Fixes.	2018-11-24 16:36:38 -08:00
Eric Liang	18a8dbfcfb	[rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386 ) Closes #2965	2018-11-24 00:56:50 -08:00
Eric Liang	55fca828ce	[rllib] Fix use_lstm option when using custom model with dict space (#3368 ) ## What do these changes do? This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation. ## Related issue number Closes https://github.com/ray-project/ray/issues/3367	2018-11-23 22:51:08 -08:00
Eric Liang	8b76bab25c	[rllib] docs for td3 (#3381 ) * td3 doc * Update rllib-env.rst	2018-11-22 13:36:47 -08:00
Eric Liang	41b6b50d09	fix py3 (#3382 )	2018-11-22 11:43:52 -08:00
GiliR4t1qbit	b9ae5edf74	When getting a role/profile, catch only exception that indicates the role/profile already exists, allow others to be raised (#3383 )	2018-11-22 09:42:58 -08:00
Jones Wong	24bfe8ab76	Enable Twin Delayed DDPG for RLlib DDPG agent (#3353 )	2018-11-21 20:03:20 -08:00
Stephanie Wang	6b3236349c	Fix memory leak in lineage cache (#3366 ) * Move children_ map inside Lineage * Update lineage_cache.cc * Test and fixes * Remove unused	2018-11-21 16:18:39 -08:00
Richard Liaw	784a6399b0	[tune] Node Fault Tolerance (#3238 ) This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.	2018-11-21 12:38:16 -08:00
Stephanie Wang	3e33f6f71b	Fix failure handling for actor death (#3359 ) * Broadcast actor death, clean up dummy objects * Reduce logging and clean up state when failing a task * lint * Make actor failure test nicer, reduce node timeout	2018-11-21 12:26:22 -08:00
Philipp Moritz	1a926c9b7c	Fix $MACOSX_DEPLOYMENT_TARGET (#3337 )	2018-11-21 10:56:17 -08:00
Eric Liang	686cf20951	Remove uses of std::list::size (#3358 ) * worker pool and client conn * Fix linting * unordered set * move	2018-11-20 14:47:55 -08:00
Richard Liaw	c24d87b4d1	[autoscaler] Submit command (#3312 )	2018-11-20 14:03:34 -08:00
Philipp Moritz	d3697ce4e1	Ready queue refactor to make Dispatching tasks more efficient (#3324 ) * put queues outside * working version, still needs to be optimized * implement round robin * proper round robin * fix spillback * update * fix * cleanup * more cleanups * fix * fix * add documentation * explanation for hash combiner * speed it up * cleanup and linting * linting * comments * Update scheduling_queue.h * temp commit * fixes * update * fix * cleanup * cleanup * lint * more prints * more prints * increase sleep * documentation * sleep * fix * fix * sleep longer * update * fix * fix * fix * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * fixes * use ordered set * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator * fix * fix test * linting * lint * update * add documentation * linting	2018-11-20 13:14:12 -08:00
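
The race described in commit 4abafd7e62 (Fix bug in ray.wait) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class `WaitManager`, its methods, and the identifier `"w1"` are hypothetical names invented for illustration and are not taken from the raylet source. It shows one way a late callback for an already completed wait request could be ignored rather than treated as a fatal error.

```python
class WaitManager:
    """Toy model of a wait-request table (hypothetical, not raylet code)."""

    def __init__(self):
        self._requests = {}   # wait_id -> state of a still-active wait
        self.completed = {}   # wait_id -> object ids returned to the client

    def wait(self, wait_id, object_ids, num_returns):
        # Register a wait request for the given objects.
        self._requests[wait_id] = {
            "remaining": set(object_ids),
            "found": [],
            "num_returns": num_returns,
        }

    def on_location_callback(self, wait_id, object_id):
        """Handle a location callback; return False if it arrived too late."""
        request = self._requests.get(wait_id)
        if request is None:
            # The wait already completed and was removed (step 5 in the
            # commit message). Ignore the callback instead of crashing.
            return False
        if object_id in request["remaining"]:
            request["remaining"].discard(object_id)
            request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            # Enough objects are ready: complete and remove the request.
            self.completed[wait_id] = list(request["found"])
            del self._requests[wait_id]
        return True


manager = WaitManager()
manager.wait("w1", ["A", "B"], num_returns=1)
first = manager.on_location_callback("w1", "A")   # completes the wait
second = manager.on_location_callback("w1", "B")  # late callback, ignored
```

Here the callback for A completes the request, and the callback for B finds no matching request and is dropped, which mirrors the order of events listed in the commit message.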

2753 commits