hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 05:16:49 -04:00

Author	SHA1	Message	Date
Robert Nishihara	1480f309c3	[doc] Replace runtest.py with mini_test.py in documentation. (#3750 ) Rename `xray_test.py` to `mini_test.py` and use that in the documentation. Right now we suggest that people run `runtest.py`, but that often doesn't succeed and takes too long.	2019-01-12 14:05:28 -08:00
Robert Nishihara	8723d6b061	Define a Node class to manage Ray processes. (#3733 ) * Implement Node class and move most of services.py into it. * Wait for nodes as they are added to the cluster. * Fix Redis authentication bug. * Fix bug in client table ordering. * Address comments. * Kill raylet before plasma store in test. * Minor	2019-01-11 22:30:38 -08:00
Hao Chen	597abb24ea	Refine multi-threading support (#3672 ) * [Python] refine multi-threading support fix * [java] refine multithreading code fix java * format	2019-01-10 13:58:11 -08:00
Stephanie Wang	04f31db54d	Actor dummy object garbage collection (#3593 ) * Convert UniqueID::nil() to a constructor * Cleanup actor handle pickling code * Add new actor handles to the task spec * Pass in new actor handles * Add new handles to the actor registration * Regression test for actor handle forking and GC * lint and doc * Handle pickled actor handles in the backend and some refactoring * Add regression test for dummy object GC and pickled actor handles * Check for duplicate actor tasks on submission * Regression test for forking twice, fix failed named actor leak * Fix bug for forking twice * lint * Revert "Fix bug for forking twice" This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac. * Add new actor handles when task is assigned, not finished * Remove comment * remove UniqueID() * Updates * update * fix * fix java * fixes * fix	2019-01-09 10:37:11 -08:00
Robert Nishihara	d1e21b702e	Change timeout from milliseconds to seconds in ray.wait. (#3706 ) * Change timeout from milliseconds to seconds in ray.wait. * Suppress warning. * Suppress warning. * Add prominent warning in API documentation.	2019-01-08 21:32:08 -08:00
Peter Schafhalter	5945b92fd3	[sgd] Add checkpointing (#3638 )	2019-01-08 15:29:30 -08:00
Robert Nishihara	5e76d52868	Improve cluster.wait_for_nodes() API. (#3712 ) * Separate out functionality for querying client table and improve cluster.wait_for_nodes() API. * Linting * Add back logging statements. * info -> debug	2019-01-07 21:26:58 -08:00
Robert Nishihara	c9d70f0dda	Remove num_local_schedulers argument from ray.worker._init. (#3704 ) * Remove num_local_schedulers argument from ray.worker._init. * Fix * Fix tests.	2019-01-07 12:44:49 -08:00
Hao Chen	df0733cafb	Skip test_multiple_recursive (#3683 ) This test often hangs or fails in CI. Skip it for now to unblock other PRs.	2019-01-06 13:24:29 -08:00
mattearllongshot	681e8cd3fd	[autoscaler] Add an initial_workers option (#3530 ) ## What do these changes do? This option goes along with `min_workers`, and `max_workers`. When the cluster is first brought up (or when it is refreshed with a subsequent `ray up`) this number of nodes will be started. It's a workaround for issues of scaling (see related issues) where it can take a long time (or forever in the case where the head node has `--num-cpus 0`) to scale up a cluster in response to increasing demand. ## Related issue number Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106	2019-01-05 17:58:42 -08:00
Robert Nishihara	067976ad3d	Push a warning to all users when large number of workers have been started. (#3645 ) * Push a warning to all users when large number of workers have been started. * Add test. * Fix bug. * Give warning when worker starts instead of when worker registers. * Fix * Fix tests	2019-01-05 13:27:32 -08:00
Robert Nishihara	586a5c9ffa	Limit default redis max memory to 10GB. (#3630 ) * Limit Redis max memory to 10GB/shard by default. * Update stress tests. * Reorganize * Update * Add minimum cap size for object store and redis. * Small test update.	2019-01-03 13:23:54 -08:00
Eric Liang	47d36d7bd6	[rllib] Refactor pytorch custom model support (#3634 )	2019-01-03 13:48:33 +08:00
Yuhong Guo	c9b8ecca51	Add RayParams to refactor the parameters used by ray python. (#3558 )	2018-12-29 22:04:27 +08:00
Richard Liaw	aad3c50e2d	[tune] Cluster Fault Tolerance (#3309 ) This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes. Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.	2018-12-29 11:42:25 +08:00
Hao Chen	62af2f25be	Fix test_multiple_actor_reconstruction failure (#3641 ) * Fix test_multiple_actor_reconstruction failure * add comment	2018-12-27 13:57:52 -08:00
Yuhong Guo	1b98fb8238	Fix Jenkins test failures and function descriptor bug. (#3569 ) ## What do these changes do? 1. Fix the Jenkins test failure by adding the driver id to Actor GCS Key. 2. Move `object_manager_test.py` from Jenkins to Travis.	2018-12-25 23:31:44 -08:00
Robert Nishihara	5426234cd8	Update documentation to reflect 0.6.1 release. (#3622 )	2018-12-24 11:10:04 -08:00
Eric Liang	303883a3b6	[rllib] [rfc] add contrib module and guideline for merging (#3565 ) This adds guidelines for merging code into `rllib/contrib` vs `rllib/agents`. Also, clean up the agent import code to make registration easier.	2018-12-20 10:44:34 -08:00
Eric Liang	ffa6ee3ec8	[rllib] streaming minibatching for IMPALA (#3402 ) * mb impala * fix * paropt * update * cpu warn * on cpu * fix mb * doc * docs * comment * larger num * early release * remove grad clip * only check loader count in multi gpu mode * revert bad multigpu changes * num sgd iter * comment * reuse optimizer * add test * par load test * loosen test * Update run_multi_node_tests.sh * fix local mode * Update agent.py	2018-12-19 02:23:29 -08:00
Yuhong Guo	75ddf7cca4	Fix 2 small bugs (#3573 )	2018-12-18 14:52:21 -05:00
Eric Liang	db0dee573e	[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548 )	2018-12-18 10:40:01 -08:00
Philipp Moritz	b3bf608608	Update arrow to reduce plasma IPCs. (#3497 )	2018-12-14 23:49:37 -05:00
Stephanie Wang	fcc37021b2	Throw exception for `ray.get` of an evicted actor object (#3490 ) * Add a flag for whether an object has been created before * Add regression test * doc * Share object directory between object and node managers * Treat evicted actor tasks as failed * minor * Check return value * Fix bug where object locations weren't getting updated on client death * Fix mac build * Use RayTaskError	2018-12-14 11:41:27 -08:00
Yuhong Guo	a4abe6c0fe	Add test to test raylet client connection when raylet crashes. (#3518 )	2018-12-13 23:40:50 -08:00
Hao Chen	e7b51cbd1b	[xray] Implement Actor Reconstruction (#3332 ) * Implement Actor Reconstruction * fix * fix actor handle __del__ * fix lint * add comment * Remove actorCreationDummyObjectId * address comments * fix * address comments * avoid copy * change log to debug * fix error name	2018-12-13 21:28:58 -08:00
Si-Yuan	84fae57ab5	Convert the raylet client (the code in local_scheduler_client.cc) to proper C++. (#3511 ) * refactoring * fix bugs * create client class * create client class for java; bug fix * remove legacy code * improve code by using std::string, std::unique_ptr rename private fields and removing legacy code * rename class * improve naming * fix * rename files * fix names * change name * change return types * make a mutex private field * fix comments * fix bugs * lint * bug fix * bug fix * move too short functions into the header file * Loose crash conditions for some APIs. * Apply suggestions from code review Co-Authored-By: suquark <suquark@gmail.com> * format * update * rename python APIs * fix java * more fixes * change types of cpython interface * more fixes * improve error processing * improve error processing for java wrapper * lint * fix java * make fields const * use pointers for [out] parameters * fix java & error msg * fix resource leak, etc.	2018-12-13 13:39:10 -08:00
Eric Liang	0e00533ed4	Different approach to removing RayGetError (#3471 )	2018-12-12 20:30:51 -08:00
Eric Liang	32473cf22e	[rllib] Basic Offline Data IO API (#3473 )	2018-12-12 13:57:48 -08:00
Eric Liang	59f4743f20	[rllib] Run simple regressions tests for all algs in jenkins (#3498 )	2018-12-11 17:21:53 -08:00
Richard Liaw	e0fbb68e47	[tune] Custom Logging, Trial Name (#3465 ) Adds support for custom loggers, custom trial strings, and custom sync commands. Closes #3034, #2985, and #3390.	2018-12-11 13:41:59 -08:00
Yuhong Guo	abd781d607	Make stress test time shorter. (#3506 )	2018-12-10 14:46:40 -05:00
Eric Liang	ce388a45cf	[rllib] Learner should not see clipped actions (#3496 )	2018-12-09 21:57:11 -08:00
Eric Liang	cffe8f9806	Add option to evict keys LRU from the sharded redis tables (#3499 ) * wip * wip * format * wip * note * lint * fix * flag * typo * raise timeout * fix * optional get * fix flag * increase timeout in test * update docs * format	2018-12-09 05:48:52 -08:00
Yuhong Guo	b9e1977fae	Fix failure of test_free_objects_multi_node (#3481 ) It is possible that `test_free_objects_multi_node` would fail sometimes. If we run this test 20 times, we may find at least one failure. The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so flush operations may be spread across several workers and may not clean all the worker objects held by the plasma client. In this PR, I change the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.	2018-12-06 15:55:49 -05:00
shane	7a79b7f62c	increase container memory and shm to 20G (#3475 ) * increase container memory and shm to 20G * variables are POWERFUL	2018-12-05 14:59:07 -08:00
Si-Yuan	2e6f9bedf2	Add the extra fallback for serialization (#3468 ) * Add the extra fallback for serialization. * Better comments & warnings. quotes. * Update test/runtest.py Co-Authored-By: suquark <suquark@gmail.com> * Update test/runtest.py Co-Authored-By: suquark <suquark@gmail.com> * linting * Don't hijack too much errors. * simplify the test * Update runtest.py * simplify	2018-12-05 13:09:08 -08:00
Philipp Moritz	06f6431765	Make test_actor_multiple_gpus_from_multiple_tasks less stressful in travis	2018-12-04 17:44:33 -08:00
Eric Liang	13c8ce4d84	Update README.rst with 0.6.0 version number. (#3453 )	2018-12-01 19:16:45 -08:00
Eric Liang	07d8cbf414	[rllib] Support batch norm layers (#3369 ) * batch norm * lint * fix dqn/ddpg update ops * bn model * Update tf_policy_graph.py * Update multi_gpu_impl.py * Apply suggestions from code review Co-Authored-By: ericl <ekhliang@gmail.com>	2018-11-29 13:33:39 -08:00
Stephanie Wang	48a5935224	Fault tolerance for actor creation (#3422 ) * Add regression test * Request actor creation if no actor location found * Comments * Address comments * Increase test timeout * Trigger test	2018-11-29 10:48:35 -08:00
Robert Nishihara	82863b5251	[autoscaler] Update autoscaler to use heartbeat batches. (#3409 )	2018-11-27 23:46:27 -08:00
Eric Liang	f0df97db6f	[rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384 )	2018-11-27 23:35:19 -08:00
Robert Nishihara	20b8b1d891	Add script for running stress tests. (#3378 ) * Add script for running stress tests. * Add an actor tree test where actors die with some probability * Improve test. * Small fix * Update tests. * Minor change	2018-11-27 04:28:02 -08:00
Eric Liang	e3c088fa1e	[rllib] PPO doesn't work with fractional num gpus (#3396 ) * frac ppo * gpu test	2018-11-27 01:14:10 -08:00
Robert Nishihara	3856533065	Fix incompatibility with most recent version of Redis. (#3379 ) * Fix incompatibility with most recent version of Redis. * Fix * Fixes.	2018-11-24 16:36:38 -08:00
Eric Liang	55fca828ce	[rllib] Fix use_lstm option when using custom model with dict space (#3368 ) ## What do these changes do? This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation. ## Related issue number Closes https://github.com/ray-project/ray/issues/3367	2018-11-23 22:51:08 -08:00
Stephanie Wang	6b3236349c	Fix memory leak in lineage cache (#3366 ) * Move children_ map inside Lineage * Update lineage_cache.cc * Test and fixes * Remove unused	2018-11-21 16:18:39 -08:00
Richard Liaw	784a6399b0	[tune] Node Fault Tolerance (#3238 ) This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.	2018-11-21 12:38:16 -08:00
Stephanie Wang	3e33f6f71b	Fix failure handling for actor death (#3359 ) * Broadcast actor death, clean up dummy objects * Reduce logging and clean up state when failing a task * lint * Make actor failure test nicer, reduce node timeout	2018-11-21 12:26:22 -08:00

596 commits