Yuhong Guo
b9e1977fae
Fix failure of test_free_objects_multi_node ( #3481 )
...
`test_free_objects_multi_node` fails intermittently. If we run this test 20 times, we may see at least one failure.
The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and fail to clean up all the worker objects held by the plasma client.
This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in a single worker of a raylet.
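The flakiness described above can be illustrated with a toy model. This is only a sketch: `Worker`, `ToyRaylet`, and `leftover_after_flush` are hypothetical names, not Ray APIs, and round-robin dispatch merely stands in for the scheduler placing function tasks on whichever worker it picks.

```python
class Worker:
    """Toy stand-in for a worker process holding object references."""

    def __init__(self):
        self.held = set()

    def create(self, obj_id):
        self.held.add(obj_id)

    def flush(self):
        # A flush only cleans up the worker it runs on.
        self.held.clear()


class ToyRaylet:
    """Hypothetical model of one raylet with several workers."""

    def __init__(self, num_workers):
        self.workers = [Worker() for _ in range(num_workers)]
        self._next = 0

    def run_function_task(self, method, *args):
        # Assumption: round-robin models "any worker may execute the task".
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        getattr(worker, method)(*args)

    def run_actor_task(self, method, *args):
        # An actor stays on one worker, so every call lands in the same place.
        getattr(self.workers[0], method)(*args)

    def objects_still_held(self):
        return sum(len(w.held) for w in self.workers)


def leftover_after_flush(use_actor):
    """Create four objects, flush once, and count what is still held."""
    raylet = ToyRaylet(num_workers=2)
    run = raylet.run_actor_task if use_actor else raylet.run_function_task
    for obj_id in range(4):
        run("create", obj_id)
    run("flush")
    return raylet.objects_still_held()
```

In this model the function-task variant leaves objects behind on the worker that never ran the flush, while the actor variant leaves none.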
2018-12-06 15:55:49 -05:00
Eric Liang
412aaa5195
[tune] Deprecate ambiguous function values (use tune.function / tune.sample_from instead) ( #3457 )
...
* wip
* exclude
2018-12-06 11:35:20 -08:00
Eric Liang
d864f299d7
[rllib] fixes from dogfooding multi-agent ( #3456 )
...
auto wrap multi-agent dict and tuple spaces by keeping a policy -> preprocessor in the sampler
add some Q-learning debug stats
report min, max of custom metrics
better errors
2018-12-05 23:31:45 -08:00
shane
7a79b7f62c
increase container memory and shm to 20G ( #3475 )
...
* increase container memory and shm to 20G
* variables are POWERFUL
2018-12-05 14:59:07 -08:00
Si-Yuan
2e6f9bedf2
Add the extra fallback for serialization ( #3468 )
...
* Add the extra fallback for serialization.
* Better comments & warnings. quotes.
* Update test/runtest.py
Co-Authored-By: suquark <suquark@gmail.com>
* Update test/runtest.py
Co-Authored-By: suquark <suquark@gmail.com>
* linting
* Don't hijack too many errors.
* simplify the test
* Update runtest.py
* simplify
2018-12-05 13:09:08 -08:00
Philipp Moritz
06f6431765
Make test_actor_multiple_gpus_from_multiple_tasks less stressful in travis
2018-12-04 17:44:33 -08:00
Eric Liang
93a9d32288
[docs] Switch docs to use rllib train instead of train.py
2018-12-04 17:36:06 -08:00
Richard Liaw
9d0bd50e78
[tune] Component notification on node failure + Tests ( #3414 )
...
Changes include:
- Notify Components on Requeue
- Slight refactoring of Node Failure handling
- Better tests
2018-12-04 14:47:31 -08:00
Eric Liang
ce355d13d4
[rllib] Allow envs to be auto-registered; add on_train_result callback with curriculum example ( #3451 )
...
* train step and docs
* debug
* doc
* doc
* fix examples
* fix code
* integration test
* fix
* ...
* space
* instance
* Update .travis.yml
* fix test
2018-12-03 23:15:43 -08:00
Kristian Hartikainen
be6567e6fd
Tweak/exec attach info ( #3447 )
...
* Add custom cluster name to exec info
* Update submit info to match exec info
2018-12-03 21:39:43 -08:00
Eric Liang
d8205976e8
[rllib] Auto clip actions to Box space range; deprecate squash_to_range ( #3426 )
...
* fix clip
* tweak wording
* remove squash entirely
* Update rllib-models.rst
* fix argument order
* Apply suggestions from code review
Co-Authored-By: ericl <ekhliang@gmail.com>
2018-12-03 19:55:25 -08:00
Eric Liang
7abfbfd2f7
[rllib] Better error message for unsupported non-atari image observation sizes ( #3444 )
2018-12-03 01:24:36 -08:00
Stephanie Wang
4abafd7e62
Fix bug in ray.wait ( #3445 )
...
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:
1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
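The sequence above can be reproduced in a small Python model. This is a sketch under assumptions: the raylet is written in C++, the commit log does not show the actual fix, and `ToyWaitManager` with its methods is a hypothetical name. It shows one plausible way to tolerate a late callback: look the request up first and ignore the callback if the wait has already completed.

```python
class ToyWaitManager:
    """Hypothetical model of wait requests driven by location callbacks."""

    def __init__(self):
        self.requests = {}   # wait_id -> {"found": [...], "num_returns": int}
        self.completed = []  # (wait_id, found_objects) in completion order

    def wait(self, wait_id, object_ids, num_returns):
        # object_ids is kept only to mirror the ray.wait call shape.
        self.requests[wait_id] = {
            "objects": list(object_ids),
            "found": [],
            "num_returns": num_returns,
        }

    def on_location_callback(self, wait_id, object_id):
        """Return True if the callback was used, False if it arrived late."""
        request = self.requests.get(wait_id)
        if request is None:
            # The wait already completed and was removed: ignore the
            # callback instead of treating it as a fatal error.
            return False
        request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            self.completed.append((wait_id, list(request["found"])))
            del self.requests[wait_id]
        return True
```

Replaying steps 2 to 5 with this model, the callback for A completes the wait and the callback for B is dropped without an error.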
2018-12-01 19:40:33 -08:00
Eric Liang
13c8ce4d84
Update README.rst with 0.6.0 version number. ( #3453 )
2018-12-01 19:16:45 -08:00
Philipp Moritz
c5b5cdae33
Upgrade Arrow to include Plasma TensorFlow Op release fix ( #3448 )
...
This includes a fix so the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061 ) and the possibility to store arrow data structures in plasma (https://github.com/apache/arrow/pull/2832 ).
https://github.com/ray-project/ray/issues/3404
2018-12-01 16:15:09 -08:00
Hao Chen
abd37df41e
Add stress test for Java worker ( #3424 )
2018-12-01 16:11:09 -08:00
Robert Nishihara
0603e0b73a
Bump version from 0.5.3 to 0.6.0. ( #3420 )
2018-12-01 11:39:36 -08:00
Devin Petersohn
57512616e1
Update readme to contain logo ( #3443 )
...
* Adding logo to readme
* Updating link
* Add badge
* Addressing comments
* Moving logo
* Change align
* Move image
2018-11-30 18:28:35 -08:00
GiliR4t1qbit
454d3aa07d
[docs] Snippet did not have a code-block tag above it ( #3442 )
2018-11-30 16:39:40 -08:00
Stephanie Wang
447604a9fe
Use actor ID for the dummy object ( #3437 )
2018-11-29 22:31:04 -08:00
Eric Liang
07d8cbf414
[rllib] Support batch norm layers ( #3369 )
...
* batch norm
* lint
* fix dqn/ddpg update ops
* bn model
* Update tf_policy_graph.py
* Update multi_gpu_impl.py
* Apply suggestions from code review
Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Devin Petersohn
4d2010a852
Ship Modin with Ray. ( #3109 )
2018-11-29 20:05:24 +01:00
Stephanie Wang
48a5935224
Fault tolerance for actor creation ( #3422 )
...
* Add regression test
* Request actor creation if no actor location found
* Comments
* Address comments
* Increase test timeout
* Trigger test
2018-11-29 10:48:35 -08:00
Chunyang Wen
fd7e494344
Remove: duplicate feed_dict constructing ( #3431 )
2018-11-29 10:21:46 -08:00
Kristian Hartikainen
7e319dbf0c
Automatically indent tune logger params ( #3399 )
2018-11-29 00:15:50 -08:00
Eric Liang
c46ea2ff4b
Click 0.7 changes the naming convention for commands; fix this
2018-11-28 14:59:58 -08:00
Tianming Xu
139fbf7884
Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory ( #3403 )
2018-11-27 23:51:18 -08:00
Robert Nishihara
82863b5251
[autoscaler] Update autoscaler to use heartbeat batches. ( #3409 )
2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f
[rllib] example and docs on how to use parametric actions with DQN / PG algorithms ( #3384 )
2018-11-27 23:35:19 -08:00
Eric Liang
c2108ca64f
Don't put entire actor registry in debug string since it's too long ( #3395 )
2018-11-27 16:48:12 -08:00
Eric Liang
0d56fc10cc
Move setproctitle to ray[debug] package ( #3415 )
2018-11-27 09:50:59 -08:00
Robert Nishihara
20b8b1d891
Add script for running stress tests. ( #3378 )
...
* Add script for running stress tests.
* Add an actor tree test where actors die with some probability
* Improve test.
* Small fix
* Update tests.
* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e
[rllib] PPO doesn't work with fractional num gpus ( #3396 )
...
* frac ppo
* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50
[autoscaler] Allow more than 5s from node creation to first heartbeat ( #3385 )
2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90
UI changes, fix the task timeline and add the object transfer timeline to UI. ( #3397 )
...
* Saving
* Fix cmake and remove object/task search boxes.
* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3
[rllib] Refactor the sampler ( #3387 )
...
* refactor
* fix test
* add perf test
* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065
Fix incompatibility with most recent version of Redis. ( #3379 )
...
* Fix incompatibility with most recent version of Redis.
* Fix
* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb
[rllib] Clip DDPG ou-noise to avoid exceeding action bounds ( #3386 )
...
Closes #2965
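As a rough illustration of the idea in the title, the sketch below adds Ornstein-Uhlenbeck noise to a scalar action and clips the result to the action bounds. It is not RLlib's code: the function names are hypothetical, and the `theta` and `sigma` defaults are illustrative values chosen for the example.

```python
import random


def ou_noise_steps(n, theta=0.15, sigma=0.2, seed=0):
    """Yield n samples of a zero-mean Ornstein-Uhlenbeck process."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(n):
        # Mean-reverting drift toward 0 plus Gaussian perturbation.
        state += theta * (0.0 - state) + sigma * rng.gauss(0.0, 1.0)
        yield state


def noisy_action(action, noise, low, high):
    """Add exploration noise, then clip so the action stays in [low, high]."""
    return min(max(action + noise, low), high)


def explore(actions, low, high, **noise_kwargs):
    """Apply clipped OU noise to a sequence of scalar actions."""
    noise = ou_noise_steps(len(actions), **noise_kwargs)
    return [noisy_action(a, n, low, high) for a, n in zip(actions, noise)]
```

Without the clipping step, an action near a bound plus positive noise could leave the valid range.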
2018-11-24 00:56:50 -08:00
Eric Liang
55fca828ce
[rllib] Fix use_lstm option when using custom model with dict space ( #3368 )
...
## What do these changes do?
This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.
## Related issue number
Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c
[rllib] docs for td3 ( #3381 )
...
* td3 doc
* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09
fix py3 ( #3382 )
2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74
When getting a role/profile, catch only exception that indicates the role/profile already exists, allow others to be raised ( #3383 )
2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76
Enable Twin Delayed DDPG for RLlib DDPG agent ( #3353 )
2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache ( #3366 )
...
* Move children_ map inside Lineage
* Update lineage_cache.cc
* Test and fixes
* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance ( #3238 )
...
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
  Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239 )
Closes #2851 .
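The new behavior can be sketched as a small state transition. This is a simplified model, not Tune's implementation: trials are plain dicts, `free_slots` is a hypothetical map from node name to the number of trials that node can still accept, and scheduler or search-algorithm notification is left out.

```python
def handle_node_failure(trials, failed_node, free_slots):
    """Resume RUNNING trials from a failed node elsewhere, or requeue them.

    trials: list of dicts with "name", "status", and "node" keys.
    free_slots: dict mapping node name -> remaining trial capacity.
    """
    # The failed node can no longer host anything.
    available = {n: s for n, s in free_slots.items() if n != failed_node}
    for trial in trials:
        if trial["status"] != "RUNNING" or trial["node"] != failed_node:
            continue
        target = next((n for n, s in available.items() if s > 0), None)
        if target is None:
            # Cluster is saturated: the trial goes back to the queue.
            trial["status"] = "PENDING"
            trial["node"] = None
        else:
            # Best effort: resume on a node that still has capacity.
            available[target] -= 1
            trial["node"] = target
    return trials
```

For example, if node `a` fails while running two trials and node `b` has room for one, the first trial resumes on `b` and the second becomes PENDING.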
2018-11-21 12:38:16 -08:00
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death ( #3359 )
...
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c
Fix $MACOSX_DEPLOYMENT_TARGET ( #3337 )
2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951
Remove uses of std::list::size ( #3358 )
...
* worker pool and client conn
* Fix linting
* unordered set
* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1
[autoscaler] Submit command ( #3312 )
2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient ( #3324 )
...
* put queues outside
* working version, still needs to be optimized
* implement round robin
* proper round robin
* fix spillback
* update
* fix
* cleanup
* more cleanups
* fix
* fix
* add documentation
* explanation for hash combiner
* speed it up
* cleanup and linting
* linting
* comments
* Update scheduling_queue.h
* temp commit
* fixes
* update
* fix
* cleanup
* cleanup
* lint
* more prints
* more prints
* increase sleep
* documentation
* sleep
* fix
* fix
* sleep longer
* update
* fix
* fix
* fix
* Add ordered_set container.
* Fix
* Linting
* Constructors
* Remove O(n) call to list.size().
* fixes
* use ordered set
* Fix.
* Add documentation.
* Add iterators to ordered_set container implementation.
* iterator_type -> iterator
* Make typedefs private
* Add const_iterator
* fix
* fix test
* linting
* lint
* update
* add documentation
* linting
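The commit adds an `ordered_set` container in C++ and removes an O(n) call to `list.size()`. The log does not show the container's interface, so the following is only a Python analogue with hypothetical method names: a set that keeps insertion order while offering constant-time insert, removal, membership test, and size.

```python
class OrderedSet:
    """Insertion-ordered set with O(1) add, discard, membership, and len."""

    def __init__(self):
        # Python dicts preserve insertion order, so the keys form the set.
        self._items = {}

    def add(self, item):
        # Re-adding an existing item keeps its original position.
        self._items.setdefault(item, None)

    def discard(self, item):
        self._items.pop(item, None)

    def pop_front(self):
        """Remove and return the oldest item."""
        if not self._items:
            raise KeyError("pop from an empty OrderedSet")
        item = next(iter(self._items))
        del self._items[item]
        return item

    def __contains__(self, item):
        return item in self._items

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)
```

A ready queue built on such a structure can dispatch tasks in arrival order and still remove an arbitrary task by ID without scanning the queue.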
2018-11-20 13:14:12 -08:00