Commit graph

2305 commits

Author SHA1 Message Date
Eric Liang
6bb1103930 [rllib] Avoid sample wastage with bad PPO configurations (#3552)
## What do these changes do?

Previously we logged a warning if the PPO configuration would waste many samples. However, this didn't apply in the case of long episodes in `complete_episodes` batch mode, and also the amount of waste is up to 2x in common cases.

This pr:
- Estimates the number of sampling tasks needed to avoid over-sampling.
- Collects all sample results and never discards any. In principle this can degrade performance at large scale if certain machines are slower. Add a config flag to enable this legacy behavior.

## Related issue number

Closes: https://github.com/ray-project/ray/issues/3549
2018-12-20 10:50:44 -08:00
Richard Liaw
ac48a58e4e
[tune] Reduce scope of variant generator (#3583)
This PR provides a better error message when the generate_variants code
breaks. Also removes a comment about nesting dependencies.

This comes mainly as a hotfix solution for #3466. We should leave that issue open for future contribution 🙂
2018-12-20 10:48:28 -08:00
Eric Liang
303883a3b6 [rllib] [rfc] add contrib module and guideline for merging (#3565)
This adds guidelines for merging code into `rllib/contrib` vs `rllib/agents`. Also, clean up the agent import code to make registration easier.
2018-12-20 10:44:34 -08:00
adoda
cf0c4745f4 [rllib] support running older version tensorflow(version < 1.5.0) (#3571) 2018-12-19 20:27:24 -08:00
Robert Nishihara
a5309bec7c Make README render properly on PyPI. (#3578)
* Make README render properly in pypi.

* Add small logo

* temporary fix

* smaller image

* Remove image size.

* Add author and email to setup.py.
2018-12-19 18:41:09 -08:00
Hao Chen
132a23354e Fix pending callback not called when ServerConnection destructs (#3572) 2018-12-19 17:29:36 -08:00
Eric Liang
ffa6ee3ec8
[rllib] streaming minibatching for IMPALA (#3402)
* mb impala

* fix

* paropt

* update

* cpu warn

* on cpu

* fix mb

* doc

* docs

* comment

* larger num

* early release

* remove grad clip

* only check loader count in multi gpu mode

* revert bad multigpu changes

* num sgd iter

* comment

* reuse optimizer

* add test

* par load test

* loosen test

* Update run_multi_node_tests.sh

* fix local mode

* Update agent.py
2018-12-19 02:23:29 -08:00
Alexey Tumanov
c4cba98c75 Remove deprecation warnings when running actor tests (#3563)
* remove deprecation warnings when running actor tests

* replacing logger.warn with logger.warning

* Update worker.py

* Update policy_client.py

* Update compression.py
2018-12-18 17:04:51 -08:00
Yuhong Guo
fb33fa9097 Enable function_descriptor in backend to replace the function_id (#3028) 2018-12-18 18:53:59 -05:00
Alexey Tumanov
3822b20319 [doc] update testing and dev instructions (#3562)
* [doc] update python testing command

* update installation/dev instructions
2018-12-18 14:45:24 -08:00
Stephanie Wang
26ca40817e Convert UniqueID::nil() to a constructor (#3564)
* Initialize UniqueID to nil

* Return reference to static const variable
2018-12-18 11:59:02 -08:00
Yuhong Guo
75ddf7cca4 Fix 2 small bugs (#3573) 2018-12-18 14:52:21 -05:00
Eric Liang
db0dee573e
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
YifengHuang
bc4aa85ea3 fix link in doc (#3567) 2018-12-18 00:10:55 -08:00
opherlieber
854b06854f remove auto-concat of rollouts in AsyncSampler (#3556)
* remove auto-concat of rollouts in AsyncSampler

* remote auto-concat test

* remove unused reference
2018-12-17 13:54:52 -08:00
Devin Petersohn
3833ba4e4b Bump modin version to 0.2.5 (#3553) 2018-12-17 14:36:47 -05:00
Tianming Xu
7767aba637 Note requirement cython==0.29.0 in installation instructions (#3555) 2018-12-17 20:43:47 +08:00
Robert Nishihara
417c7f2d6f Update arrow and remove plasma_manager references. (#3545) 2018-12-15 23:36:02 -08:00
Philipp Moritz
b3bf608608 Update arrow to reduce plasma IPCs. (#3497) 2018-12-14 23:49:37 -05:00
Stephanie Wang
fcc37021b2
Throw exception for ray.get of an evicted actor object (#3490)
* Add a flag for whether an object has been created before

* Add regression test

* doc

* Share object directory between object and node managers

* Treat evicted actor tasks as failed

* minor

* Check return value

* Fix bug where object locations weren't getting updated on client death

* Fix mac build

* Use RayTaskError
2018-12-14 11:41:27 -08:00
bibabolynn
7fd24e384b [java] Pass large args by reference (#3504) 2018-12-14 23:32:35 +08:00
Richard Liaw
de3fdeb5b5
[autoscaler] Fix Error Handling for botocore (#3534)
Unfortunately Boto generates error classes dynamically, so this catches
the expected error and raises the error if it is the wrong class.

Closes #3533.
2018-12-14 00:20:49 -08:00
Yuhong Guo
2a4685a08b Add a script to collect built thirdparty libs to avoid download and building again. (#3521) 2018-12-13 23:56:40 -08:00
Yuhong Guo
a4abe6c0fe Add test to test raylet client connection when raylet crashes. (#3518) 2018-12-13 23:40:50 -08:00
Hao Chen
e7b51cbd1b [xray] Implement Actor Reconstruction (#3332)
* Implement Actor Reconstruction

* fix

* fix actor handle __del__

* fix lint

* add comment

* Remove actorCreationDummyObjectId

* address comments

* fix

* address comments

* avoid copy

* change log to debug

* fix error name
2018-12-13 21:28:58 -08:00
Alexey Tumanov
2455de78ce save initial config instead of initial resource config (#3532) 2018-12-13 20:39:42 -08:00
Si-Yuan
84fae57ab5 Convert the raylet client (the code in local_scheduler_client.cc) to proper C++. (#3511)
* refactoring

* fix bugs

* create client class

* create client class for java; bug fix

* remove legacy code

* improve code by using std::string, std::unique_ptr rename private fields and removing legacy code

* rename class

* improve naming

* fix

* rename files

* fix names

* change name

* change return types

* make a mutex private field

* fix comments

* fix bugs

* lint

* bug fix

* bug fix

* move too short functions into the header file

* Loose crash conditions for some APIs.

* Apply suggestions from code review

Co-Authored-By: suquark <suquark@gmail.com>

* format

* update

* rename python APIs

* fix java

* more fixes

* change types of cpython interface

* more fixes

* improve error processing

* improve error processing for java wrapper

* lint

* fix java

* make fields const

* use pointers for [out] parameters

* fix java & error msg

* fix resource leak, etc.
2018-12-13 13:39:10 -08:00
Chunyang Wen
5dcc333199 [sgd] Modify: add interface for model (#3458)
* Modify: add interface for model

* Modify: remove single quota and build; add metrics

* Modify: flatten into list of dict

* Update distributed_sgd.rst

* Modify: update format with scripts/format.sh

* Update sgd_worker.py
2018-12-12 21:23:25 -08:00
Eric Liang
0e00533ed4
Different approach to removing RayGetError (#3471) 2018-12-12 20:30:51 -08:00
Eric Liang
20c7fad4f4
Move actor table to primary redis context 2018-12-12 16:51:29 -08:00
Eric Liang
32473cf22e
[rllib] Basic Offline Data IO API (#3473) 2018-12-12 13:57:48 -08:00
Richard Liaw
cc8f7db246
[docs] Improve cluster/docker docs (#3517)
- Surfaces local cluster usage
 - Increases visability of these instructions
 - Removes some docker docs (that are really out of scope for Ray
 documentation IMO)

Closes #3517.
2018-12-12 10:40:54 -08:00
Eric Liang
5f4a9cc713 [rllib] Rollout should preprocess observations; some cleanups (#3512)
<!--
Thank you for your contribution!

Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request.
-->

## What do these changes do?

From https://groups.google.com/forum/#!topic/ray-dev/u-gybKK6-Ns
2018-12-11 20:16:38 -08:00
Eric Liang
59f4743f20
[rllib] Run simple regressions tests for all algs in jenkins (#3498) 2018-12-11 17:21:53 -08:00
Richard Liaw
e0fbb68e47
[tune] Custom Logging, Trial Name (#3465)
Adds support for custom loggers, custom trial strings, and custom sync commands. Closes #3034, #2985, and #3390.
2018-12-11 13:41:59 -08:00
Robert Nishihara
74c3370bd5 Show slowest tests in travis. (#3507) 2018-12-11 11:25:04 -08:00
Eric Liang
52df4dfc6f
[rllib] Fix multiagent_two_trainer test (#3509)
* update

* fix

* dict ordre

* fix

* fix
2018-12-11 00:16:39 -08:00
Richard Liaw
1f4a01cff6 [tune] Fix PyTorch example after PyTorch v1 (#3500)
* [tune]

* fix

* lint

* fix
2018-12-10 12:00:53 -08:00
Eric Liang
962f18756b [autoscaler] Use fixed timestamp to check against health timeouts (#3503) 2018-12-10 14:58:27 -05:00
Yuhong Guo
abd781d607 Make stress test time shorter. (#3506) 2018-12-10 14:46:40 -05:00
Eric Liang
ce388a45cf
[rllib] Learner should not see clipped actions (#3496) 2018-12-09 21:57:11 -08:00
Philipp Moritz
87c0d24579
[sgd] Add file lock to protect compilation of sgd op (#3486)
* add file lock to protect compilation of sgd op

* lint

* update

* fix

* fix

* lint

* update

* rebase on arrow

* Update sgd_worker.py
2018-12-09 13:52:40 -08:00
Eric Liang
cffe8f9806 Add option to evict keys LRU from the sharded redis tables (#3499)
* wip

* wip

* format

* wip

* note

* lint

* fix

* flag

* typo

* raise timeout

* fix

* optional get

* fix flag

* increase timeout in test

* update docs

* format
2018-12-09 05:48:52 -08:00
Yuhong Guo
0136af5aac Add return value for recontruction RPC. (#3493)
* Add return value for recontruct RPC.

* Fix comment function name
2018-12-09 00:08:44 -08:00
Eric Liang
7aec357501
[rllib] Multi-GPU support for Multi-Agent PPO (#3479)
* wip

* fix

* remove check

* fix null

* revert

* lint and kl

* also fix rollout
2018-12-08 18:02:33 -08:00
Eric Liang
8b5827b9da
[rllib] Better document which methods are abstract and which ones are overrides (#3480) 2018-12-08 16:28:58 -08:00
Eric Liang
462e6ef066
[rllib] Use smoothed version of collect metrics for DQN (#3491)
* fix

* lint
2018-12-07 18:36:23 -08:00
Tianming Xu
f6490f9bef Resolve no handlers could be found for logger 'ray.worker' when importing ray (#3483) 2018-12-06 20:46:53 -08:00
Eric Liang
8395523f81
[rllib] Copy data before passing to Ape-X learner thread (fixes transient plasma crashes) (#3484) 2018-12-06 18:01:11 -08:00
Si-Yuan
c2c501bbe6 Experimental asyncio support (#2015)
* Init commit for async plasma client

* Create an eventloop model for ray/plasma

* Implement a poll-like selector base on `ray.wait`. Huge improvements.

* Allow choosing workers & selectors

* remove original design

* initial implementation of epoll-like selector for plasma

* Add a param for `worker` used in `PlasmaSelectorEventLoop`

* Allow accepting a `Future` which returns object_id

* Do not need `io.py` anymore

* Create a basic testing model

* fix: `ray.wait` returns tuple of lists

* fix a few bugs

* improving performance & bug fixing

* add test

* several improvements & fixing

* fix relative import

* [async] change code format, remove old files

* [async] Create context wrapper for the eventloop

* [async] fix: context should return a value

* [async] Implement futures grouping

* [async] Fix bugs & replace old functions

* [async] Fix bugs found in tests

* [async] Implement `PlasmaEpoll`

* [async] Make test faster, add tests for epoll

* [async] Fix code format

* [async] Add comments for main code.

* [async] Fix import path.

* [async] Fix test.

* [async] Compatibility.

* [async] less verbose to not annoy the CI.

* [async] Add test for new API

* [async] Allow showing debug info in some of the test.

* [async] Fix test.

* [async] Proper shutdown.

* [async] Lint~

* [async] Move files to experimental and create API

* [async] Use async/await syntax

* [async] Fix names & styles

* [async] comments

* [async] bug fixing & use pytest

* [async] bug fixing & change tests

* [async] use logger

* [async] add tests

* [async] lint

* [async] type checking

* [async] add more tests

* [async] fix bugs on waiting a future while timeout. Add more docs.

* [async] Formal docs.

* [async] Add typing info since these codes are compatible with py3.5+.

* [async] Documents.

* [async] Lint.

* [async] Fix deprecated call.

* [async] Fix deprecated call.

* [async] Implement a more reasonable way for dealing with pending inputs.

* [async] Fix docs

* [async] Lint

* [async] Fix bug: Type for time

* [async] Set our eventloop as the default eventloop so that we can get it through `asyncio.get_event_loop()`.

* [async] Update test & docs.

* [async] Lint.

* [async] Temporarily print more debug info.

* [async] Use `Poll` as a default option.

* [async] Limit resources.

* new async implementation for Ray

* implement linked list

* bug fix

* update

* support seamless async operations

* update

* update API

* fix tests

* lint

* bug fix

* refactor names

* improve doc

* properly shutdown async_api

* doc

* Change the table on the index page.

* Adjust table size.

* Only keeps `as_future`.

* change how we init connection

* init connection in `ray.worker.connect`

* doc

* fix

* Move initialization code into the module.

* Fix docs & code

* Update pyarrow version.

* lint

* Restore index.rst

* Add known issues.

* Apply suggestions from code review

Co-Authored-By: suquark <suquark@gmail.com>

* rename

* Update async_api.rst

* Update async_api.py

* Update async_api.rst

* Update async_api.py

* Update worker.py

* Update async_api.rst

* fix tests

* lint

* lint

* replace the magic number
2018-12-06 17:39:05 -08:00