Commit graph

1211 commits

Author SHA1 Message Date
Robert Nishihara
c9d70f0dda Remove num_local_schedulers argument from ray.worker._init. (#3704)
* Remove num_local_schedulers argument from ray.worker._init.

* Fix

* Fix tests.
2019-01-07 12:44:49 -08:00
Eric Liang
e78562b2e8
[rllib] Misc fixes: set lr for PG, better error message for LSTM/PPO, fix multi-agent/APEX (#3697)
* fix

* update test

* better error

* compute

* eps fix

* add get_policy() api

* Update agent.py

* better err msg

* fix

* pass in rew
2019-01-06 19:37:35 -08:00
Richard Liaw
8934e37a78
[tune] Change log handling for Tune (#3661)
Also provides a small retry mechanism for a transient error as reported
by #3340.

Closes #3653.
2019-01-06 13:20:10 -08:00
mattearllongshot
681e8cd3fd [autoscaler] Add an initial_workers option (#3530)
## What do these changes do?

    This option goes along with `min_workers`, and `max_workers`.  When the
    cluster is first brought up (or when it is refreshed with a subsequent
    `ray up`) this number of nodes will be started.
    
    It's a workaround for issues of scaling (see related issues) where it
    can take a long time (or forever in the case where the head node has
    `--num-cpus 0`) to scale up a cluster in response to increasing demand.


## Related issue number

Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106
2019-01-05 17:58:42 -08:00
Robert Nishihara
067976ad3d Push a warning to all users when large number of workers have been started. (#3645)
* Push a warning to all users when large number of workers have been started.

* Add test.

* Fix bug.

* Give warning when worker starts instead of when worker registers.

* Fix

* Fix tests
2019-01-05 13:27:32 -08:00
Eric Liang
03fe760616
[rllib] Model self loss isn't included in all algorithms (#3679) 2019-01-04 22:30:35 -08:00
Richard Liaw
960a943503
[tune] Fault Tolerance: handle lost checkpoints by restart (#3657)
Checks that node failure with lost checkpoints does not crash. Also adds test.
2019-01-04 22:05:27 -08:00
Eric Liang
7db1f3be2a [tune] resume=False by default but print a tip to set resume="prompt" + jenkins fix (#3681) 2019-01-04 17:23:19 -08:00
Kristian Hartikainen
747b117929 [tune] Tweak/allow nested pbt mutations (#3455)
* Fix warning text in pbt logger

* Allow nested mutations in pbt by recursing explore function

* Add test for nested pbt mutation

* Update pbt explore to only call custom explore on top level

* fix test
2019-01-04 13:51:11 -08:00
Robert Nishihara
cd80891ddb Try to figure out the memory limit in a docker container. (#3605)
* Try to figure out the memory limit in a docker container.

* Update comment

* Fix

* Fix
2019-01-03 23:07:24 -08:00
Robert Nishihara
586a5c9ffa Limit default redis max memory to 10GB. (#3630)
* Limit Redis max memory to 10GB/shard by default.

* Update stress tests.

* Reorganize

* Update

* Add minimum cap size for object store and redis.

* Small test update.
2019-01-03 13:23:54 -08:00
Yuhong Guo
4b23a34c93 Fix multi-thread problem of function manager and Jenkins test (#3648) 2019-01-03 17:05:13 +08:00
Eric Liang
ca864faece
[rllib] Documentation for I/O API and multi-agent support / cleanup (#3650) 2019-01-03 15:15:36 +08:00
opherlieber
2177e2f410 [rllib] Agent: Allow unknown subkeys for custom_resources_per_worker (#3639)
* RLLib Agent: Allow unknown subkeys for custom_resources_per_worker

* Update agent.py
2019-01-03 14:19:59 +08:00
Eric Liang
47d36d7bd6
[rllib] Refactor pytorch custom model support (#3634) 2019-01-03 13:48:33 +08:00
Robert Nishihara
b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00
Si-Yuan
93d54110f8 Prevent overriding faulthandler settings (#3668)
This change ensures that Ray set up fault handlers only if it has not been enabled by other applications. Otherwise some applications could face strange issues when using Ray, and some unittests using xml runners will fail.
2018-12-31 16:36:26 -08:00
Yuhong Guo
c9b8ecca51 Add RayParams to refactor the parameters used by ray python. (#3558) 2018-12-29 22:04:27 +08:00
Devin Petersohn
eb1e5fa2cf Fixing Python2 compatibility issues. Adding inline docs (#3656) 2018-12-28 22:53:28 -08:00
Richard Liaw
aad3c50e2d
[tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Richard Liaw
ac792d70c8
[rllib] Add starcraft multiagent env as example (#3542) 2018-12-27 10:00:32 +08:00
Tianming Xu
b4f61dfd50 [rllib] Export policy model checkpoint (#3637)
* Export policy model checkpoint

* update comment
2018-12-27 08:43:06 +09:00
Richard Liaw
6e2d7a9ba1 [tune] Support Configuration Merging (#3584)
* merge configs

* deep merge

* lint

* add resolve

* test
2018-12-26 20:07:11 +09:00
Stan Wang
4ce3818be5 Average aggregated gradients before put in plasma store (#3631) 2018-12-26 20:03:11 +09:00
Yuhong Guo
1b98fb8238 Fix Jenkins test failures and function descriptor bug. (#3569)
## What do these changes do?
1. Fix the Jenkins test failure by add driver id to Actor GCS Key.
2. Move `object_manager_test.py` from Jenkins to Travis.
2018-12-25 23:31:44 -08:00
Robert Nishihara
5426234cd8 Update documentation to reflect 0.6.1 release. (#3622) 2018-12-24 11:10:04 -08:00
nam-cern
3d8f56409b Ensure numpy is at least 1.10.4 in setup.py (#2462)
In the build script, numpy is specifically set at 1.10.4. We should also ensure that it is indeed the case in `setup.py`.
2018-12-24 11:01:25 -08:00
Eric Liang
9f63119a83
[rllib] Allow development without needing to compile Ray (#3623)
* wip

* lint

* wip

* wip

* rename

* wip

* Cleaner handling of cli prompt
2018-12-24 18:08:23 +09:00
Devin Petersohn
c13b2685f5 [modin] Append to path to avoid namespace collision on development branches (#3621) 2018-12-23 23:58:56 -08:00
Alexey Tumanov
9b8d7573fe bump version from 0.6.0 to 0.6.1 (#3610) 2018-12-23 17:03:42 -08:00
Robert Nishihara
bb7ca3bae7 Upgrade flatbuffers version to 1.10.0. (#3559)
* Upgrade flatbuffers version to 1.10.0.

* Temporarily change ray.utils.decode for backwards compatibility.
2018-12-23 14:56:34 -08:00
Tianming Xu
deb26b954e [rllib] Export tensorflow model of policy graph (#3585)
* Export tensorflow model of policy graph

* Add tests,examples,pydocs and infer extra signatures from existing methods

* Add example usage in export_policy_model comment

* Fix lint error

* Fix lint error

* Fix lint error
2018-12-22 17:35:25 +09:00
Eric Liang
ddc97864df [rllib] Add requested clarifications to test requirement of contrib docs (#3589) 2018-12-21 11:02:02 -08:00
Richard Liaw
e046a5c767
[tune] resources_per_trial from trial_resources (#3580)
Renaming variable due to user errors.
2018-12-20 19:00:47 -08:00
Devin Petersohn
a174a46e02 Allowing multiple users to access the /tmp/ray file at the same time (#3591)
* Allowing multiple users to access the /tmp/ray file at the same time

Previous sequence that caused this issue:
* User A starts ray with `ray.init` when /tmp/ray does not exist
* User B starts ray with `ray.init` and /tmp/ray now exists

User B will get a permissions error
Checking the permissions, /tmp/ray is 700

I have identified a race condition in `try_to_create_directory`
* Multiple processes try to create /tmp/ray at the same time
* chmod is either silently erroring or working properly within the race condition

Resolution: Move chmod outside of the check for whether the directory exists or not.

* Adding try except for users who do not own the directory
2018-12-20 18:46:54 -08:00
Stephanie Wang
34bab6291c
Cleanup actor handle pickling code (#3560)
* Cleanup actor handle pickling code

* remove unused

* fix

* lint
2018-12-20 16:37:21 -08:00
Eric Liang
6bb1103930 [rllib] Avoid sample wastage with bad PPO configurations (#3552)
## What do these changes do?

Previously we logged a warning if the PPO configuration would waste many samples. However, this didn't apply in the case of long episodes in `complete_episodes` batch mode, and also the amount of waste is up to 2x in common cases.

This pr:
- Estimates the number of sampling tasks needed to avoid over-sampling.
- Collects all sample results and never discards any. In principle this can degrade performance at large scale if certain machines are slower. Add a config flag to enable this legacy behavior.

## Related issue number

Closes: https://github.com/ray-project/ray/issues/3549
2018-12-20 10:50:44 -08:00
Richard Liaw
ac48a58e4e
[tune] Reduce scope of variant generator (#3583)
This PR provides a better error message when the generate_variants code
breaks. Also removes a comment about nesting dependencies.

This comes mainly as a hotfix solution for #3466. We should leave that issue open for future contribution 🙂
2018-12-20 10:48:28 -08:00
Eric Liang
303883a3b6 [rllib] [rfc] add contrib module and guideline for merging (#3565)
This adds guidelines for merging code into `rllib/contrib` vs `rllib/agents`. Also, clean up the agent import code to make registration easier.
2018-12-20 10:44:34 -08:00
adoda
cf0c4745f4 [rllib] support running older version tensorflow(version < 1.5.0) (#3571) 2018-12-19 20:27:24 -08:00
Robert Nishihara
a5309bec7c Make README render properly on PyPI. (#3578)
* Make README render properly in pypi.

* Add small logo

* temporary fix

* smaller image

* Remove image size.

* Add author and email to setup.py.
2018-12-19 18:41:09 -08:00
Eric Liang
ffa6ee3ec8
[rllib] streaming minibatching for IMPALA (#3402)
* mb impala

* fix

* paropt

* update

* cpu warn

* on cpu

* fix mb

* doc

* docs

* comment

* larger num

* early release

* remove grad clip

* only check loader count in multi gpu mode

* revert bad multigpu changes

* num sgd iter

* comment

* reuse optimizer

* add test

* par load test

* loosen test

* Update run_multi_node_tests.sh

* fix local mode

* Update agent.py
2018-12-19 02:23:29 -08:00
Alexey Tumanov
c4cba98c75 Remove deprecation warnings when running actor tests (#3563)
* remove deprecation warnings when running actor tests

* replacing logger.warn with logger.warning

* Update worker.py

* Update policy_client.py

* Update compression.py
2018-12-18 17:04:51 -08:00
Yuhong Guo
fb33fa9097 Enable function_descriptor in backend to replace the function_id (#3028) 2018-12-18 18:53:59 -05:00
Yuhong Guo
75ddf7cca4 Fix 2 small bugs (#3573) 2018-12-18 14:52:21 -05:00
Eric Liang
db0dee573e
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
opherlieber
854b06854f remove auto-concat of rollouts in AsyncSampler (#3556)
* remove auto-concat of rollouts in AsyncSampler

* remote auto-concat test

* remove unused reference
2018-12-17 13:54:52 -08:00
Robert Nishihara
417c7f2d6f Update arrow and remove plasma_manager references. (#3545) 2018-12-15 23:36:02 -08:00
Philipp Moritz
b3bf608608 Update arrow to reduce plasma IPCs. (#3497) 2018-12-14 23:49:37 -05:00
Richard Liaw
de3fdeb5b5
[autoscaler] Fix Error Handling for botocore (#3534)
Unfortunately Boto generates error classes dynamically, so this catches
the expected error and raises the error if it is the wrong class.

Closes #3533.
2018-12-14 00:20:49 -08:00