Commit graph

2354 commits

Author SHA1 Message Date
Richard Liaw
960a943503 [tune] Fault Tolerance: handle lost checkpoints by restart (#3657)
Checks that node failure with lost checkpoints does not crash. Also adds test.
2019-01-04 22:05:27 -08:00
Eric Liang
7db1f3be2a [tune] resume=False by default but print a tip to set resume="prompt" + jenkins fix (#3681) 2019-01-04 17:23:19 -08:00
Kristian Hartikainen
747b117929 [tune] Tweak/allow nested pbt mutations (#3455)
* Fix warning text in pbt logger

* Allow nested mutations in pbt by recursing explore function

* Add test for nested pbt mutation

* Update pbt explore to only call custom explore on top level

* fix test
2019-01-04 13:51:11 -08:00
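The recursion described in the commit above can be sketched roughly as follows. This is a simplified illustration, not Tune's actual implementation: the names `explore`, `_mutate`, `resample_probability`, and `custom_explore_fn` are assumptions for the example, and the perturbation rule (resample or keep) is deliberately minimal.

```python
import copy
import random


def _mutate(config, mutations, resample_probability, rng):
    """Return a mutated copy of config, recursing into nested mutation dicts."""
    new_config = copy.deepcopy(config)
    for key, spec in mutations.items():
        if isinstance(spec, dict):
            # Nested mutation spec: recurse with the matching sub-config.
            new_config[key] = _mutate(config[key], spec, resample_probability, rng)
        elif rng.random() < resample_probability:
            if isinstance(spec, list):
                # Resample from a list of allowed values.
                new_config[key] = rng.choice(spec)
            elif callable(spec):
                # Resample by calling a user-supplied sampler.
                new_config[key] = spec()
            else:
                raise TypeError("unsupported mutation spec for %r" % key)
    return new_config


def explore(config, mutations, resample_probability=0.25,
            custom_explore_fn=None, rng=random):
    """Mutate config; the custom hook (hypothetical) runs once, at the top level only."""
    new_config = _mutate(config, mutations, resample_probability, rng)
    if custom_explore_fn is not None:
        new_config = custom_explore_fn(new_config)
    return new_config
```

Keys that have no entry in `mutations` are copied through unchanged, and the input config is never modified in place.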
Robert Nishihara
cd80891ddb Try to figure out the memory limit in a docker container. (#3605)
* Try to figure out the memory limit in a docker container.

* Update comment

* Fix

* Fix
2019-01-03 23:07:24 -08:00
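One common way to detect a container memory limit is to read the cgroup v1 memory controller file and take the smaller of that value and the host's total memory. The sketch below assumes that approach; the commit does not say which mechanism it uses, and the function name `available_memory_bytes` is hypothetical.

```python
# cgroup v1 location of the memory limit; cgroup v2 hosts use a different file.
CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def available_memory_bytes(system_memory_bytes, limit_path=CGROUP_V1_LIMIT):
    """Return the smaller of the system memory and the cgroup limit, if readable."""
    try:
        with open(limit_path) as f:
            cgroup_limit = int(f.read().strip())
    except (OSError, ValueError):
        # No cgroup file (or unparsable contents): fall back to system memory.
        return system_memory_bytes
    # Outside a limited container the cgroup value is typically huge,
    # so taking the minimum leaves the system value in effect.
    return min(system_memory_bytes, cgroup_limit)
```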
Robert Nishihara
586a5c9ffa Limit default redis max memory to 10GB. (#3630)
* Limit Redis max memory to 10GB/shard by default.

* Update stress tests.

* Reorganize

* Update

* Add minimum cap size for object store and redis.

* Small test update.
2019-01-03 13:23:54 -08:00
Yuhong Guo
4b23a34c93 Fix multi-thread problem of function manager and Jenkins test (#3648) 2019-01-03 17:05:13 +08:00
Yuhong Guo
ad2287ebe9 Fix new boost libs failure in cache-lib mode and add test to cover collect_dependent_libs.sh (#3627)
* Fix building breaks and add lib collection to Travis.

* Fix arrow build

* Fix version mismatch problem
2019-01-02 23:51:11 -08:00
Eric Liang
ca864faece [rllib] Documentation for I/O API and multi-agent support / cleanup (#3650) 2019-01-03 15:15:36 +08:00
opherlieber
2177e2f410 [rllib] Agent: Allow unknown subkeys for custom_resources_per_worker (#3639)
* RLLib Agent: Allow unknown subkeys for custom_resources_per_worker

* Update agent.py
2019-01-03 14:19:59 +08:00
Eric Liang
47d36d7bd6 [rllib] Refactor pytorch custom model support (#3634) 2019-01-03 13:48:33 +08:00
Robert Nishihara
b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00
Yuhong Guo
93e9d2b82c Improve backend log: env variable setting and format refine. (#3662)
* Improve backend logging

* Address comment

* Fix Raul's comment
2019-01-01 21:45:29 -08:00
Eric Liang
b8a9e3f106 [rllib] Remove uses of sgd_stepsize => lr (#3667)
* lr

* Update example-evolution-strategies.rst
2019-01-01 12:01:27 +08:00
Si-Yuan
93d54110f8 Prevent overriding faulthandler settings (#3668)
This change ensures that Ray sets up its fault handler only if one has not already been enabled by another application. Otherwise, some applications could face strange issues when using Ray, and some unit tests that use XML runners would fail.
2018-12-31 16:36:26 -08:00
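The guard described in the commit above can be expressed with the standard library's `faulthandler.is_enabled()`. This is a minimal sketch, not Ray's actual code; the function name `enable_faulthandler_if_unset` is an assumption for the example.

```python
import faulthandler
import sys


def enable_faulthandler_if_unset(file=None):
    """Enable faulthandler only if no one else has; return True if we enabled it."""
    if faulthandler.is_enabled():
        # Another application (or test runner) already configured it: leave it alone.
        return False
    faulthandler.enable(file=sys.stderr if file is None else file)
    return True
```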
Yuhong Guo
c9b8ecca51 Add RayParams to refactor the parameters used by ray python. (#3558) 2018-12-29 22:04:27 +08:00
Devin Petersohn
eb1e5fa2cf Fixing Python2 compatibility issues. Adding inline docs (#3656) 2018-12-28 22:53:28 -08:00
Richard Liaw
aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Zhijun Fu
382b138fc7 fix code issues in object manager that are reported by scanning tool (#3649)
Fix some code issues found by code scanning tool:

**1. Macro compares unsigned to 0(NO_EFFECT)**

CWE570: An unsigned value can never be less than 0
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "this->create_buffer_state_[object_id].num_seals_remaining >= 0UL".

~/ray/src/ray/object_manager/object_buffer_pool.cc: ray::ObjectBufferPool::SealChunk(const ray::UniqueID &, unsigned long)

**2. Inferred misuse of enum(MIXED_ENUMS)**

CWE398: An integer expression which was inferred to have an enum type is mixed with a different enum type
This case, "static_cast(ray::object_manager::protocol::MessageType::PushRequest)", implies the effective type of "message_type" is "ray::object_manager::protocol::MessageType".

~/ray/src/ray/object_manager/object_manager.cc: ray::ObjectManager::ProcessClientMessage(std::shared_ptr> &, long, const unsigned char *)
2018-12-28 14:38:59 -08:00
Zhijun Fu
3df1e1c471 Add missing lock in FreeObjects of object buffer pool (#3647)
The object manager uses multiple threads to transfer objects between nodes, so the plasma client used in object_buffer_pool_ must be protected by a lock. We have seen crashes caused by a missing lock in the FreeObjects() interface; this PR fixes that issue.
2018-12-28 11:47:31 -08:00
Wang Qing
c59b506c6e [Java] Support calling Ray APIs from multiple threads (#3646) 2018-12-28 17:44:31 +08:00
Hao Chen
0b682d043e Fix memory leak in PyRayletCient (#3640)
1) When using `PyObject_GetIter`, the caller must call `Py_DECREF` on the returned iterator to avoid a memory leak. With `PyList_GetItem`, which returns a borrowed reference, `Py_DECREF` isn't needed.
2) The `Py_BuildValue` call in `wait` doesn't need to increment the ref count.
2018-12-27 17:39:02 -08:00
Hao Chen
62af2f25be Fix test_multiple_actor_reconstruction failure (#3641)
* Fix test_multiple_actor_reconstruction failure

* add comment
2018-12-27 13:57:52 -08:00
Richard Liaw
ac792d70c8 [rllib] Add starcraft multiagent env as example (#3542) 2018-12-27 10:00:32 +08:00
Tianming Xu
b4f61dfd50 [rllib] Export policy model checkpoint (#3637)
* Export policy model checkpoint

* update comment
2018-12-27 08:43:06 +09:00
Richard Liaw
6e2d7a9ba1 [tune] Support Configuration Merging (#3584)
* merge configs

* deep merge

* lint

* add resolve

* test
2018-12-26 20:07:11 +09:00
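The "deep merge" step named in the commit above can be sketched as a recursive dictionary merge. This is an illustrative version only; the function name `deep_merge` and its exact semantics (override wins, nested dicts merged key by key) are assumptions, not Tune's implementation.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with override merged into base, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and type mismatches: the override value wins.
            merged[key] = copy.deepcopy(value)
    return merged
```

A shallow `dict.update` would replace an entire nested section when only one of its keys is overridden, which is why the merge has to recurse.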
Stan Wang
4ce3818be5 Average aggregated gradients before put in plasma store (#3631) 2018-12-26 20:03:11 +09:00
Wang Qing
4cde971916 [Java] Print the log message slowly. (#3633) 2018-12-26 16:33:21 +08:00
Yuhong Guo
1b98fb8238 Fix Jenkins test failures and function descriptor bug. (#3569)
## What do these changes do?
1. Fix the Jenkins test failure by adding the driver ID to the Actor GCS key.
2. Move `object_manager_test.py` from Jenkins to Travis.
2018-12-25 23:31:44 -08:00
Wang Qing
a971b73bbe [Java] Fix the issue when waiting an empty list or a null pointer (#3632) 2018-12-26 11:29:29 +08:00
Hao Chen
f4011754d6 Fix: ServerConnection should be closed before being removed (#3626)
Otherwise, in the event of a remote raylet crashing, the connection might be held by boost asio forever, and the pending callbacks will never get invoked. See also #3586.
2018-12-25 11:01:53 -08:00
Robert Nishihara
5426234cd8 Update documentation to reflect 0.6.1 release. (#3622) 2018-12-24 11:10:04 -08:00
Robert Nishihara
1e8cdb5421 Update release documentation. (#3587)
* Update release instructions.

* Add note about wheels.

* Fix

* Update

* update example

* Update RELEASE_PROCESS.rst
2018-12-24 11:09:09 -08:00
nam-cern
3d8f56409b Ensure numpy is at least 1.10.4 in setup.py (#2462)
In the build script, numpy is specifically set at 1.10.4. We should also ensure that it is indeed the case in `setup.py`.
2018-12-24 11:01:25 -08:00
Eric Liang
9f63119a83 [rllib] Allow development without needing to compile Ray (#3623)
* wip

* lint

* wip

* wip

* rename

* wip

* Cleaner handling of cli prompt
2018-12-24 18:08:23 +09:00
Devin Petersohn
c13b2685f5 [modin] Append to path to avoid namespace collision on development branches (#3621) 2018-12-23 23:58:56 -08:00
Si-Yuan
a1995ff3b0 Resize logo in README. (#3619) 2018-12-23 22:59:23 -08:00
Alexey Tumanov
9b8d7573fe bump version from 0.6.0 to 0.6.1 (#3610) 2018-12-23 17:03:42 -08:00
Robert Nishihara
bb7ca3bae7 Upgrade flatbuffers version to 1.10.0. (#3559)
* Upgrade flatbuffers version to 1.10.0.

* Temporarily change ray.utils.decode for backwards compatibility.
2018-12-23 14:56:34 -08:00
Robert Nishihara
ddd4c842f1 Initialize some variables in constructor instead of header file. (#3617)
* Initialize some variables in constructor instead of header file
2018-12-23 02:44:23 -08:00
Alexey Tumanov
bada42c334 object store notification mgr: fix using uninitialized variables (#3592)
Initialize private class variables to avoid valgrind errors. They are used before initialization.
2018-12-22 19:51:22 -08:00
Philipp Moritz
e578a38116 Fix TensorFlow and PyTorch compatibility (#3574)
* remove tensorflow workaround
* update docker
* add boost threads
* add date_time, too
* change link order
* cosmetics
2018-12-22 13:25:48 -08:00
Tianming Xu
deb26b954e [rllib] Export tensorflow model of policy graph (#3585)
* Export tensorflow model of policy graph

* Add tests,examples,pydocs and infer extra signatures from existing methods

* Add example usage in export_policy_model comment

* Fix lint error

* Fix lint error

* Fix lint error
2018-12-22 17:35:25 +09:00
Wang Qing
8393df2516 Use BaseTest to instead of TestListener. (#3577) 2018-12-21 16:29:16 -08:00
Eric Liang
ddc97864df [rllib] Add requested clarifications to test requirement of contrib docs (#3589) 2018-12-21 11:02:02 -08:00
Alexey Tumanov
6b179cb8a7 change the order of allocation for io_service and gcs client in raylet main (#3597) 2018-12-21 00:13:28 -08:00
bibabolynn
e65b8f18f4 [java] change RayLog.core to org.slf4j.Logger (#3579) 2018-12-21 15:58:32 +08:00
Richard Liaw
e046a5c767 [tune] resources_per_trial from trial_resources (#3580)
Renaming variable due to user errors.
2018-12-20 19:00:47 -08:00
Devin Petersohn
a174a46e02 Allowing multiple users to access the /tmp/ray file at the same time (#3591)
* Allowing multiple users to access the /tmp/ray file at the same time

Previous sequence that caused this issue:
* User A starts ray with `ray.init` when /tmp/ray does not exist
* User B starts ray with `ray.init` and /tmp/ray now exists

User B will get a permissions error
Checking the permissions, /tmp/ray is 700

I have identified a race condition in `try_to_create_directory`
* Multiple processes try to create /tmp/ray at the same time
* chmod is either silently erroring or working properly within the race condition

Resolution: Move chmod outside of the check for whether the directory exists or not.

* Adding try except for users who do not own the directory
2018-12-20 18:46:54 -08:00
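The resolution described in the commit above (run chmod regardless of whether the directory already existed, and tolerate users who do not own it) might look roughly like this. It is a sketch under stated assumptions: the function name `create_shared_directory` and the `0o777` mode are hypothetical, not taken from Ray's `try_to_create_directory`.

```python
import errno
import os


def create_shared_directory(path, mode=0o777):
    """Create path if needed, then always try to open up its permissions."""
    try:
        os.makedirs(path)
    except OSError as e:
        # Losing the creation race to another process is fine.
        if e.errno != errno.EEXIST:
            raise
    # chmod runs outside the existence check, so a directory created by a
    # racing process still ends up with the intended mode.
    try:
        os.chmod(path, mode)
    except PermissionError:
        # We do not own the directory; the owner is responsible for its mode.
        pass
```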
Stephanie Wang
34bab6291c Cleanup actor handle pickling code (#3560)
* Cleanup actor handle pickling code

* remove unused

* fix

* lint
2018-12-20 16:37:21 -08:00
Eric Liang
6bb1103930 [rllib] Avoid sample wastage with bad PPO configurations (#3552)
## What do these changes do?

Previously we logged a warning if the PPO configuration would waste many samples. However, this didn't apply in the case of long episodes in `complete_episodes` batch mode, and also the amount of waste is up to 2x in common cases.

This PR:
- Estimates the number of sampling tasks needed to avoid over-sampling.
- Collects all sample results and never discards any. In principle this can degrade performance at large scale if certain machines are slower, so a config flag is added to re-enable the legacy behavior.

## Related issue number

Closes: https://github.com/ray-project/ray/issues/3549
2018-12-20 10:50:44 -08:00
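The estimate mentioned in the commit above can be illustrated with a simple ceiling division. This sketch assumes fixed-size sample batches; the function and parameter names are hypothetical, and it does not cover the `complete_episodes` case, where batch sizes vary with episode length.

```python
import math


def num_sampling_tasks(train_batch_size, sample_batch_size, num_envs_per_worker=1):
    """Estimate how many sampling tasks fill one training batch without large excess."""
    samples_per_task = sample_batch_size * num_envs_per_worker
    # Round up so the batch is always filled; the overshoot is under one task's worth.
    return max(1, math.ceil(train_batch_size / samples_per_task))
```

Launching exactly this many tasks and keeping every result bounds the waste at less than one task's output, rather than the up-to-2x waste the commit message describes.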