Commit graph

2505 commits

Author SHA1 Message Date
bjg2
0e37ac6d1d [wingman -> rllib] Remote and entangled environments (#3968)
* added all our environment changes

* fixed merge request comments and remote env

* fixed remote check

* moved remote_worker_envs to correct config section

* lint

* auto wrap impl

* fix

* fixed the tests
2019-02-13 10:08:26 -08:00
Philipp Moritz
b3f72e8a75 Add regression tests for dataclass serialization (#3984) 2019-02-13 09:07:03 -08:00
Hao Chen
f31a79f3f7 Implement actor checkpointing (#3839)
* Implement Actor checkpointing

* docs

* fix

* fix

* fix

* move restore-from-checkpoint to HandleActorStateTransition

* Revert "move restore-from-checkpoint to HandleActorStateTransition"

This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.

* resubmit waiting tasks when actor frontier restored

* add doc about num_actor_checkpoints_to_keep=1

* add num_actor_checkpoints_to_keep to Cython

* add checkpoint_expired api

* check if actor class is abstract

* change checkpoint_ids to long string

* implement java

* Refactor to delay actor creation publish until checkpoint is resumed

* debug, lint

* Erase from checkpoints to restore if task fails

* fix lint

* update comments

* avoid duplicated actor notification log

* fix unintended change

* add actor_id to checkpoint_expired

* small java updates

* make checkpoint info per actor

* lint

* Remove logging

* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActionManager

* Replace old actor checkpointing tests

* Fix test and lint

* address comments

* consolidate kill_actor

* Remove __ray_checkpoint__

* fix non-ascii char

* Loosen test checks

* fix java

* fix sphinx-build
2019-02-13 19:39:02 +08:00
Andrew Tan
57dcd3033e [tune] Trial reporter fix (#3951)
Fixes #3949.
2019-02-13 01:03:54 -08:00
Wang Qing
3a7fb182cc Change the num of parallel jobs when building 2019-02-13 00:33:05 -08:00
William Ma
e1a479b137 Add teardown_module to test_queue.py (#4012) 2019-02-12 22:43:09 -08:00
Si-Yuan
21472b890a Integrate "tempfile_service" into "ray.node.Node" (#3953) 2019-02-12 17:34:04 -08:00
Adi Zimmerman
dac1969647 [tune] Add Nevergrad to Tune (#3985) 2019-02-12 11:00:04 -08:00
Wang Qing
c523bc04ad Enable redis password in Java worker (#3943)
* Support Java redis password

* Fix

* Refine

* Fix lint.
2019-02-12 13:11:25 +08:00
Adi Zimmerman
9797028a91 [tune] Add scikit-optimize to Tune (#3924) 2019-02-11 17:06:02 -08:00
Eric Liang
8df772867c [rllib] rename compute_apply to learn_on_batch 2019-02-11 15:22:15 -08:00
Eric Liang
c4182463f6 [rllib] Add helper to iterate over envs in a vectorized environment (#4001)
* add foreach env func

* fix

* add test
2019-02-11 10:40:47 -08:00
Daniel Edgecumbe
a70ae1687b .gitignore: Add Vim swap files (#4016) 2019-02-11 10:27:10 -08:00
Ion
3c32343c63 Ray signal (#3624) 2019-02-11 10:14:48 -08:00
ebrevdo
52dfde1cbb Update flatbuffer bazel rule to work with flatbuffer master branch. (#4008) 2019-02-11 10:00:06 -08:00
Zhijun Fu
7097ba393b protect raylet against bad messages (#4003)
* protect raylet against bad messages

* address comments

* linting and regression test
2019-02-12 00:39:38 +08:00
Wang Qing
bc438ca73b [Java] Refine Java config item (#4014)
* Refine

* Address comment.
2019-02-11 23:55:40 +08:00
Philipp Moritz
ab809bd927 update ray version to 0.7.0dev (#3995) 2019-02-10 19:56:42 -08:00
Eric Liang
8e9f2c923f [autoscaler] Use RLock in addition to FileLock 2019-02-10 19:16:43 -08:00
Yuhong Guo
5fb1efd60d Fix CI test failures (#4007) 2019-02-11 11:01:14 +08:00
bjg2
e703b9f49d [wingman -> rllib] Improved stats changes in AsyncSamplesOptimizer (#3966)
* added stats changes to optimizer

* changes timers

* fix python 2 compat

* improved optimizer throughput stats

* Update async_samples_optimizer.py

* fix python2 compat
2019-02-10 01:25:22 -08:00
Yuhong Guo
3a66d47a3a Remove RAY_CHECK from JNI code (#3978)
* Remove RAY_CHECK in JNI

* Try to add mvn test to test the exception.

* Refine

* Address comments
2019-02-09 18:10:22 +08:00
bibabolynn
728031a972 [java] when put an object in plasma store, ignore "object already exists" exception (#3687)
* distinct plasma client exception

* Update ObjectStoreProxy.java

* Update and rename PlasmaArrowTest.java to PlasmaStoreTest.java

* store put

* Use testng to replace junit to fix test failure
2019-02-09 18:03:17 +08:00
Eric Liang
29322c7389 [rllib] Replay buffer for IMPALA should default to 0 slots. (#3971)
* disable replay

* make lq configurable

* leak test

* Update run_multi_node_tests.sh
2019-02-08 10:03:11 -08:00
Robert Nishihara
6a32b410bb Update versions from 0.6.2 -> 0.6.3 in the documentation. (#3981) 2019-02-07 20:57:37 -08:00
Robert Nishihara
ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Philipp Moritz
0aa74fb1fd Update cloudpickle to 0.8.0.dev0 (#3964) 2019-02-07 15:24:06 -08:00
Eric Liang
ae4bc7d6e8 [revert] [rllib] Add copy() in async samples optimizer 2019-02-07 14:14:39 -08:00
markgoodhead
5ce670cb36 [tune] Add Initial Parameter Suggestion for HyperOpt (#3944)
Allows users of the HyperOptSearch suggestion algorithm to specify initial experiment values to run (typically already known good baseline parameters within the domain specified)
2019-02-07 10:57:51 -08:00
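The idea described in the commit message above (try known-good baseline parameters first, then hand control to the search algorithm) can be illustrated with a minimal sketch. This is not the HyperOptSearch implementation: the class name `SeededSearch`, the `initial_points` argument, and the uniform random fallback (standing in for HyperOpt's own sampler) are all assumptions made for illustration.

```python
import random


class SeededSearch:
    """Hypothetical suggester; NOT Tune's HyperOptSearch API."""

    def __init__(self, space, initial_points=None, seed=0):
        # space: {name: (low, high)} uniform ranges (assumed format).
        self._space = space
        # User-supplied baseline configurations to try before searching.
        self._pending = list(initial_points or [])
        self._rng = random.Random(seed)

    def suggest(self):
        # Serve the user-supplied points first, in the order given.
        if self._pending:
            return dict(self._pending.pop(0))
        # Fallback: uniform random sampling, a stand-in for the real
        # search algorithm's suggestion step.
        return {
            name: self._rng.uniform(low, high)
            for name, (low, high) in self._space.items()
        }
```

With `initial_points=[{"lr": 0.01}]`, the first call to `suggest()` returns that baseline and later calls draw from the search space.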
Ion
f987572795 Inline objects (#3756)
* added store_client_ to object_manager and node_manager

* half through...

* all code in, and compiling! Nothing tested though...

* something is working ;-)

* added a few more comments

* now, add only one entry in the GCS for inlined objects

* more comments

* remove a spurious todo

* some comment updates

* add test

* added support for meta data for inline objects

* avoid some copies

* Initialize plasma client in tests

* Better comments. Enable configuring inline_object_max_size_bytes.

* Update src/ray/object_manager/object_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* fixed comments

* fixed various typos in comments

* updated comments in object_manager.h and object_manager.cc

* addressed all comments...hopefully ;-)

* Only add eviction entries for objects that are not inlined

* fixed a bunch of comments

* Fix test

* Fix object transfer dump test

* lint

* Comments

* Fix test?

* Fix test?

* lint

* fix build

* Fix build

* lint

* Use const ref

* Fixes, don't let object manager hang

* Increase object transfer retry time for travis?

* Fix test

* Fix test?

* Add internal config to java, fix PlasmaFreeTest
2019-02-07 10:32:39 -08:00
Richard Liaw
5db1afef07 [tune] Support Custom Resources (#2979)
Support arbitrary resource declarations in Tune.

Fixes https://github.com/ray-project/ray/issues/2875
2019-02-07 00:29:19 -08:00
Robert Nishihara
a654152f9c Pin gym version in Python 2 tests. (#3973) 2019-02-06 23:56:14 -08:00
Philipp Moritz
3bb65677dc Use one memory mapped file for plasma (#3871) 2019-02-06 23:53:05 -08:00
Stephanie Wang
d2b6db3db1 Bump version from 0.6.2 to 0.6.3 (#3972) 2019-02-06 19:11:16 -08:00
Eric Liang
04fc145a44 [autoscaler] Autoscaler hangs forever on non-zero exit code command (#3969) 2019-02-06 17:25:24 -08:00
Stephanie Wang
49e9bec988 Fix raylet bug in driver cleanup (#3962)
* Fix task dependency manager cleanup on driver exit

* Add regression test

* Better check, update header
2019-02-06 11:19:10 -08:00
Stephanie Wang
244fd473f4 Only mark tasks as forwarded if they are in the lineage cache (#3958) 2019-02-05 23:01:38 -08:00
Alex LaGrassa
b0fe5af7c8 [doc] Update example-parameter-server.rst (#3773) 2019-02-05 22:00:54 -08:00
Robert Nishihara
fa4eb8313d Suppress warning for serializing different unique ID types in Python. (#3872)
* Suppress warning for serializing different unique ID types in Python.

* Add _ID_TYPES variable.
2019-02-05 11:38:33 -08:00
vfdev
b2b8417790 [tune] Improve mnist_pytorch.py example (#3894)
## What do these changes do?

* Improved --no-cuda handling
* Removed deprecated Variable usage


## Related issue number

Fixes #3873 
2019-02-04 17:59:54 -08:00
Eric Liang
5fb813ff39 Don't check fail on missing lineage cache entry (#3861) 2019-02-04 17:45:41 -08:00
William Ma
f067223c4a Allow Ray processes to be started inside of gdb and tmux. (#3847) 2019-02-04 15:23:39 -08:00
Yuhong Guo
add8ae7063 Add bazel build for JNI code (#3918)
* Add bazel build for JNI code

* clean

* Add plasma client JNI build process

* refine

* clean linux part

* Add Java Library

* Remove java library

* Generate dylib after build using genrule
2019-02-04 13:03:46 -08:00
Wang Qing
e1c68a0881 Enable including Java worker for ray start command (#3838) 2019-02-04 16:23:43 +08:00
Eric Liang
7ef830bef1 [rllib] Add copy() in async samples optimizer to fix memory leak (#3938)
Fixes #3884.
2019-02-03 18:34:37 -08:00
Andrew Tan
8323419a6d [tune] Add SigOpt Integration (#3844) 2019-02-03 18:23:57 -08:00
Kristian Hartikainen
85294fb503 [autoscaler] node caching changes (#3937)
Breaks the node provider node getter into cached and non-cached versions.

Fixes #3930 by updating the node label finger print before updating labels.
Fixes #3935 by refreshing node cache if node ip is not found.
2019-02-03 17:48:07 -08:00
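The cached/non-cached split and the refresh-on-miss behaviour described in the commit message above can be sketched roughly as follows. This is a hypothetical illustration, not the Ray autoscaler's node provider: the class `CachingNodeProvider`, its method names, and the `fetch_nodes` callable (standing in for a cloud API call) are all assumed.

```python
class CachingNodeProvider:
    """Hypothetical provider; NOT the Ray autoscaler API."""

    def __init__(self, fetch_nodes):
        # fetch_nodes: callable returning {node_id: ip}; stands in for
        # an expensive cloud API request (assumed interface).
        self._fetch_nodes = fetch_nodes
        self._cache = {}
        self.fetch_count = 0

    def non_cached_nodes(self):
        """Always query the backend and replace the cache."""
        self.fetch_count += 1
        self._cache = dict(self._fetch_nodes())
        return self._cache

    def cached_nodes(self):
        """Return the last known nodes without querying the backend."""
        return self._cache

    def node_id_for_ip(self, ip):
        """Look up a node by IP, refreshing the cache once on a miss."""
        for attempt in range(2):
            for node_id, node_ip in self._cache.items():
                if node_ip == ip:
                    return node_id
            if attempt == 0:
                # IP not found: the cache may be stale, so refresh it
                # and search one more time.
                self.non_cached_nodes()
        raise KeyError(ip)
```

Lookups for known IPs never touch the backend; only an unknown IP triggers a single refresh before the lookup fails.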
James Casbon
976f018dab [autoscaler] GCP: only call setIamPolicy if necessary (#3782) 2019-02-03 16:16:00 -08:00
James Casbon
b8cc176b4d [autoscaler] Document gcp subnet config (#3783)
Adds info to the gcp example yaml on using shared subnets.
2019-02-03 16:14:44 -08:00
Si-Yuan
9295ab8f60 Various Python code cleanups. (#3837) 2019-02-03 10:16:24 -08:00