Commit graph

600 commits

Author SHA1 Message Date
Daniel Edgecumbe
2e30f7ba38 Add a web dashboard for monitoring node resource usage (#4066) 2019-02-21 00:10:04 -08:00
Jones Wong
3ac8fd7ee8 Exploration with Parameter Space Noise (#4048)
*  enable parameter space noise for exploration

*  enable parameter space noise for exploration

*  yapf formatted

*  remove the usage of scipy softmax avialable in the latest version only

*  enable subclass that has no parameter_noise in the config

*  run user specified callbacks and test parameter space noise in multi node setting

*  formatted by yapf

* Update dqn.py

* lint
2019-02-20 22:35:18 -08:00
Yuhong Guo
1f864a02bc Add option of load_code_from_local which is required in cross-language ray call. (#3675) 2019-02-21 12:37:17 +08:00
Robert Nishihara
e7651b1117 Fix excessive buffering of worker stdout/stderr. (#4094)
* Start workers with 'python -u' to prevent buffering of prints.

* Set sys.stdout and sys.stderr.

* Add comment.
2019-02-19 20:20:47 -08:00
Robert Nishihara
5fe7b1c618 Make object_manager_test::test_object_transfer_retry less flaky. (#4057)
* Make object_manager_test::test_object_transfer_retry less flaky.

* Make the test pass.
2019-02-19 20:03:11 -08:00
Eric Liang
e9ee38ace2 More compact format for worker logs (#4092) 2019-02-19 19:53:43 -08:00
Wang Qing
794a093249 Add runtime_context to get some runtime fields in worker (#4065) 2019-02-19 15:57:30 +08:00
Robert Nishihara
b78d77257b Speed up test/component_failures_test.py::test_actor_creation_node_failure. (#4056) 2019-02-17 15:35:54 -08:00
Robert Nishihara
5a9098891f Add serialization test for more collection types. (#3982)
* Add serialization test for more collection types.

* Reorganize serialization tests a little.

* Update
2019-02-17 13:57:33 -08:00
Megan Kawakami
346885068c [rllib] add torch pg (#3857)
* add torch pg

* add torch imports

* added torch pg

* working torch pg implementation

* add pg pytorch

* Update a3c.py

* Update a3c.py

* Update torch_policy_graph.py

* Update torch_policy_graph.py
2019-02-16 19:54:14 -08:00
Yu Kobayashi
d2d66c576e Support non ascii characters in the source code (#4047) 2019-02-16 11:45:44 +08:00
Hao Chen
de17443dc2
Propagate backend error to worker (#4039) 2019-02-16 11:39:15 +08:00
Robert Nishihara
5f71751891 API cleanups. Remove worker argument. Remove some deprecated arguments. (#4025)
* Remove worker argument from API methods.

* Remove deprecated arguments and deprecate redirect_output and redirect_worker_output.

* Fix
2019-02-15 10:49:16 -08:00
Philipp Moritz
077ffd99bf Bump version from 0.6.3 to 0.7.0.dev0 in docs and .yaml (#4042) 2019-02-14 12:08:48 -08:00
Philipp Moritz
810cc17062 Fix LRU eviction of client notification datastructure (#4021)
* convert notification_key map to C++ datastructure

* fix crash and add debug string

* clean notification map up (this was a bug before)

* remove checks

* add jenkins test

* linting

* fixes

* properly erase

* clean up

* linting

* Update test_wait_hanging.py

* Update run_multi_node_tests.sh

* increase redis_max_memory

* fix dat jenkins

* update

* Update run_multi_node_tests.sh
2019-02-13 22:20:27 -08:00
Eric Liang
2dccf383dd
[rllib] Basic infrastructure for off-policy estimation (IS, WIS) (#3941) 2019-02-13 16:25:05 -08:00
Kristian Hartikainen
729d0b2825 [autoscaler] docker run options (#3921)
Adds support for docker options, allowing for use of nvidia-docker.

Closes #2657.
2019-02-13 12:26:28 -08:00
Stephanie Wang
4347ab644e
Use Redis lists in the GCS instead of zset (#4023)
* Convert zset to list

* Remove object evictions map from the object directory, yay

* comments

* Fix tests
2019-02-13 10:32:57 -08:00
bjg2
0e37ac6d1d [wingman -> rllib] Remote and entangled environments (#3968)
* added all our environment changes

* fixed merge request comments and remote env

* fixed remote check

* moved remote_worker_envs to correct config section

* lint

* auto wrap impl

* fix

* fixed the tests
2019-02-13 10:08:26 -08:00
Philipp Moritz
b3f72e8a75 Add regression tests for dataclass serialization (#3984) 2019-02-13 09:07:03 -08:00
Hao Chen
f31a79f3f7
Implement actor checkpointing (#3839)
* Implement Actor checkpointing

* docs

* fix

* fix

* fix

* move restore-from-checkpoint to HandleActorStateTransition

* Revert "move restore-from-checkpoint to HandleActorStateTransition"

This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.

* resubmit waiting tasks when actor frontier restored

* add doc about num_actor_checkpoints_to_keep=1

* add num_actor_checkpoints_to_keep to Cython

* add checkpoint_expired api

* check if actor class is abstract

* change checkpoint_ids to long string

* implement java

* Refactor to delay actor creation publish until checkpoint is resumed

* debug, lint

* Erase from checkpoints to restore if task fails

* fix lint

* update comments

* avoid duplicated actor notification log

* fix unintended change

* add actor_id to checkpoint_expired

* small java updates

* make checkpoint info per actor

* lint

* Remove logging

* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActionManager

* Replace old actor checkpointing tests

* Fix test and lint

* address comments

* consolidate kill_actor

* Remove __ray_checkpoint__

* fix non-ascii char

* Loosen test checks

* fix java

* fix sphinx-build
2019-02-13 19:39:02 +08:00
Si-Yuan
21472b890a Integrate "tempfile_service" into "ray.node.Node" (#3953) 2019-02-12 17:34:04 -08:00
Adi Zimmerman
dac1969647 [tune] Add Nevergrad to Tune (#3985) 2019-02-12 11:00:04 -08:00
Adi Zimmerman
9797028a91 [tune] Add scikit-optimize to Tune (#3924) 2019-02-11 17:06:02 -08:00
Ion
3c32343c63 Ray signal (#3624) 2019-02-11 10:14:48 -08:00
Zhijun Fu
7097ba393b protect raylet against bad messages (#4003)
* protect raylet against bad messages

* address comments

* linting and regression test
2019-02-12 00:39:38 +08:00
Yuhong Guo
5fb1efd60d Fix CI test failures (#4007) 2019-02-11 11:01:14 +08:00
Eric Liang
29322c7389
[rllib] Replay buffer for IMPALA should default to 0 slots. (#3971)
* disable replay

* make lq configurable

* leak test

* Update run_multi_node_tests.sh
2019-02-08 10:03:11 -08:00
Robert Nishihara
6a32b410bb Update versions from 0.6.2 -> 0.6.3 in the documentation. (#3981) 2019-02-07 20:57:37 -08:00
Robert Nishihara
ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Ion
f987572795 Inline objects (#3756)
* added store_client_ to object_manager and node_manager

* half through...

* all code in, and compiling! Nothing tested though...

* something is working ;-)

* added a few more comments

* now, add only one entry to the in GCS for inlined objects

* more comments

* remove a spurious todo

* some comment updates

* add test

* added support for meta data for inline objects

* avoid some copies

* Initialize plasma client in tests

* Better comments. Enable configuring nline_object_max_size_bytes.

* Update src/ray/object_manager/object_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* fiexed comments

* fixed various typos in comments

* updated comments in object_manager.h and object_manager.cc

* addressed all comments...hopefully ;-)

* Only add eviction entries for objects that are not inlined

* fixed a bunch of comments

* Fix test

* Fix object transfer dump test

* lint

* Comments

* Fix test?

* Fix test?

* lint

* fix build

* Fix build

* lint

* Use const ref

* Fixes, don't let object manager hang

* Increase object transfer retry time for travis?

* Fix test

* Fix test?

* Add internal config to java, fix PlasmaFreeTest
2019-02-07 10:32:39 -08:00
Philipp Moritz
3bb65677dc Use one memory mapped file for plasma (#3871) 2019-02-06 23:53:05 -08:00
vfdev
b2b8417790 [tune] Improve mnist_pytorch.py example (#3894)
## What do these changes do?

* Improved --no-cuda handling
* Removed deprecated Variable usage


## Related issue number

Fixes #3873 
<!-- Are there any issues opened that will be resolved by merging this change? -->
2019-02-04 17:59:54 -08:00
William Ma
f067223c4a Allow Ray processes to be started inside of gdb and tmux. (#3847) 2019-02-04 15:23:39 -08:00
Andrew Tan
8323419a6d [tune] Add SigOpt Integration (#3844) 2019-02-03 18:23:57 -08:00
Luke
002531b199 Enable LZ4 compression in pyarrow build (#3931)
Enable LZ4 compression in pyarrow build
2019-02-02 14:38:02 -08:00
Yuhong Guo
54cbb4396f Prepare socket file when start ray (#3925) 2019-02-02 12:53:36 +08:00
Daniel Edgecumbe
315edab085 [autoscaler] Speedups (#3720)
- NodeUpdater gets its' IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)

## Related issue number
Issue #3599
2019-02-01 02:46:32 -08:00
Peter Schafhalter
62a0a7bdc7 [tune] Add BayesOpt (#3864)
Adds BayesOpt as a Tune suggestion algorithm.
2019-01-31 16:54:17 -08:00
Richard Liaw
d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* momery

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigal TODO
2019-01-30 21:01:12 -08:00
Yuhong Guo
c45b91dcca Make redis module safe without crashing by removing RAY_CHECK (#3855) 2019-01-29 21:06:31 -08:00
Stephanie Wang
ad9f1721d1 Fix object_manager_test.py::object_transfer_retry test (#3863) 2019-01-27 13:55:38 -08:00
Yuhong Guo
066fa8abf3
Fix monitor_test.py by waiting for moniter.py to start working (#3840)
* Wait for moniter.py to start working

* Checkout None result in state.py
2019-01-25 18:07:15 +08:00
Si-Yuan
48139cf861 Migrate Python C extension to Cython (#3541) 2019-01-24 09:17:14 -08:00
Hao Chen
bfcf254e52 Fix: do not treat actor task as failed if the actor will be reconstructed (#3736) 2019-01-23 23:28:44 -08:00
Robert Nishihara
0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Robert Nishihara
9af5a62e05 Give better error for old-style actor classes. (#3793) 2019-01-17 19:05:04 -08:00
Richard Liaw
0537508106 Bump strings for 0.6.2 (#3801) 2019-01-17 19:03:27 -08:00
Jones Wong
319c1340cb [rllib] Develop MARWIL (#3635)
*  add marvil policy graph

*  fix typo

*  add offline optimizer and enable running marwil

*  fix loss function

*  add maintaining the moving average of advantage norm

*  use sync replay optimizer for unifying

*  remove offline optimizer and use sync replay optimizer

*  format by yapf

*  add imitation learning objective

*  fix according to eric's review

*  format by yapf

* revise

* add test data

* marwil
2019-01-16 19:00:43 -08:00
Richard Liaw
fa99fda2b4
Application Stress Tests (#3612) 2019-01-16 02:05:16 -08:00