Danyang Zhuo
4f583ec784
Improve Object Transfer Performance ( #6067 )
2019-11-18 14:40:34 -08:00
Yuhao Yang
d3ff2252c4
[doc] Fix link to getting involved
2019-11-18 12:59:14 -08:00
Eric Liang
8fc2272f43
[rllib] Reorganize trainer config, add warnings about high VF loss magnitude for PPO ( #6181 )
2019-11-18 10:39:07 -08:00
Ujval Misra
2965dc1b72
[tune] Fault tolerance improvements ( #5877 )
...
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean-up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix richards bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* .
* Move handling of no available trials to ray_trial_executor (#1 )
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address richard's comments
2019-11-18 01:14:41 -08:00
Stephanie Wang
66edebce3a
Spillback scheduling for direct task calls ( #6164 )
...
* add dac
* remove caching
* rename return buffer
* cleanup
* add tests
* add perf
* fix
* flip
* remove
* remove it
* lint
* remove fork safety
* lint
* comments
* s/core/client
* wip
* remove
* fmt
* consistently return direct naming
* basic pass by ref
* fix bugs
* wip
* wip
* wip
* wip
* add test
* works now
* fix constructor
* fix merge
* add todo for perf
* fix single client test
* use lower n
* bazel
* faster
* fix core worker test
* init
* fix tests
* no plasma for direct call
* Update worker.py
* add order test
* fixes
* comments
* remove old assert
* lint
* add test
* Very wip
* wip
* add options for tasks
* add test
* fmt
* add backpressure
* remove idle prof event
* lint
* Fix 0 returns
* Set memcopy threads globally
* add benchmark
* Fix object exists
* Fix reference
* Remove return_buffer
* Add check
* add exit handler
* update benchmarks
* Fix compile error
* Fix NoReturn
* Use is instead of == for NoReturn
* fix
* Remove list comprehension
* Fix core worker test
* comment
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* fix merge error
* lint
* wip
* fix merge
* wip
* finish
* lint
* task interface
* add file
* add
* wip
* now works!
* updated
* wip
* dep resolution
* remove remote dep handling
* comments
* fix test_multithreading
* fix merge
* fix exit handling
* fix merge
* comments
* get fallback fetch working
* handle contains
* fix typo
* Skeleton for SubmitTask proto
* Update src/ray/common/id.h
Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>
* comments
* rename to core worker service
* lint
* fix compile
* wip
* update
* error code
* fix up and rename
* clean up call manager
* comments
* add test and cleanup deserialization
* fix pickle
* fix comments, lint
* test todo
* comments
* use shared ptr
* rename
* Update src/ray/protobuf/gcs.proto
Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>
* require transport type for ids; lint
* cleanup
* comments 1
* use worker available for real
* wip
* fix test
* resolve local dependencies test
* add num pending metric
* client factory
* unit test task submission
* wip
* fix bug
* rename
* Pass through node manager port, connect in raylet client
* finish rename
* Switch submit task to grpc
* fix crash
* Check port in use
* fix merge
* comments more
* doc
* Remove default port, set port randomly from driver
* add unique_ptr comment about TaskSpec
* lint
* fix test
* update
* fix lint
* GetMessageMutable should not be const
* iwyu
* fix const
* Update direct_task_transport_test.cc
* fix segfault
* Fix test
* Add RpcAddress, set in actor table data
* fix serialization
* fix lint
* Pass through task caller address
* Fix object manager test
* RpcAddress -> Address
* merge
* Port WorkerLease to grpc
* wip
* fix test
* add mem test
* update
* comments
* fix core worker tests
* fix
* remove old worker lease code
* First pass on spillback
* lint
* crash?
* Debug
* Fix task spec copy, extend test basic
* lint
* Port return worker to grpc
* lint
* Return worker to the correct raylet
* Only request worker if queued tasks
* A bit better failure handling
* Fix unit test
* Add unit test for spillback
* fix
* python test multinode
* update
* updates
* fix
2019-11-17 20:29:32 -08:00
Philipp Moritz
fc655acfee
Fix linting on master branch ( #6174 )
2019-11-16 10:02:58 -08:00
Eric Liang
a68cda0a33
[rllib] remove exists call ( #6168 )
2019-11-15 21:59:40 -08:00
Danyang Zhuo
30e2b6b91b
Microbenchmark for inter-node object transfer ( #6098 )
2019-11-15 21:39:06 -08:00
Adam Gleave
e8cce3fdd4
[autoscaler]: automatically pull new docker image ( #6111 )
...
* Docker: automatically pull new image
* Fix missing value in schema
* Address review comments
2019-11-15 21:26:28 -08:00
Ion
1b80675206
Scheduling ids ( #6137 )
2019-11-15 16:04:16 -08:00
Edward Oakes
dee696577f
Fix passing object ids in local mode ( #6170 )
2019-11-15 15:46:39 -08:00
Edward Oakes
33040d734f
Disable stopgap GC by default ( #6165 )
...
* disable stopgap gc by default
* fix gc tests
2019-11-15 15:42:59 -08:00
Hersh Godse
7aa06fb25c
[tune] ExperimentalAnalysis in-memory cache ( #5962 )
2019-11-15 12:47:50 -08:00
Eric Liang
7d33e9949b
Integrate ref count module into local memory store ( #6122 )
2019-11-15 10:52:19 -08:00
Richard Liaw
62cbc043b4
[tune] tbx logger ( #6133 )
...
* tbx
* add_hparams
* fix_hparams
* ok
* ok
* fix
* ok
* fix
2019-11-15 08:45:44 -08:00
Eric Liang
8ff393a7bd
Handle exchange of direct call objects between tasks and actors ( #6147 )
2019-11-14 17:32:04 -08:00
Edward Oakes
385783fcec
Ray on YARN + Skein Documentation ( #6119 )
2019-11-14 15:06:05 -08:00
Edward Oakes
2758cd0b34
Make log message debug ( #6166 )
2019-11-14 15:05:36 -08:00
Edward Oakes
e3b95dafeb
Fix sigterm_handler ( #6141 )
2019-11-14 13:41:50 -08:00
Eric Liang
243b1b7281
[rllib] Add microbatch optimizer with A2C example ( #6161 )
2019-11-14 12:14:00 -08:00
Eric Liang
0a3623ded6
Fix memory store wait ( #6152 )
2019-11-14 10:17:30 -08:00
Stephanie Wang
bbadde57e0
Pass through caller address when submitting a task ( #6143 )
...
* Add RpcAddress, set in actor table data
* Pass through task caller address
* RpcAddress -> Address
* update
* fix
* lint
* fix cc tests
2019-11-14 09:14:08 -08:00
Ujval Misra
e3e3ad4b25
Add timeout param to ray.get ( #6107 )
2019-11-14 00:50:04 -08:00
waldroje
e4c0843f60
Allow EntropyCoeffSchedule to accept custom schedule ( #6158 )
...
* modify tf_policy to enable EntropyCoeffSchedule to handle list, and avoid negative values under current implementation
* Update custom_metrics_and_callbacks.py
* Update tf_policy.py
2019-11-14 00:45:43 -08:00
Eric Liang
e4565c9cc6
Reduce RLlib log verbosity ( #6154 )
2019-11-13 18:50:45 -08:00
Edward Oakes
51e76151d6
Use shared_ptr for gcs client in profiler ( #6150 )
2019-11-13 15:24:01 -08:00
Philipp Moritz
f24d96ec4f
Revert "Try to enable dashboard (again) ( #6069 )" ( #6159 )
...
This reverts commit 4044af8520.
2019-11-13 12:32:12 -08:00
Eric Liang
b924299833
Add large scale regression test for RLlib ( #6093 )
2019-11-13 12:22:55 -08:00
Eric Liang
f3f86385d6
Minimal implementation of direct task calls ( #6075 )
2019-11-12 11:45:28 -08:00
Stephanie Wang
35d177f459
Use grpc for communication from worker to local raylet (task submission and direct actor args only) ( #6118 )
...
* Skeleton for SubmitTask proto
* Pass through node manager port, connect in raylet client
* Switch submit task to grpc
* Check port in use
* doc
* Remove default port, set port randomly from driver
* update
* Fix test
* Fix object manager test
2019-11-11 21:17:25 -08:00
Siyuan (Ryans) Zhuang
f48293f96d
Fix deprecated warning ( #6142 )
2019-11-11 17:49:15 -08:00
Simon Mo
c75ada9e04
[Autoscaler][K8s] Enforce memory limit in k8s yaml ( #6138 )
...
* Enforce memory limit in k8s yaml
* Update python/ray/autoscaler/kubernetes/example-full.yaml
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Line wrap
2019-11-11 14:06:34 -08:00
Adi Zimmerman
776b071f3b
[tune] Let Search Algorithms use early stopped trials ( #5651 )
2019-11-11 09:38:14 -08:00
Edward Oakes
5780ec1b62
Refresh ObjectIDs in raylet for stopgap GC ( #6109 )
2019-11-10 23:12:59 -08:00
Philipp Moritz
decaa65cd6
Use pickle by default for serialization ( #5978 )
2019-11-10 18:12:18 -08:00
Adam Gleave
01aee8d970
[autoscaler] Retry creating EC2 instances in new AZ ( #6129 )
2019-11-09 19:44:27 -08:00
Miguel Morales
d17ae5ad7a
Update hyperband-cartpole.yaml ( #6121 )
...
Typo
2019-11-09 19:39:03 -08:00
Adam Gleave
c157e93ba1
[tune] Retry failed tasks with checkpointing disabled ( #6126 )
...
* Allow recovery for failed tasks without checkpointing
* Update docs
2019-11-09 19:35:27 -08:00
Philipp Moritz
ccbcc4bafa
Use gRPC and Bazel 1.0 ( #6002 )
2019-11-08 15:58:28 -08:00
Eric Liang
afca6d3d87
Object store full with cyclic python references ( #6114 )
2019-11-08 14:08:24 -08:00
Edward Oakes
83378a8610
Improve flaky test_warning_monitor_died ( #6113 )
2019-11-08 12:11:15 -08:00
Eric Liang
4044af8520
Try to enable dashboard (again) ( #6069 )
...
* Revert "Revert "Enable the Ray dashboard by default (#5976 )" (#6068 )"
This reverts commit 1a3e97cf23.
* fix tests that assume the dashboard isn't a job
* travis
2019-11-08 10:48:48 -08:00
Philipp Moritz
5a05eaaa54
Fix compilation on master ( #6116 )
2019-11-07 22:38:42 -08:00
Eric Liang
4a28306186
Allow large returns from direct actor calls ( #6088 )
2019-11-07 21:28:55 -08:00
Edward Oakes
ca53af4d0f
Add pending task dependencies to ObjectID ref counting ( #6054 )
2019-11-07 18:37:10 -08:00
Eric Liang
1f043daf69
[rllib] Fix and add test for LR annealing config
2019-11-07 12:17:27 -08:00
Simon Mo
fcb6bdbc39
[Doc] Document Actor.options API ( #6099 )
...
* Document Actor.options API
* Undocument _remote
2019-11-06 23:12:23 -08:00
Edward Oakes
9820c10a09
Simplify gRPC service definition for the worker ( #6095 )
2019-11-06 13:00:39 -08:00
David Bignell
3f83b2daa9
[rllib] Rollout extensions ( #6065 )
...
* Rollout improvements
* Make info-saving optional, to avoid breaking change.
* Store generating ray version in checkpoint metadata
* Keep the linter happy
* Add small rollout test
* Terse.
* Update test_io.py
2019-11-05 20:34:18 -08:00
Eric Liang
2a0225dd25
[rllib] RLlib chooses wrong neural network model for Atari in 0.7.5 ( #6087 )
2019-11-05 11:36:29 -08:00