Stephanie Wang
f6a0408173
Track pending tasks with TaskManager ( #6259 )
...
* TaskStateManager to track and complete pending tasks
* Convert actor transport to use task state manager
* Refactor direct actor transport to use TaskStateManager
* rename
* Unit test
* doc
* IsTaskPending
* Fix?
* Shared ptr
* HUH?
* Update src/ray/core_worker/task_manager.cc
Co-Authored-By: Zhijun Fu <37800433+zhijunfu@users.noreply.github.com>
* Revert "HUH?"
This reverts commit f80f0ba204ff4da5e0b03191fa0d5a4d9f552434.
* Fix memory issue
* oops
2019-11-25 16:37:26 -08:00
mehrdadn
ed5154d7fe
Modify RayLogLevel to avoid conflicts with DEBUG macro and ERROR macros that are defined externally ( #6204 )
...
* Prevent name collision of ERROR macro from Windows with RayLogLevel::ERROR
2019-11-25 17:02:26 -07:00
mehrdadn
ca08a8f479
Update grpc to version that fixes typo in third_party/py/python_configure.bzl ( #6235 )
...
See https://github.com/grpc/grpc/pull/20774
2019-11-25 15:20:33 -07:00
Eric Liang
64a3a7239e
Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic ( #6171 )
2019-11-25 14:12:11 -08:00
Edward Oakes
c9314098b9
Implement direct task worker lease timeouts ( #6188 )
2019-11-25 14:48:19 -07:00
Edward Oakes
e72aef2ba6
[hotfix] Fix building linux wheels
2019-11-25 12:45:31 -07:00
Ameer Haj Ali
71316fa8d0
wrap models with DistributionalQModel when running DQN ( #6258 )
...
* wrap models with DistributionalQModel when running DQN
* wrap only for tensorflow models
* Update custom_keras_model.py
2019-11-25 00:11:24 -08:00
Eric Liang
7917bbef78
Set progress report interval for bazel explicitly ( #6262 )
...
* set progress interval
* add keep alive
* add keepalive
* remove cat
* smaller time
* squash error
* reduce log spam
2019-11-24 22:37:59 -08:00
Simon Mo
c8b69727cd
ray stop only kills process with ray keyword ( #6257 )
...
* Use psutil to kill processes
* Psutil as core requirement
* Revert "Psutil as core requirement"
This reverts commit d3235ce3d994d2bb7db39e3ad4a46049703898bb.
* Revert "Use psutil to kill processes"
This reverts commit de0ed874fed673f5e98715950688f418bbcc415c.
* Revert back to subproc
* Add comments, grep for ray as well
* SIGTERM
2019-11-24 16:32:07 -08:00
Eric Liang
e5b5c98558
Fix python PATH for build ( #6260 )
2019-11-24 15:32:06 -08:00
Eric Liang
53641f1f74
Move more unit tests to bazel ( #6250 )
...
* move more unit tests to bazel
* move to avoid conflict
* fix lint
* fix deps
* separate
* fix failing tests
* show tests
* ignore mismatch
* try combining bazel runs
* build lint
* remove tests from install
* fix test utils
* better config
* split up
* exclusive
* fix verbosity
* fix tests class
* cleanup
* remove flaky
* fix metrics test
* Update .travis.yml
* no retry flaky
* split up actor
* split basic test
* split up trial runner test
* split stress
* fix basic test
* fix tests
* switch to pytest runner for main
* make microbench not fail
* move load code to py3
* test is no longer package
* bazel to end
2019-11-24 11:43:34 -08:00
Simon Mo
aa8d5d2f6c
Rate limit asyncio actor ( #6242 )
2019-11-24 11:39:28 -08:00
Simon Mo
9f0d005ce6
Use jobs 50 ( #6255 )
2019-11-24 00:32:38 -08:00
Yuhao Yang
f6a5baf844
[tune] minor doc fix ( #6248 )
2019-11-23 21:54:41 -08:00
Stephanie Wang
d2662fecea
Miscellaneous bug fixes to throw unreconstructable errors for direct calls ( #6245 )
...
* Test cases
* Fix InPlasmaError
* raylet fixes to force errors for direct calls
* Disable lineage logging and task pending checks for direct calls
* move todo
* Clean up tests
* Fix bugs in object store for Contains and Delete
* Use direct call in tests
* Fixes, separate actor creation direct call from normal direct call spec
2019-11-23 15:05:49 -08:00
Stephanie Wang
c4fa3b3afb
fix ( #6251 )
2019-11-23 15:04:48 -08:00
Eric Liang
ea270495a1
Remove stray change ( #6247 )
2019-11-23 00:07:45 -08:00
mehrdadn
94d37eee28
Update Boost via our own rule instead of managing our own fork ( #6238 )
2019-11-22 16:10:47 -08:00
Edward Oakes
ae5abc48a9
Fix race condition in redis_async_context.cc ( #6231 )
...
* dispatch callback to backend thread
* tmp: test in loop
* compiling
* Works using shared_ptrs
* Revert "tmp: test in loop"
This reverts commit faf1f8f74b34a99396906f56827d2691472ae7d4.
* Copy into CallbackReply
* fix comment
* warning
* add nil case
2019-11-22 15:51:40 -08:00
Simon Mo
f53f576120
Quiet Wget ( #6244 )
2019-11-22 14:32:14 -08:00
Eric Liang
b052bcf1fc
Bazelify tune tests in travis ( #6219 )
2019-11-22 13:58:50 -08:00
Ion
68ac08332b
Initial commit of new cluster resource scheduler ( #6178 )
2019-11-22 11:14:46 -08:00
mehrdadn
05ce789e5b
Reorganize ray_deps_setup.bzl to make all the GitHub rules uniform and download ZIP files for everything ( #6193 )
...
* Reorganize ray_deps_setup.bzl to make all the GitHub rules uniform
* Rewrite github_repository with explicit keyword-only arguments
Requires Bazel >= 0.29.0: https://github.com/bazelbuild/buildtools/pull/677
2019-11-22 09:59:32 -08:00
Simon Mo
eb6a93c0f0
[hotfix] fix lint ( #6236 )
2019-11-21 18:30:57 -08:00
Eric Liang
7559fdb141
[rllib/tune] Cache get_preprocessor() calls, default max_failur… ( #6211 )
2019-11-21 15:55:56 -08:00
Stephanie Wang
d3227f2f2d
Fix bug in direct task calls for objects that were evicted ( #6216 )
...
* Fix bug and add some checks
* rename
2019-11-21 15:38:31 -08:00
Stephanie Wang
eb7b73d731
Disconnect direct task workers that died ( #6213 )
...
* Disconnect workers that died so that we push the worker died error to redis
* Push error if actor is non nil
* fix test
2019-11-21 15:37:15 -08:00
mehrdadn
ba86c75c21
Patch Cython in grpc to use our COPTS ( #6223 )
2019-11-21 15:32:48 -08:00
Simon Mo
57e101e648
[CI] Pass cloud cache secrets to linux wheel ( #6232 )
2019-11-21 14:41:13 -08:00
Simon Mo
29ba6bfc64
Basic Async Actor Call ( #6183 )
...
* Start trying to figure out where to put fibers
* Pass is_async flag from python to context
* Just running things in fiber works
* Yield implemented, need some debugging to make it work
* It worked!
* Remove debug prints
* Lint
* Revert the clang-format
* Remove unnecessary log
* Remove unnecessary import
* Add attribution
* Address comment
* Add test
* Missed a merge conflict
* Make test pass and compile
* Address comment
* Rename async -> asyncio
* Move async test to py3 only
* Fix ignore path
2019-11-21 11:56:46 -08:00
Simon Mo
c4132b501b
[CI] Add Remote Caching ( #6210 )
2019-11-21 11:36:36 -08:00
Eric Liang
7f52d019ca
Inline memory_store_provider into memory_store ( #6217 )
2019-11-21 10:13:53 -08:00
Philipp Moritz
a4437813eb
[Projects] Unify hyphen vs underscore handling for arguments ( #6208 )
2019-11-20 23:52:41 -08:00
Eric Liang
1f9ab74293
Fix hang on Ray shutdown ( #6201 )
2019-11-20 23:30:35 -08:00
Eric Liang
425edb5cd9
Support NotifyBlocked/UnBlocked for direct call tasks ( #6177 )
2019-11-20 22:07:12 -08:00
Stephanie Wang
db77595298
Fix segfault for task arguments passed by value ( #6214 )
...
* Fix null data
* rename
2019-11-20 22:02:18 -08:00
mehrdadn
95bf977839
Rename UpdateResource due to conflict with Windows ( #6205 )
...
* Rename UpdateResource due to conflict with Windows
* Rename UpdateResource_ to UpdateResourceCapacity
2019-11-20 20:44:13 -08:00
Stephanie Wang
c0be9e6738
Resolve dependencies locally before submitting direct actor tasks ( #6191 )
...
* Priority queue in direct actor transport by task number
* Move LocalDependencyResolver out to separate file, share with direct actor transport
* works
* Test case for ordering
* Cleanups
* Remove priority queue
* comment
* Share ClientFactoryFn with direct actor transport
* Unit test
* fix
2019-11-20 16:45:19 -08:00
Philipp Moritz
33c768ebe4
Fix worker signal.SIGTERM handler being installed from outside the main thread ( #6176 )
2019-11-20 11:14:28 -08:00
Ujval Misra
0010382cc7
[tune] Report failures in a separate table ( #6160 )
...
* Report errors in a separate table.
* Single error file.
2019-11-20 10:53:47 -08:00
micafan
e7dbafa000
fix gcs::RedisAsioClient non-thread safe ( #5946 )
2019-11-20 10:18:35 -08:00
Eric Liang
23ef58716d
Fix crash on sys.exit of direct task calls ( #6202 )
2019-11-19 21:30:48 -08:00
Richard Liaw
d3c7a8fda5
[docs] yarn update ( #6173 )
2019-11-19 16:15:08 -08:00
ashione
a1744f67fe
Add hostname to nodeinfo ( #6156 )
2019-11-19 15:03:46 +08:00
mehrdadn
f9d2d106b1
Updates to .bazelrc to address some issues seen on Windows ( #6187 )
2019-11-18 20:54:23 -08:00
Danyang Zhuo
4f583ec784
Improve Object Transfer Performance ( #6067 )
2019-11-18 14:40:34 -08:00
Yuhao Yang
d3ff2252c4
[doc] Fix link to getting involved
2019-11-18 12:59:14 -08:00
Eric Liang
8fc2272f43
[rllib] Reorganize trainer config, add warnings about high VF loss magnitude for PPO ( #6181 )
2019-11-18 10:39:07 -08:00
Ujval Misra
2965dc1b72
[tune] Fault tolerance improvements ( #5877 )
...
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean-up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix richards bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* .
* Move handling of no available trials to ray_trial_executor (#1 )
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address richard's comments
2019-11-18 01:14:41 -08:00
Stephanie Wang
66edebce3a
Spillback scheduling for direct task calls ( #6164 )
...
* add dac
* remove caching
* rename return buffer
* cleanup
* add tests
* add perf
* fix
* flip
* remove
* remove it
* lint
* remove fork safety
* lint
* comments
* s/core/client
* wip
* remove
* fmt
* consistently return direct naming
* basic pass by ref
* fix bugs
* wip
* wip
* wip
* wip
* add test
* works now
* fix constructor
* fix merge
* add todo for perf
* fix single client test
* use lower n
* bazel
* faster
* fix core worker test
* init
* fix tests
* no plasma for direct call
* Update worker.py
* add order test
* fixes
* comments
* remove old assert
* lint
* add test
* Very wip
* wip
* add options for tasks
* add test
* fmt
* add backpressure
* remove idle prof event
* lint
* Fix 0 returns
* Set memcopy threads globally
* add benchmark
* Fix object exists
* Fix reference
* Remove return_buffer
* Add check
* add exit handler
* update benchmarks
* Fix compile error
* Fix NoReturn
* Use is instead of == for NoReturn
* fix
* Remove list comprehension
* Fix core worker test
* comment
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* fix merge error
* lint
* wip
* fix merge
* wip
* finish
* lint
* task interface
* add file
* add
* wip
* now works!
* updated
* wip
* dep resolution
* remove remote dep handling
* comments
* fix test_multithreading
* fix merge
* fix exit handling
* fix merge
* comments
* get fallback fetch working
* handle contains
* fix typo
* Skeleton for SubmitTask proto
* Update src/ray/common/id.h
Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>
* comments
* rename to core worker service
* lint
* fix compile
* wip
* update
* error code
* fix up and rename
* clean up call manager
* comments
* add test and cleanup deserialization
* fix pickle
* fix comments, lint
* test todo
* comments
* use shared ptr
* rename
* Update src/ray/protobuf/gcs.proto
Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>
* require transport type for ids; lint
* cleanup
* comments 1
* use worker available for real
* wip
* fix test
* resolve local dependencies test
* add num pending metric
* client factory
* unit test task submission
* wip
* fix bug
* rename
* Pass through node manager port, connect in raylet client
* finish rename
* Switch submit task to grpc
* fix crash
* Check port in use
* fix merge
* comments more
* doc
* Remove default port, set port randomly from driver
* add unique_ptr comment about TaskSpec
* lint
* fix test
* update
* fix lint
* GetMessageMutable should not be const
* iwyu
* fix const
* Update direct_task_transport_test.cc
* fix segfault
* Fix test
* Add RpcAddress, set in actor table data
* fix serialization
* fix lint
* Pass through task caller address
* Fix object manager test
* RpcAddress -> Address
* merge
* Port WorkerLease to grpc
* wip
* fix test
* add mem test
* update
* comments
* fix core worker tests
* fix
* remove old worker lease code
* First pass on spillback
* lint
* crash?
* Debug
* Fix task spec copy, extend test basic
* lint
* Port return worker to grpc
* lint
* Return worker to the correct raylet
* Only request worker if queued tasks
* A bit better failure handling
* Fix unit test
* Add unit test for spillback
* fix
* python test multinode
* update
* updates
* fix
2019-11-17 20:29:32 -08:00