Commit graph

106 commits

Author SHA1 Message Date
fangfengbin
c17404918c
[GCS] Add gcs table storage interface (#7949) 2020-04-15 10:48:12 +08:00
ZhuSenlin
4a81793ba5
GCS-Based actor management implementation (#6763)
* add gcs actor manager

* fix test_metrics.py

* fix TestTaskInfo

* fix comment

* fix comment

* fix comment

* fix comment

* fix comment

* fix comment

* fix compile error

* fix merge error

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-13 09:48:48 -07:00
micafan
c222d64ca1
[GCS] Add MessagePublisher to GCS (#7771) 2020-04-13 19:32:28 +08:00
mehrdadn
07002825aa
Proper command-line parsing (#7603)
* Command-line parsing functions

* Work around bug in MSVCRT for passing command-lines to programs

* Polishing

* Fix std::regex_replace() overload compatibility issue with GCC 4.8.x

* Try to work around linker error

* Implement ScanToken()

* Parse command-lines via ScanToken

* Merge src/ray/util.cc and src/ray/url.cc

Co-authored-by: Mehrdad <noreply@github.com>
2020-04-11 23:07:07 -07:00
Stephanie Wang
d7eef808b8
[core] Reconstruction for lost plasma objects (#7733)
* Add a lineage_ref_count to References

* Refactor TaskManager to store TaskEntry as a struct

* Refactor to fix deadlock between TaskManager and ReferenceCounter
Add references to task specs

* Pin TaskEntries and References in the lineage of any ObjectIDs in scope

* Fix deadlock, convert num_plasma_returns to a set of object IDs

* fix unit tests

* Feature flag

* Do not release lineage for objects that were promoted to plasma

* fix build

* fix build

* Remove num executions

* Remove num executions

* Add pinned locations to ReferenceCounter, empty handler for node death

* Fix num returns for actor tasks, fix Put return value

* Add regression test

* Clear pinned locations and callbacks on node removal

* Clear pinned locations and callbacks on node removal

* Simplify num return values

* Remove unused

* doc

* tmp

* Set num returns

* Move lineage pinning flag to ReferenceCounter

* comments

* Recover from plasma failures by pinning a new copy

* Basic object reconstruction, no concurrent reqs yet

* reconstruction test suite and a few fixes:
- fix for disabling lineage
- fix for updating submitted task refs

* Handle concurrent attempts to recover the same object

* Fix deadlock in DrainAndShutdown

* Revert "[core] Revert lineage pinning (#7499) (#7692)"

This reverts commit ba86a02b37.

* debug rllib

* debug rllib

* turn on all rllib tests again

* debug rllib

* Fix drain bug, check number of pending tasks

* revert rllib debug

* remove todo

* Trigger rllib tests

* revert rllib debug commit

* Split out logic into ObjectRecoveryManager

* Fix python tests

* Refactor to remove dependency on gcs client

* Unit tests

* Move pinned at node ID to direct memory store

* Unit test fixes and lint

* simplify and more tests

* Add ResubmitTask test for TaskManager

* Doc

* fix build

* comments

* Fix

* debug

* Update

* fix

* Fix

* Fix bad status handling, unit test

* Fix build
2020-04-11 16:52:57 -07:00
Kai Yang
48b48cc8c2
Support multiple core workers in one process (#7623) 2020-04-07 11:01:47 +08:00
micafan
e91595f955
[GCS] Add ObjectLocator to gcs server (#7557) 2020-04-07 10:37:24 +08:00
micafan
780c1c3b08
[GCS] impl RedisStoreClient for GCS Service (#7675) 2020-04-01 21:18:19 +08:00
SangBin Cho
c23e56ce9a
Metrics Export Service (#7809) 2020-03-30 23:28:32 -07:00
mehrdadn
f86e623095
Fix & improve GitHub Actions CI builds (#7784) 2020-03-30 16:29:54 -07:00
SongGuyang
c195dc8f88
Basic C++ worker implementation (#6125) 2020-03-27 23:01:08 +08:00
mehrdadn
e69664b74b
Miscellaneous Windows compatibility bugfixes (#7658)
* Windows compatibility bug fixes

* Use WSASend/WSARecv as WSASendMsg/WSARecvMsg do not work with TCP sockets

* Clean up some TODOs

* Fix duplicate compilations

* RedisAsioClient boost::asio::error::connection_reset

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-19 19:32:53 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support (#7080)
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
mehrdadn
a87199d240
Fix cyclic dependency between ray/util and ray/common (#7581)
* Fix cyclic dependency

Headers in ray/util should not depend on those in ray/common

* Move random generations to ray/common/test_util.h

* Add license header

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-03-14 12:44:53 -07:00
mehrdadn
fc76586518
Redis on Windows (#7509)
* Switch hiredis on Windows to that of the Windows port of Redis

* Use boost::asio::ip::tcp::socket::native_handle_type

* Use normal hiredis instead of Windows-specific one

* Finish up using normal hiredis

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-09 18:49:54 -07:00
mehrdadn
5fb5be0ba5
Some bug fixes for Windows (#7374)
* Fix MAP_SHARED check in sys/mman.h

* Fix missing :platform_shims dependency for ray_util

* dlmalloc patch for Arrow
2020-02-28 10:22:32 -08:00
mehrdadn
0efaa9b310
Use Redis for Windows (#7364) 2020-02-28 10:18:56 -08:00
mehrdadn
8730996682
Windows changes (#7315) 2020-02-27 15:14:10 -08:00
fangfengbin
ba494b5281
Fix gcs client rpc operation disorder bug (#7283) 2020-02-26 19:24:24 +08:00
Edward Oakes
d190e73727
Use our own implementation of parallel_memcopy (#7254) 2020-02-21 11:03:50 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
mehrdadn
e09f63ad65
Fix build errors and add more targets to Windows builds (#6811)
* Fix common.fbs rename (due to apache/arrow/commit/bef9a1c251397311a6415d3dc362ef419d154caa)

* Add missing COPTS

* Use socketpair(AF_INET) if boost::asio::local is unavailable (e.g. on Windows)

* Fix compile bug in service_based_gcs_client_test.cc (fix build breakage in #6686)

* Work around googletest/gmock inability to specify override to avoid -Werror,-Winconsistent-missing-override

* Fix missing override on IsPlasmaBuffer()

* Fix missing libraries for streaming

* Factor out install-toolchains.sh

* Put some Bazel flags into .bazelrc

* Fix jni_md.h missing inclusion

* Add ~/bin to PATH for Bazel

* Change echo $$(date) > $@ to date > $@

* Fix lots of unquoted paths

* Add system() call checks for Windows

Co-authored-by: GitHub Web Flow <noreply@github.com>
2020-02-11 16:49:33 -08:00
mehrdadn
83c4e947c7
Make Cython rules more consistent for Bazel (#6840) 2020-02-10 10:45:54 -08:00
mehrdadn
ad4ac9aa70
Add clang-iwyu (#7081)
* Add iwyu

Co-authored-by: GitHub Web Flow <noreply@github.com>
2020-02-07 16:19:46 -08:00
fangfengbin
ade7ebfc0c
Add service based gcs client (#6686) 2020-02-05 12:06:25 +08:00
mehrdadn
bde575b8dd Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)
This reverts commit fb8e3615d5.
2020-01-26 10:26:44 -06:00
Yunzhi Zhang
aa5427ca78 [Dashboard] Kill actor (#6906) 2020-01-24 17:21:44 -08:00
Yunzhi Zhang
0834bda8c1 [Dashboard] Display actor task execution info (#6705)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-01-22 22:33:55 -08:00
mehrdadn
139bf8908e Replace UNIX sockets with TCP sockets in Ray on Windows (#6823)
* Replace UNIX sockets with TCP sockets in Ray
2020-01-20 17:28:11 -08:00
mehrdadn
fb8e3615d5 Use Boost.Process instead of pid_t (#6510)
* Use Boost.Process instead of pid_t

This will let us handle child processes (mostly) uniformly across platforms.
TODO: There is no SIGTERM on Windows; achieving something equivalent is fairly involved.
2020-01-15 20:05:02 -08:00
mehrdadn
76c986bdc7 Windows compatibility stubs (#6706) 2020-01-05 21:21:17 -08:00
micafan
970cd78701 [GCS] refactor the GCS Client Dynamic Resource Interface (#6266) 2020-01-03 14:07:37 +08:00
micafan
a492333f4e [GCS] refactor the GCS Client Object Interface (#5695) 2019-12-27 15:18:54 +08:00
micafan
b98b288ffd [GCS] Change GCS Test to cc_test (#6596) 2019-12-26 14:34:35 +08:00
Chaokun Yang
7bbfa85c66 [Streaming] Streaming data transfer java (#6474) 2019-12-22 10:56:05 +08:00
fangfengbin
3c0164419b Add gcs server job info & actor info handler (#6469) 2019-12-20 14:28:04 +08:00
mehrdadn
7a24144bfd Polish Bazel build scripts (#6424)
* Polish Bazel build scripts

* Remove glog references from streaming_logging.cc

* Move out COPTS and reference them

* Disable streaming on Windows

* Remove -fno-gnu-unique
2019-12-17 02:38:36 -08:00
mehrdadn
74b2e871b7 Tentative workaround for some forks and signals on Windows (#6362)
* Platform shims for Windows

* Tentative workaround for some forks and signals on Windows

* Rewrite WorkerPool::StartProcess by moving spawnvp wrapper to a separate function

* Separate the spawnvp wrappers for POSIX and Windows

* Fix rv use
2019-12-16 16:57:49 -08:00
ZhuSenlin
6c0531683f Add gcs server as well as the unit test (#6401) 2019-12-15 13:23:42 +08:00
micafan
8c1520d18e [GCS] refactor the GCS Client Job Interface (#5503) 2019-12-12 16:57:32 +08:00
Chaokun Yang
6272907a57 [Streaming] Streaming data transfer and python integration (#6185) 2019-12-10 20:33:24 +08:00
micafan
668ce47360 [GCS] Add abstract interface of actor to GCS Client (#6269) 2019-12-05 13:38:29 +08:00
mehrdadn
75cc994e0a Update various build options relating to Windows (#6315)
* Update .bazelrc for Windows compatibility

* Block inclusion of (legacy) WinSock.h to avoid errors

* Suppress warnings for Windows code

* Include boost::asio in includes so that it is passed as -isystem to avoid warnings

* Link with -lpthread only on non-Windows

* Undefine BOOST_FALLTHROUGH, which is unnecessary and causes macro redefinition warnings

* Define RAY_STATIC and ARROW_STATIC to compile for Windows

* Add WinSock import library for Arrow
2019-12-01 15:05:50 -08:00
mehrdadn
b8cfdba752 Bazelify hiredis (#6203) 2019-11-29 15:32:45 -08:00
Stephanie Wang
f6a0408173
Track pending tasks with TaskManager (#6259)
* TaskStateManager to track and complete pending tasks

* Convert actor transport to use task state manager

* Refactor direct actor transport to use TaskStateManager

* rename

* Unit test

* doc

* IsTaskPending

* Fix?

* Shared ptr

* HUH?

* Update src/ray/core_worker/task_manager.cc

Co-Authored-By: Zhijun Fu <37800433+zhijunfu@users.noreply.github.com>

* Revert "HUH?"

This reverts commit f80f0ba204ff4da5e0b03191fa0d5a4d9f552434.

* Fix memory issue

* oops
2019-11-25 16:37:26 -08:00
Eric Liang
53641f1f74
Move more unit tests to bazel (#6250)
* move more unit tests to bazel

* move to avoid conflict

* fix lint

* fix deps

* separate

* fix failing tests

* show tests

* ignore mismatch

* try combining bazel runs

* build lint

* remove tests from install

* fix test utils

* better config

* split up

* exclusive

* fix verbosity

* fix tests class

* cleanup

* remove flaky

* fix metrics test

* Update .travis.yml

* no retry flaky

* split up actor

* split basic test

* split up trial runner test

* split stress

* fix basic test

* fix tests

* switch to pytest runner for main

* make microbench not fail

* move load code to py3

* test is no longer package

* bazel to end
2019-11-24 11:43:34 -08:00
Ion
68ac08332b Initial commit of new cluster resource scheduler (#6178) 2019-11-22 11:14:46 -08:00
mehrdadn
ba86c75c21 Patch Cython in grpc to use our COPTS (#6223) 2019-11-21 15:32:48 -08:00
Simon Mo
29ba6bfc64
Basic Async Actor Call (#6183)
* Start trying to figure out where to put fibers

* Pass is_async flag from python to context

* Just running things in fiber works

* Yield implemented, need some debugging to make it work

* It worked!

* Remove debug prints

* Lint

* Revert the clang-format

* Remove unnecessary log

* Remove unnecessary import

* Add attribution

* Address comment

* Add test

* Missed a merge conflict

* Make test pass and compile

* Address comment

* Rename async -> asyncio

* Move async test to py3 only

* Fix ignore path
2019-11-21 11:56:46 -08:00
Stephanie Wang
c0be9e6738
Resolve dependencies locally before submitting direct actor tasks (#6191)
* Priority queue in direct actor transport by task number

* Move LocalDependencyResolver out to separate file, share with direct actor transport

* works

* Test case for ordering

* Cleanups

* Remove priority queue

* comment

* Share ClientFactoryFn with direct actor transport

* Unit test

* fix
2019-11-20 16:45:19 -08:00