Commit graph

2240 commits

Author SHA1 Message Date
Stephanie Wang
f76ce836b2
Distributed ref counting for serialized ObjectIDs (#6945)
* Skeleton plus a unit test for simple borrower case

* First unit test passes - forward an ID and task returns with 1 submitted task pending on the inner ID

* Invariant for contained_in

* Unit test passes for testing task return without creating a borrower

* Wrap ref count functionality in test case

* Fix bad delete

* Unit test and fix for borrowers creating more borrowers

* Unit test and fix for simple borrowing, but owner sends call after borrower's ref count goes to 0

* Refactor:
- keep a sentinel ref count for task argument IDs
- keep contained_in_borrowed in addition to contained_in_owned

* Unit test for nested IDs passes

* Refactor so that an object ID can only be contained in 1 borrowed ID at a time

* Add check

* Fix

* Unit test (passes) to test nesting object IDs but no borrowers created

* Unit test for nested objects from different owners passes, refactor to unset contained_in when popping refs

* Unit tests for borrowers receiving an ObjectID from multiple sources,
skip adding ownership info if we already have it to handle duplicate
refs

* Unit test for returning object ID passes

* More unit tests for returning object IDs pass

* Add serialized ID tests

* fix serialization issue

* remove swap

* It builds!

* debugging and some fixes:
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

* Fix accounting for serialized IDs:
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes

* mu_ -> mutex_

* lint

* fix build

* clear outer_object_id

* add direct call type check

* Fix test for direct call IDs and return IDs for actor calls

* Fix CoreWorkerClient.Addr()

* Remove unneeded lock

* Remove unnecessary ObjectID refs

* Fix worker holding serialized refs test

* Fix hex IDs

* fix

* fix tests

* fix tests

* refactor and cleanups

* lint

* Put inlined IDs in task args and some cleanup

* Add back gc.collect() line for test case

* Refactor and fixes:
- store inlined IDs in RayObject
- allow storing objects with inlined IDs in memory store
- pin objects that were promoted to plasma

* oops

* make sure worker ID is set in address, pass in rpc::Address to CoreWorkerClient

* todos

* cleanups and test builds

* Fix tests

* Add feature flag

* cleanups

* address comments and some cleanups

* cleanup

* fix recursive test

* Comments for tests

* Turn off ref counting by default

* Skip tests

* Fix some bugs for test_array.py, java build

* Don't include nested objects in the ref count when the feature flag is off

* C++ feature flag does not work...

* Remove

* Turn on python tests and add a warning when plasma objects are evicted before being pinned

* Fix build and remove irrelevant test

* Fix for java

* Revert "Fix build and remove irrelevant test"

This reverts commit 056cca9b263ed05b0f9ab2250907338edcbca2d5.

* Fix ray.internal.free

* Fixes and skip some flaky tests

* fix java build

* fix windows build

* Add IDs contained in owned objects

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* update

* Try to fix ::test_direct_call_serialized_id_eviction

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-18 18:21:34 -08:00
mehrdadn
4a12243336
Use Process instead of pid_t (round 2) (#6882)
* Revert "Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)"

This reverts commit bde575b8dd.

* Process wrapper, using Boost.Process on Windows

- Reverts bde575b8dd.
- Re-applies fb8e3615d5 after some refactoring.

* Remove Boost.Process dependency

* Don't open /proc file on Linux

* Change FATAL to ERROR and modify error message when process doesn't exist
2020-02-18 17:44:46 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental (#7100)
2020-02-18 13:43:19 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146)
2020-02-17 18:00:13 -08:00
mehrdadn
3bd82d0bcd
Fix various issues/warnings that come up on Jenkins (#7147)
* Avoid warning about swap being unlimited

Currently we get the following message on Jenkins:
"Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."

Since we're not limiting swap anyway, we might as well avoid trying to.
https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details

* Fix escaping in re.search()

* Fix escaping in _noisy_layer()

* Raise a more descriptive error when dashboard data isn't found

* Don't error on dashboard files not being found when webui isn't required

* Change dashboard error to a warning instead
2020-02-17 16:08:55 -08:00
Alex Wu
734629b4ea
Ssh command format (#7176)
2020-02-17 14:15:42 -08:00
Alind Khare
c6d768be14
[Serve] Added support for no http route services (#7010)
2020-02-17 11:31:30 -08:00
fyrestone
a6b8bd47b0
[xlang] Cross language serialize ActorHandle (#7134)
2020-02-17 20:44:56 +08:00
Edward Oakes
b079787c59
Fix flaky test_get_with_timeout (#7175)
2020-02-16 21:10:16 -08:00
Richard Liaw
94e2fcea2e
[sgd] fp16 (apex) and scheduler support + move examples page (#7061)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* more docs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* fix tests

* testmode

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-16 19:04:08 -08:00
Siyuan (Ryans) Zhuang
6745459f96
Apply cpython patch bpo-39492 for the reference counting issue in pickle5 (#7177)
* apply cpython patch bpo-39492 for the reference count issue
2020-02-15 21:16:13 -08:00
Edward Oakes
dc5a27dac0
Move ray.experimental.multiprocessing to ray.util.multiprocessing (#7149)
2020-02-14 16:17:05 -08:00
Richard Liaw
52d9189d5d
[autoscaler] port-forward for attach + redis_port (#7145)
* port-forward

* fixport

* force redis port in init mode

* test

* Update python/ray/tests/test_ray_init.py
2020-02-14 15:17:00 -08:00
Qing Wang
f3703bafa3
[Java] Support concurrent actor calls API. (#7022)
* WIP

Temp change

Attach native thread to jvm

* Fix run mode

* Address comments.
2020-02-14 13:02:39 +08:00
Alex Wu
0d3687a10d
No warning for docker memory > system memory (#7151)
2020-02-13 15:21:44 -08:00
Qing Wang
94a286ef1d
[Java] Add session_dir as temp_dir for logs, socket files like Python (#7044)
* Support

* Add gcs_server support

* Fix unit tests

* Fix

* Remove unused py code

* Fix linting

* Fix cross language ci

* Fix CI

* Add docstring

* Fix

* Fix linting

* Add a singleton for config

* Refine

* fix

* Fix

* linting

* Remove FileUnit

* Fix

* Fix

* Fix

* Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Fix streaming singleprocess CI

* Fix checkstyle

Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-02-13 17:49:52 +08:00
Edward Oakes
e904711e74
Add python tests for serialized object ID reference counting (#7038)
2020-02-12 16:52:07 -08:00
Edward Oakes
d91d3ea936
Split half of test_actor into test_actor_advanced (#7143)
2020-02-12 15:17:25 -08:00
Simon Mo
0e94e1dc2a
[Asyncio] Increase recursion limit manually (#7142)
2020-02-12 14:15:36 -08:00
Mitchell Stern
5dda0b66bf
[Dashboard] Refactor dialogs to use parent component state instead of routes (#7129)
2020-02-12 10:59:47 -08:00
aannadi
d941ac6c89
Updating package-lock.json with latest npm (#7128)
2020-02-12 09:54:20 -08:00
Eric Liang
305eaaabe9
Fix hang if actor object id is returned from a task that exits (#6885)
2020-02-11 20:28:13 -08:00
Simon Mo
039d2cde88
Change log level for OMP warning (#7114)
2020-02-11 14:15:38 -08:00
aannadi
d7ff55852a
[tune][Dashboard] Added Tune Dashboard (#6911)
2020-02-11 11:56:49 -08:00
Simon Mo
0ddc389830
Fix documentation building with psutil issue (#7077)
2020-02-11 10:00:29 -08:00
Eric Liang
58c94f6381
[core] Delete() should never remove objects from in-memory store (#7117)
2020-02-10 22:40:09 -08:00
Maksim Smolin
4139e02f01
[autoscaler] Add `--all-nodes` option to rsync-up (#7065)
* Add option to sync workers to rsync-up

* Format

* Rename --sync-workers to --all-nodes
2020-02-10 16:27:59 -08:00
Sven Mika
6e1c3ea824
[RLlib] Exploration API (+EpsilonGreedy sub-class). (#6974)
2020-02-10 15:22:07 -08:00
SangBin Cho
1e690673d8
Render tasks that are not schedulable on the dashboard. (#7034)
2020-02-10 14:23:06 -08:00
Alex Wu
3f99be8dad
Add 'ray dashboard' command (#6959)
2020-02-10 12:55:21 -08:00
Alex Wu
72c31e3e19
Ray nodes should respect docker limits (#7039)
2020-02-10 11:08:38 -08:00
chaokunyang
247a4d022a
Fix passing empty bytes in python tasks (#7045)
* ensure data_ won't be null_ptr when size == 0

* when data_sizes[i] == 0, we should Allocate an empty buffer

* workaround for pyarrow.py_buffer

* fix comments

* add null ptr check

* add test for bytes

* lint
2020-02-10 12:07:29 +08:00
fangfengbin
694c0f2867
[Java] Enable GCS server when running java unit tests (#7041)
* enable GCS service when running Java test cases

* fix ci bug

* fix windows compile bug

* fix ci bug

* restart ci job

* enable java testcase

* restart ci job

* restart ci job

* add debug log

* add debug log

* restart ci job

* add debug log

* restart ci

* add debug log

* fix java testcase bug

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job
2020-02-10 09:39:14 +08:00
Eric Liang
48e2adbc21
[tune] Remove unused TF loggers (#7090)
2020-02-09 13:58:24 -08:00
Ujval Misra
98a07fe37e
[tune] Asynchronous saves (#6912)
* Support asynchronous saves

* Fix merge issues

* Add test, fix existing tests

* More informative warning

* Lint, remove print statements

* Address comments, add checkpoint.is_resolved fn

* Add more detailed comments
2020-02-09 12:17:45 -08:00
fyrestone
0648bd28ef
[xlang] Cross language Python support (#6709)
2020-02-08 13:01:28 +08:00
Alind Khare
f146d05b36
[Serve] Added support for composing arbitrary DAGs (#7015)
2020-02-07 17:55:26 -08:00
Stephanie Wang
3333ee84a5
Fix ref counting (#7075)
2020-02-06 14:35:08 -08:00
Simon Mo
a0ba4499ac
[Serve] Fix batching bug
2020-02-05 14:18:19 -08:00
ijrsvt
0826f95e1c
Including psutil & setproctitle (#7031)
2020-02-05 14:16:58 -08:00
Sven Mika
93ed86f175
[Tune] logger.py: Relax TBX Summary ValueErrors with e.g. empty lists in lists (and all… (#6987)
2020-02-05 12:02:39 -08:00
fangfengbin
ade7ebfc0c
Add service based gcs client (#6686)
2020-02-05 12:06:25 +08:00
Eric Liang
37053443b4
Restore set omp (#7051)
2020-02-04 15:02:23 -08:00
Simon Mo
dd095c476a
Move serve and asyncio tests to bazel (#6979)
2020-02-04 08:29:16 -08:00
Edward Oakes
844f607c93
Collect contained ObjectIDs during deserialization (#7029)
2020-02-03 22:49:14 -08:00
Simon Mo
5e8ded344a
[Serve] Fix flaky test with nursery double init (#6982)
2020-02-03 21:32:12 -08:00
Edward Oakes
984490d2be
Collect object IDs during serialization (#6946)
2020-02-03 18:38:11 -08:00
SangBin Cho
ca5a9c6739
Exclude test profiling info endpoint (#7030)
* Skip test_profiling_info_endpoint when pytest running locally

* Fixed formatting.

* Fixed the reason for skipping the test based on PR comments
2020-02-03 16:49:03 -08:00
Siyuan (Ryans) Zhuang
42cbf801e1
workaround for python3.5 fast numpy serialization (#6675)
2020-02-03 13:08:18 -08:00