Commit graph

1638 commits

Author SHA1 Message Date
Yuhong Guo
0136af5aac Add return value for recontruction RPC. (#3493)
* Add return value for recontruct RPC.

* Fix comment function name
2018-12-09 00:08:44 -08:00
Stephanie Wang
4abafd7e62 Fix bug in ray.wait (#3445)
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:

1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
2018-12-01 19:40:33 -08:00
Stephanie Wang
48a5935224 Fault tolerance for actor creation (#3422)
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Tianming Xu
139fbf7884 Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403) 2018-11-27 23:51:18 -08:00
Eric Liang
c2108ca64f Don't put entire actor registry in debug string since it's too long (#3395) 2018-11-27 16:48:12 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache (#3366)
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Eric Liang
686cf20951 Remove uses of std::list::size (#3358)
* worker pool and client conn

* Fix linting

* unordered set

* move
2018-11-20 14:47:55 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient (#3324)
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. (#3011) 2018-11-20 09:52:27 -08:00
Robert Nishihara
f2b5500642 Add ordered_set container. (#3352)
* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator
2018-11-19 17:01:18 -08:00
Eric Liang
d4dbd27e0d Don't retry IPC connect an absurd number of times (#3355) 2018-11-19 16:23:59 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Robert Nishihara
60b22d9a72 Don't unsubscribe dependencies for infeasible tasks. (#3338)
* Make scheduling queues RemoveTasks return task states as well.

* Add test

* Don't unsubscribe for infeasible tasks when spilling over.

* Linting

* Address comments.
2018-11-16 11:33:00 -08:00
Eric Liang
e0bf9d7305 Add debug string to raylet (#3317)
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00
Philipp Moritz
1be1455d86 Fix redis crash when duplicate messages are appended to log. (#3316) 2018-11-15 15:09:39 -08:00
Philipp Moritz
b6a12d1f97 Fix socket retry message (#3325) 2018-11-15 12:14:19 -08:00
Stephanie Wang
577c1dda74 Release sender connections as soon as WriteMessageAsync completes (#3313) 2018-11-13 21:32:24 -05:00
Ion
d681893b0f Speed up task dispatch. (#3234)
* speed up task dispatch

* minor changes

* improved comments

* improved comments

* change argument of DispatchTasks to list of tasks

* dispatch only tasks whose dependencies have been fullfiled

* some updated comments

* refactored DispatchQueue() and Assigntask() to avoid the copy of the ready list

* minor fixes

* some more minor fixes

* some more minor fixes

* added more comments

* better comments?

* fixed all feedback comments, minus making the argument of AssignTask() const

* Assigntask() now taskes a const argument

* Do the task copy outside of the callback

* fix linting
2018-11-10 09:55:12 -08:00
Eric Liang
9b2794101d
[minor] Change chunk already exists to DEBUG, add flags for rllib multi node testing (#3228) 2018-11-08 00:04:20 -08:00
Stephanie Wang
d950e92f63
Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Richard Liaw
0bab8ed95c
Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
2018-11-07 21:46:02 -08:00
Eric Liang
29e3362905
Better errors on process deaths (#3252) 2018-11-07 14:08:16 -08:00
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Philipp Moritz
4182b85611 Cache resources in SchedulingQueue (#3232)
* cache resources

* fix

* documentation and remove old code

* fix PR

* update documentation

* linting
2018-11-06 21:23:31 -08:00
Stephanie Wang
ca585703b2 Refactor ObjectDirectory to reduce and fix callback usage (#3227) 2018-11-06 20:33:10 -08:00
Wang Qing
4968cc5d70 Fix a small typo (#3240) 2018-11-05 18:30:53 -08:00
Stephanie Wang
bf88aa5013
Increase timeout before reconstruction is triggered (#3217)
* Increase timeout to 10s

* Skip eviction reconstruction tests

* Add stress test for many actors to one

* Fix test by shortening it.

* lower number of processes in stress test

* Skip slow test
2018-11-05 18:03:50 -08:00
Ion
d8ae9de99c Caching task resource requirements. (#3231)
* caching resource requirements

* small fixes

* avoid copying the resource map
2018-11-05 15:14:09 -08:00
Philipp Moritz
0da15b1c1f Fix build system dependency for local_scheduler_client (#3215) 2018-11-03 13:19:02 -07:00
Stephanie Wang
aacbd007a0
[xray] Implement faster flush policy for lineage cache (#3071)
* Policy that flushes the lineage stash immediately

* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to, add reg test

* test

* Fix bug where waiting task was getting subscribed

* Cleanup

* Update src/ray/raylet/lineage_cache.cc

Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>

* Update src/ray/raylet/lineage_cache.cc

Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>

* cleanup

* cleanup

* Add another test for task with many parents

* fix, unsubscribe to new waiting tasks

* Unsubscribe as soon as the commit notification is handled
2018-10-30 09:59:50 -07:00
Robert Nishihara
fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Yuhong Guo
befbf78048 Delete empty pubsub keys (#3146)
We found that there are large amount of pub-sub keys with no content in it (This case is worse when wait-id is used in the key name.).
This logic of deleting empty pub-sub keys from GCS was in legacy ray but not in raylet.
2018-10-27 11:58:39 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Philipp Moritz
8d8b6e5bfa Retry connections to redis for async and subscribe contexts (#3105)
This is fixing a problem that @devin-petersohn observed on the windows subsystem for linux.

In theory, redis should be up once the async connect is happening and there should be no retries needed for the async connect. However on the windows subsystem for linux, the async connect was failing even though the synchronous one was working. Maybe windows has a different semantics here than linux.
2018-10-22 22:31:13 -07:00
Wang Qing
a4db5bbaea Fill driver id into actor notification when finishing assigned task. (#3080)
## What do these changes do?
Fill driver id into actor notification when finishing assigned task.
Also it improves codes.
2018-10-21 11:12:20 +08:00
Eric Liang
9d23fa03c9 [xray] All messages on main asio event loop should be written asynchronously (#3023)
* copy over ref code

* wip async writes

* compiles

* fix error handling

* add test

* amend

* fix test

* clang fmgt

* clang format

* wip

* yapf

* rename format script

* test error

* clangfmt

* add test to list

* warn

* ref test

* fix test

* comment

* add capture

* Update client_connection.cc

* wip

* fix compile
2018-10-18 21:56:22 -07:00
Yuhong Guo
653c5b114a [c++] Refine Log Code (#2816)
* Support setting logging level from env variable

* Remove Env Variable related code

* lint
2018-10-18 10:51:36 -07:00
Peter Schafhalter
a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Adress comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00
Robert Nishihara
faa31ae018 Introduce concept of resources required for placing a task. (#2837)
* Introduce concept of resources required for placement.
* Add placement resources to task spec
* Update java worker
* Update taskinfo.java
2018-10-04 10:35:39 -07:00
Richard Liaw
01bb073569 Suppress errors when worker or driver intentionally disconnects. (#2935) 2018-10-04 00:06:34 -07:00
Robert Nishihara
3ce8eb2d4c Test dying_worker_get and dying_worker_wait for xray. (#2997)
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.
2018-10-02 00:08:47 -07:00
Wang Qing
a879302355 Improve log message when failing to fork worker process (#2990)
## What do these changes do?
```c++
  // Try to execute the worker command.
  int rv = execvp(worker_command_args[0],
                  const_cast<char *const *>(worker_command_args.data()));
  // The worker failed to start. This is a fatal error.
  RAY_LOG(FATAL) << "Failed to start worker with return value " << rv;
```
When starting a process fails, the return value `rv` always be set to -1.
It is useless for us.
The log message should show some meaningful infos.

For example, If we did't install java. The message showed for us should be:
```shell
 Failed to start worker: No such file or directory.
```
This could help us to locate issue quickly.

## Related issue number
N/A
2018-09-29 22:10:57 +08:00
Hao Chen
971df5ea8a [java] put function meta in task spec and load functions with function meta (#2881)
This PR adds a `function_desc` field into task spec. a function descriptor is a list of strings that can uniquely describe a function.
- For a Python function, it should be: [module_name, class_name, function_name]
- For a Java function, it should be: [class_name, method_name, type_descriptor]

There're a couple of purposes to add this field:

In this PR:
- Java worker needs to know function's class name to load it. Previously, since task spec didn't have such a field to hold this info, we did a hack by appending the class name to the argument list. With this change, we fixed that hack and significantly simplified function management in Java.

Will be done in subsequent PRs:
- Support cross-language invocation (#2576): currently Python worker manages functions by saving them in GCS and pass function id in task spec. However, if we want to call a Python function from Java, we cannot save it in GCS and get the function id. But instead, we can pass the function descriptor (module name, class name, function name) in task spec and use it to load the function.
- Support deployment: one major problem of Python worker's current function management mechanism is #2327. In prod env, we should have a mechanism to deploy code and dependencies to the cluster. And when code is already deployed, we don't need to save functions to GCS any more and can use `function_desc` to manage functions.
2018-09-25 23:05:05 -07:00
Hanwei Jin
9f9e49e4a1 [cmake] enable using thirdparty env variable to find installed dependency (#2912)
* enable using thirdparty env variable to find installed dependency, to speed up the build process

* fix target dependency in cmake. :-) too chaos in each CMakeLists

* check env variable defined directory exists
2018-09-23 07:52:33 -07:00
Yuhong Guo
b29839a0a3 Fix node manager failure when ClientTable has a disconnected entry. (#2905)
When a new raylet starts, `ClientAdded` will be called with the disconnected client data. However, since the client was closed, the connection will fail.
2018-09-21 22:45:06 -07:00
Hao Chen
715ec1bca5 Modularize NodeManager::ProcessClientMessage (#2895)
Split NodeManager::ProcessClientMessage into a couple of smaller functions, each of which handles one type of message.
2018-09-18 14:18:34 -07:00
Yuhong Guo
a8248e8628 Fix ObjectManager Crash (#2833)
Fixes issue where object manager sometimes crashes within the `Wait` method: The issue stems from inconsistent behavior of the boost deadline timer's `cancel` method, which is invoked within `WaitComplete` to enforce exactly one `WaitComplete` invocation for each `Wait` request. The `cancel` method sometimes fails to actually prevent the timer's invocation of the provided handler with non-zero error code.
2018-09-16 02:14:13 -04:00
Philipp Moritz
47d2f82c6c Fix common cmake dependencies (#2876) 2018-09-15 22:11:12 -07:00