
662 commits

Author SHA1 Message Date
Robert Nishihara
067976ad3d Push a warning to all users when large number of workers have been started. (#3645)
* Push a warning to all users when large number of workers have been started.

* Add test.

* Fix bug.

* Give warning when worker starts instead of when worker registers.

* Fix

* Fix tests
2019-01-05 13:27:32 -08:00
Robert Nishihara
b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00
Yuhong Guo
93e9d2b82c Improve backend log: env variable setting and format refine. (#3662)
* Improve backend logging

* Address comment

* Fix Raul's comment
2019-01-01 21:45:29 -08:00
Zhijun Fu
382b138fc7 fix code issues in object manager that are reported by scanning tool (#3649)
Fix some code issues found by a code scanning tool:

**1. Macro compares unsigned to 0 (NO_EFFECT)**

CWE570: An unsigned value can never be less than 0
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "this->create_buffer_state_[object_id].num_seals_remaining >= 0UL".

~/ray/src/ray/object_manager/object_buffer_pool.cc: ray::ObjectBufferPool::SealChunk(const ray::UniqueID &, unsigned long)

**2. Inferred misuse of enum (MIXED_ENUMS)**

CWE398: An integer expression which was inferred to have an enum type is mixed with a different enum type
This case, "static_cast(ray::object_manager::protocol::MessageType::PushRequest)", implies the effective type of "message_type" is "ray::object_manager::protocol::MessageType".

~/ray/src/ray/object_manager/object_manager.cc: ray::ObjectManager::ProcessClientMessage(std::shared_ptr> &, long, const unsigned char *)
2018-12-28 14:38:59 -08:00
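The first finding in commit 382b138fc7 above (an unsigned value compared against zero) can be illustrated with a small sketch. It assumes a 64-bit unsigned counter, modelled here with `ctypes.c_uint64`; the helper name `decrement_unsigned` is hypothetical and not part of Ray.

```python
import ctypes


def decrement_unsigned(value: int) -> int:
    """Decrement a value the way a 64-bit unsigned integer would, wrapping on underflow."""
    return ctypes.c_uint64(value - 1).value


# Decrementing an unsigned zero does not produce -1; it wraps to the largest value.
remaining = decrement_unsigned(0)

# Because the value can never be negative, this check is always true and
# therefore cannot detect an underflow.
always_true = remaining >= 0
```

A check that is meant to catch too many decrements has to test the value before decrementing (for example `remaining > 0`) rather than testing `>= 0` afterwards.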
Zhijun Fu
3df1e1c471 Add missing lock in FreeObjects of object buffer pool (#3647)
The object manager uses multiple threads to transfer objects between nodes, so the plasma client used in object_buffer_pool_ must be protected by a lock. We have seen crashes caused by a missing lock in the FreeObjects() interface; this PR fixes that issue.
2018-12-28 11:47:31 -08:00
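The locking rule described in commit 3df1e1c471 above can be sketched as follows. This is a minimal Python model, not Ray's C++ code: the class `BufferPool` and its methods are hypothetical stand-ins for a pool that wraps a client which is not thread-safe.

```python
import threading


class BufferPool:
    """Hypothetical pool whose shared state must only be touched under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = set()

    def create(self, object_id):
        with self._lock:
            self._objects.add(object_id)

    def free_objects(self, object_ids):
        # Take the same lock as every other method that touches the shared state.
        with self._lock:
            for object_id in object_ids:
                self._objects.discard(object_id)

    def size(self):
        with self._lock:
            return len(self._objects)


pool = BufferPool()
for object_id in range(1000):
    pool.create(object_id)

# Four threads free disjoint slices of the objects concurrently.
threads = [
    threading.Thread(target=pool.free_objects, args=(range(start, 1000, 4),))
    for start in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The point of the sketch is that every entry point, including the one that frees objects, acquires the same lock; leaving one method unguarded is enough to corrupt shared state.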
Hao Chen
0b682d043e Fix memory leak in PyRayletClient (#3640)
1) When using `PyObject_GetIter`, the caller must call `Py_DECREF` on the returned iterator to avoid a memory leak. With `PyList_GetItem`, which returns a borrowed reference, `Py_DECREF` isn't needed.
2) The `Py_BuildValue` call in `wait` doesn't need to increment the reference count.
2018-12-27 17:39:02 -08:00
Hao Chen
f4011754d6 Fix: ServerConnection should be closed before being removed (#3626)
Otherwise, in the event of a remote raylet crashing, the connection might be held by boost asio forever, and the pending callbacks will never get invoked. See also #3586.
2018-12-25 11:01:53 -08:00
Robert Nishihara
ddd4c842f1 Initialize some variables in constructor instead of header file. (#3617)
* Initialize some variables in constructor instead of header file
2018-12-23 02:44:23 -08:00
Alexey Tumanov
bada42c334 object store notification mgr: fix using uninitialized variables (#3592)
Initialize private class variables that were being used before initialization, which caused valgrind errors.
2018-12-22 19:51:22 -08:00
Philipp Moritz
e578a38116 Fix TensorFlow and PyTorch compatibility (#3574)
* remove tensorflow workaround
* update docker
* add boost threads
* add date_time, too
* change link order
* cosmetics
2018-12-22 13:25:48 -08:00
Alexey Tumanov
6b179cb8a7 change the order of allocation for io_service and gcs client in raylet main (#3597) 2018-12-21 00:13:28 -08:00
Hao Chen
132a23354e Fix pending callback not called when ServerConnection destructs (#3572) 2018-12-19 17:29:36 -08:00
Yuhong Guo
fb33fa9097 Enable function_descriptor in backend to replace the function_id (#3028) 2018-12-18 18:53:59 -05:00
Stephanie Wang
26ca40817e Convert UniqueID::nil() to a constructor (#3564)
* Initialize UniqueID to nil

* Return reference to static const variable
2018-12-18 11:59:02 -08:00
Yuhong Guo
75ddf7cca4 Fix 2 small bugs (#3573) 2018-12-18 14:52:21 -05:00
Robert Nishihara
417c7f2d6f Update arrow and remove plasma_manager references. (#3545) 2018-12-15 23:36:02 -08:00
Philipp Moritz
b3bf608608 Update arrow to reduce plasma IPCs. (#3497) 2018-12-14 23:49:37 -05:00
Stephanie Wang
fcc37021b2 Throw exception for ray.get of an evicted actor object (#3490)
* Add a flag for whether an object has been created before

* Add regression test

* doc

* Share object directory between object and node managers

* Treat evicted actor tasks as failed

* minor

* Check return value

* Fix bug where object locations weren't getting updated on client death

* Fix mac build

* Use RayTaskError
2018-12-14 11:41:27 -08:00
Yuhong Guo
a4abe6c0fe Add test to test raylet client connection when raylet crashes. (#3518) 2018-12-13 23:40:50 -08:00
Hao Chen
e7b51cbd1b [xray] Implement Actor Reconstruction (#3332)
* Implement Actor Reconstruction

* fix

* fix actor handle __del__

* fix lint

* add comment

* Remove actorCreationDummyObjectId

* address comments

* fix

* address comments

* avoid copy

* change log to debug

* fix error name
2018-12-13 21:28:58 -08:00
Alexey Tumanov
2455de78ce save initial config instead of initial resource config (#3532) 2018-12-13 20:39:42 -08:00
Si-Yuan
84fae57ab5 Convert the raylet client (the code in local_scheduler_client.cc) to proper C++. (#3511)
* refactoring

* fix bugs

* create client class

* create client class for java; bug fix

* remove legacy code

* improve code by using std::string and std::unique_ptr, renaming private fields, and removing legacy code

* rename class

* improve naming

* fix

* rename files

* fix names

* change name

* change return types

* make a mutex private field

* fix comments

* fix bugs

* lint

* bug fix

* bug fix

* move too short functions into the header file

* Loosen crash conditions for some APIs.

* Apply suggestions from code review

Co-Authored-By: suquark <suquark@gmail.com>

* format

* update

* rename python APIs

* fix java

* more fixes

* change types of cpython interface

* more fixes

* improve error processing

* improve error processing for java wrapper

* lint

* fix java

* make fields const

* use pointers for [out] parameters

* fix java & error msg

* fix resource leak, etc.
2018-12-13 13:39:10 -08:00
Eric Liang
20c7fad4f4 Move actor table to primary redis context 2018-12-12 16:51:29 -08:00
Eric Liang
cffe8f9806 Add option to evict keys LRU from the sharded redis tables (#3499)
* wip

* wip

* format

* wip

* note

* lint

* fix

* flag

* typo

* raise timeout

* fix

* optional get

* fix flag

* increase timeout in test

* update docs

* format
2018-12-09 05:48:52 -08:00
Yuhong Guo
0136af5aac Add return value for reconstruction RPC. (#3493)
* Add return value for reconstruct RPC.

* Fix comment function name
2018-12-09 00:08:44 -08:00
Stephanie Wang
4abafd7e62 Fix bug in ray.wait (#3445)
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:

1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
2018-12-01 19:40:33 -08:00
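The sequence described in commit 4abafd7e62 above can be reproduced with a toy model. This sketch is not the raylet's code and does not claim to show the change made in #3445; it only illustrates one plausible guard, where a callback for a wait request that no longer exists is ignored. The class `WaitManager` and its methods are hypothetical.

```python
class WaitManager:
    """Toy model of wait requests that are completed by object-location callbacks."""

    def __init__(self):
        self._requests = {}  # wait_id -> {"num_returns": int, "found": list}
        self.completed = {}  # wait_id -> object IDs returned to the client

    def wait(self, wait_id, object_ids, num_returns):
        self._requests[wait_id] = {"num_returns": num_returns, "found": []}
        # Locations are already cached, so one callback is posted per object.
        return [(wait_id, object_id) for object_id in object_ids]

    def on_location(self, wait_id, object_id):
        request = self._requests.get(wait_id)
        if request is None:
            # The wait has already completed: ignore the late callback
            # instead of treating it as a fatal error.
            return False
        request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            self.completed[wait_id] = request["found"]
            del self._requests[wait_id]
        return True


manager = WaitManager()
# Steps 1-3: objects A and B exist, the client waits for one of them,
# and a callback is posted for each object.
callbacks = manager.wait("w1", ["A", "B"], num_returns=1)
# Steps 4-5: the callback for A completes the wait; the callback for B arrives late.
results = [manager.on_location(wait_id, object_id) for wait_id, object_id in callbacks]
```

In this model the second callback returns `False` and the process keeps running, whereas an unconditional lookup of the removed request would raise.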
Stephanie Wang
48a5935224 Fault tolerance for actor creation (#3422)
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Tianming Xu
139fbf7884 Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403) 2018-11-27 23:51:18 -08:00
Eric Liang
c2108ca64f Don't put entire actor registry in debug string since it's too long (#3395) 2018-11-27 16:48:12 -08:00
Stephanie Wang
6b3236349c Fix memory leak in lineage cache (#3366)
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Stephanie Wang
3e33f6f71b Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Eric Liang
686cf20951 Remove uses of std::list::size (#3358)
* worker pool and client conn

* Fix linting

* unordered set

* move
2018-11-20 14:47:55 -08:00
Philipp Moritz
d3697ce4e1 Ready queue refactor to make Dispatching tasks more efficient (#3324)
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. (#3011) 2018-11-20 09:52:27 -08:00
Robert Nishihara
f2b5500642 Add ordered_set container. (#3352)
* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator
2018-11-19 17:01:18 -08:00
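Commit f2b5500642 above does not spell out the container's semantics, so the following is only a sketch under assumptions: that an ordered set keeps insertion order, rejects duplicates, supports iteration, and reports its size in constant time (the property that motivates removing the O(n) `list.size()` call). The class `OrderedSet` and its method names are hypothetical and written in Python rather than the C++ of the actual change.

```python
from collections import OrderedDict


class OrderedSet:
    """Insertion-ordered set with constant-time membership, erase, and size."""

    def __init__(self):
        self._items = OrderedDict()

    def push_back(self, item):
        """Append item if it is not already present; return whether it was added."""
        if item in self._items:
            return False
        self._items[item] = None
        return True

    def pop_front(self):
        """Remove and return the oldest item."""
        item, _ = self._items.popitem(last=False)
        return item

    def erase(self, item):
        """Remove item if present; return whether anything was removed."""
        if item in self._items:
            del self._items[item]
            return True
        return False

    def __contains__(self, item):
        return item in self._items

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


queue = OrderedSet()
for task in ["t1", "t2", "t1", "t3"]:
    queue.push_back(task)  # the second "t1" is rejected as a duplicate
queue.erase("t2")
```

After these calls the container holds `t1` and `t3` in that order, and `len(queue)` does not need to walk the elements.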
Eric Liang
d4dbd27e0d Don't retry IPC connect an absurd number of times (#3355) 2018-11-19 16:23:59 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Robert Nishihara
60b22d9a72 Don't unsubscribe dependencies for infeasible tasks. (#3338)
* Make scheduling queues RemoveTasks return task states as well.

* Add test

* Don't unsubscribe for infeasible tasks when spilling over.

* Linting

* Address comments.
2018-11-16 11:33:00 -08:00
Eric Liang
e0bf9d7305 Add debug string to raylet (#3317)
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00
Philipp Moritz
1be1455d86 Fix redis crash when duplicate messages are appended to log. (#3316) 2018-11-15 15:09:39 -08:00
Philipp Moritz
b6a12d1f97 Fix socket retry message (#3325) 2018-11-15 12:14:19 -08:00
Stephanie Wang
577c1dda74 Release sender connections as soon as WriteMessageAsync completes (#3313) 2018-11-13 21:32:24 -05:00
Ion
d681893b0f Speed up task dispatch. (#3234)
* speed up task dispatch

* minor changes

* improved comments

* improved comments

* change argument of DispatchTasks to list of tasks

* dispatch only tasks whose dependencies have been fulfilled

* some updated comments

* refactored DispatchQueue() and AssignTask() to avoid the copy of the ready list

* minor fixes

* some more minor fixes

* some more minor fixes

* added more comments

* better comments?

* fixed all feedback comments, minus making the argument of AssignTask() const

* AssignTask() now takes a const argument

* Do the task copy outside of the callback

* fix linting
2018-11-10 09:55:12 -08:00
Eric Liang
9b2794101d [minor] Change chunk already exists to DEBUG, add flags for rllib multi node testing (#3228) 2018-11-08 00:04:20 -08:00
Stephanie Wang
d950e92f63 Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Richard Liaw
0bab8ed95c Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes a command-line option for setting internal config parameters. This is important for letting certain tests (e.g., fault-tolerance tests that remove nodes) run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

2018-11-07 21:46:02 -08:00
Eric Liang
29e3362905 Better errors on process deaths (#3252) 2018-11-07 14:08:16 -08:00
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Philipp Moritz
4182b85611 Cache resources in SchedulingQueue (#3232)
* cache resources

* fix

* documentation and remove old code

* fix PR

* update documentation

* linting
2018-11-06 21:23:31 -08:00
Stephanie Wang
ca585703b2 Refactor ObjectDirectory to reduce and fix callback usage (#3227) 2018-11-06 20:33:10 -08:00