658 commits

Author SHA1 Message Date
Yuhong Guo
ba3fe04629 Fix message type to string crash (#4308)
* Fix message string crash

* Fix
2019-03-09 13:51:02 -08:00
Stephanie Wang
edc794751f Set TCP_NODELAY on all TCP connections (#4318) 2019-03-09 12:15:29 -08:00
Yuhong Guo
b9ea821d16
Use strongly typed IDs in C++. (#4185)
* Use strongly typed IDs for C++.

* Avoid heap allocation in cython.

* Fix JNI part

* Fix rebase conflict

* Refine

* Remove type check from __init__

* Remove unused constructor declarations.
2019-03-07 21:43:01 +08:00
Stephanie Wang
0ccaf118a2
Disconnect object manager clients if receiving an object fails (#4141)
* Disconnect object manager clients if ReadBuffer fails

* unused

* put back EINTR handling
2019-03-05 22:08:26 -08:00
Stephanie Wang
8b871af555
Fix ray.wait bug for tasks on remote nodes and timeout=0 (#4242)
* Regression test

* Fix

* cleaner code
2019-03-04 11:46:06 -08:00
Yuhong Guo
6f46edca51 Skip dead nodes to avoid connection timeout. (#4154) 2019-03-02 13:11:19 -08:00
Hao Chen
484708d44d
Fix JNI throwing exception (#4178) 2019-02-28 15:11:25 +08:00
Philipp Moritz
615d5516d1 Compile valgrind tests with Bazel (#4144) 2019-02-24 00:00:49 -08:00
Philipp Moritz
ba52caff37 Make Bazel the default build system (#3898) 2019-02-23 11:58:59 -08:00
Philipp Moritz
9b3ce3e64b Revert inline objects PR (#4125)
* Revert "Inline objects (#3756)"

This reverts commit f987572795.

* fix rebase problems

* more rebase fixes

* add back debug statement
2019-02-22 18:21:01 -08:00
Tianming Xu
692bb336a1 Fix master branch compilation error and lint error (#4109) 2019-02-21 11:54:30 -08:00
Yuhong Guo
3549cd8195
Add the Delete function in GCS (#4081)
* Add the Delete function in GCS

* Unify BatchDelete and Delete

* Fix comment

* Lint

* Refine according to comments

* Unify test.

* Address comment

* C++ lint

* Update ray_redis_module.cc
2019-02-21 13:33:37 +08:00
Hao Chen
de17443dc2
Propagate backend error to worker (#4039) 2019-02-16 11:39:15 +08:00
Stephanie Wang
3684e5bc0d Fix memory leak in Redis by using auto memory management (#4054)
* Table appends should always succeed

* Use Redis auto memory management

* Remove unneeded namespace
2019-02-14 19:51:18 -08:00
Philipp Moritz
810cc17062 Fix LRU eviction of client notification datastructure (#4021)
* convert notification_key map to C++ datastructure

* fix crash and add debug string

* clean notification map up (this was a bug before)

* remove checks

* add jenkins test

* linting

* fixes

* properly erase

* clean up

* linting

* Update test_wait_hanging.py

* Update run_multi_node_tests.sh

* increase redis_max_memory

* fix dat jenkins

* update

* Update run_multi_node_tests.sh
2019-02-13 22:20:27 -08:00
Stephanie Wang
fd5b58a827 Increase timeout for object manager valgrind tests (#4027)
* Avoid second copy of data for inlined objects

* Increase Wait timeout for valgrind tests

* Run object manager tests with and without inlined objects

* Fix test
2019-02-13 18:29:03 -08:00
Stephanie Wang
4347ab644e
Use Redis lists in the GCS instead of zset (#4023)
* Convert zset to list

* Remove object evictions map from the object directory, yay

* comments

* Fix tests
2019-02-13 10:32:57 -08:00
Hao Chen
f31a79f3f7
Implement actor checkpointing (#3839)
* Implement Actor checkpointing

* docs

* fix

* fix

* fix

* move restore-from-checkpoint to HandleActorStateTransition

* Revert "move restore-from-checkpoint to HandleActorStateTransition"

This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.

* resubmit waiting tasks when actor frontier restored

* add doc about num_actor_checkpoints_to_keep=1

* add num_actor_checkpoints_to_keep to Cython

* add checkpoint_expired api

* check if actor class is abstract

* change checkpoint_ids to long string

* implement java

* Refactor to delay actor creation publish until checkpoint is resumed

* debug, lint

* Erase from checkpoints to restore if task fails

* fix lint

* update comments

* avoid duplicated actor notification log

* fix unintended change

* add actor_id to checkpoint_expired

* small java updates

* make checkpoint info per actor

* lint

* Remove logging

* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActorManager

* Replace old actor checkpointing tests

* Fix test and lint

* address comments

* consolidate kill_actor

* Remove __ray_checkpoint__

* fix non-ascii char

* Loosen test checks

* fix java

* fix sphinx-build
2019-02-13 19:39:02 +08:00
Zhijun Fu
7097ba393b protect raylet against bad messages (#4003)
* protect raylet against bad messages

* address comments

* linting and regression test
2019-02-12 00:39:38 +08:00
Yuhong Guo
5fb1efd60d Fix CI test failures (#4007) 2019-02-11 11:01:14 +08:00
Yuhong Guo
3a66d47a3a
Remove RAY_CHECK from JNI code (#3978)
* Remove RAY_CHECK in JNI

* Try to add mvn test to test the exception.

* Refine

* Address comments
2019-02-09 18:10:22 +08:00
Robert Nishihara
ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Ion
f987572795 Inline objects (#3756)
* added store_client_ to object_manager and node_manager

* half through...

* all code in, and compiling! Nothing tested though...

* something is working ;-)

* added a few more comments

* now, add only one entry to the GCS for inlined objects

* more comments

* remove a spurious todo

* some comment updates

* add test

* added support for meta data for inline objects

* avoid some copies

* Initialize plasma client in tests

* Better comments. Enable configuring inline_object_max_size_bytes.

* Update src/ray/object_manager/object_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* fixed comments

* fixed various typos in comments

* updated comments in object_manager.h and object_manager.cc

* addressed all comments...hopefully ;-)

* Only add eviction entries for objects that are not inlined

* fixed a bunch of comments

* Fix test

* Fix object transfer dump test

* lint

* Comments

* Fix test?

* Fix test?

* lint

* fix build

* Fix build

* lint

* Use const ref

* Fixes, don't let object manager hang

* Increase object transfer retry time for travis?

* Fix test

* Fix test?

* Add internal config to java, fix PlasmaFreeTest
2019-02-07 10:32:39 -08:00
Stephanie Wang
49e9bec988
Fix raylet bug in driver cleanup (#3962)
* Fix task dependency manager cleanup on driver exit

* Add regression test

* Better check, update header
2019-02-06 11:19:10 -08:00
Stephanie Wang
244fd473f4
Only mark tasks as forwarded if they are in the lineage cache (#3958) 2019-02-05 23:01:38 -08:00
Eric Liang
5fb813ff39
Don't check fail on missing lineage cache entry (#3861) 2019-02-04 17:45:41 -08:00
Kai Yang
02766adeca Limit maximum starting workers per language (#3852) 2019-01-29 21:43:12 -08:00
Yuhong Guo
c45b91dcca Make redis module safe without crashing by removing RAY_CHECK (#3855) 2019-01-29 21:06:31 -08:00
Philipp Moritz
0aadf11c10 Fix compilation on macOS by adding virtual destructors (#3878) 2019-01-28 13:22:52 -08:00
Stephanie Wang
eddd60e14e Improve backend debug logging, refactor scheduling queues (#3819) 2019-01-26 16:15:48 +08:00
Philipp Moritz
20162ce159 Compile raylet cython bindings with bazel (#3842) 2019-01-25 00:57:31 -08:00
Si-Yuan
48139cf861 Migrate Python C extension to Cython (#3541) 2019-01-24 09:17:14 -08:00
Yuhong Guo
c1a52b1c86 Remove duplicated code in RayConfig (#3831) 2019-01-24 17:04:10 +08:00
Hao Chen
bfcf254e52 Fix: do not treat actor task as failed if the actor will be reconstructed (#3736) 2019-01-23 23:28:44 -08:00
Robert Nishihara
0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Philipp Moritz
931e6a2fc3 Fix compilation error on ARM. (#3800) 2019-01-18 00:25:16 -08:00
Si-Yuan
16a3b99d8d Get rid of Arrow test utils (#3734)
* convert code to proper C++

* revert changes to "id.h" because #3765 has been merged.

* revert changes to Python bindings because they will be removed in #3541

* remove dependencies of Arrow logging

* revert changes to Arrow logging

* lint
2019-01-17 18:35:41 -08:00
Hao Chen
d1840bc7a9 Simplify RayConfig (#3714) 2019-01-16 16:43:26 -08:00
Tianming Xu
0b8008f41c remove RAY_CHECK around wait_state.remaining.erase (#3745) 2019-01-14 10:32:31 -08:00
Philipp Moritz
02bdaf221d Update arrow to include https://github.com/apache/arrow/pull/3392 (#3765)
* update arrow to include https://github.com/apache/arrow/pull/3392

* add appropriate includes

* update
2019-01-14 19:20:26 +08:00
Wang Qing
8674606e26 Support to auto-generate Java files from flatbuffer (#3749)
* auto gen flatbuffers for Java

* Add auto_gen_tool.py

* Refine

* Add a comment

* address comments.

* Address comments.

* Addressed

* Refine

* Address comments

* Fix typo

* Add exception

* Address comments.

* Refine

* Fix lint

* Fix

* Fix lint and address comment.

* Fix lint error
2019-01-13 11:39:23 -08:00
Yuhong Guo
d2cf8561f2 Refactor code about ray.ObjectID. (#3674)
* Refactor code about ray.ObjectID.

* remove from_random and use nil_id instead of constructor

* remove id() in hash

* Lint and fix

* Change driver id to ObjectID

* Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()
2019-01-13 01:47:29 -08:00
Wang Qing
fa2bfa6d76 Fix some small code quality issues. (#3719) 2019-01-11 15:24:49 +08:00
Hao Chen
6fc3fc4120 Cap task lease timeout (#3707) 2019-01-09 17:19:48 -08:00
Stephanie Wang
04f31db54d
Actor dummy object garbage collection (#3593)
* Convert UniqueID::nil() to a constructor

* Cleanup actor handle pickling code

* Add new actor handles to the task spec

* Pass in new actor handles

* Add new handles to the actor registration

* Regression test for actor handle forking and GC

* lint and doc

* Handle pickled actor handles in the backend and some refactoring

* Add regression test for dummy object GC and pickled actor handles

* Check for duplicate actor tasks on submission

* Regression test for forking twice, fix failed named actor leak

* Fix bug for forking twice

* lint

* Revert "Fix bug for forking twice"

This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.

* Add new actor handles when task is assigned, not finished

* Remove comment

* remove UniqueID()

* Updates

* update

* fix

* fix java

* fixes

* fix
2019-01-09 10:37:11 -08:00
Wenting Shen
3027dde303 Fix some storage problems of RayLog (#3595)
1. Fix the problem of duplicated stored logs.
2. Save logs whose level is at or above severity_threshold, not only those exactly at severity_threshold.
3. Fix a `log_dir` bug: storing logs in a wrong path.
2019-01-09 13:54:21 +08:00
Robert Nishihara
067976ad3d Push a warning to all users when large number of workers have been started. (#3645)
* Push a warning to all users when large number of workers have been started.

* Add test.

* Fix bug.

* Give warning when worker starts instead of when worker registers.

* Fix

* Fix tests
2019-01-05 13:27:32 -08:00
Robert Nishihara
b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00
Yuhong Guo
93e9d2b82c Improve backend log: env variable setting and format refine. (#3662)
* Improve backend logging

* Address comment

* Fix Raul's comment
2019-01-01 21:45:29 -08:00
Zhijun Fu
382b138fc7 fix code issues in object manager that are reported by scanning tool (#3649)
Fix some code issues found by code scanning tool:

**1. Macro compares unsigned to 0(NO_EFFECT)**

CWE570: An unsigned value can never be less than 0
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "this->create_buffer_state_[object_id].num_seals_remaining >= 0UL".

~/ray/src/ray/object_manager/object_buffer_pool.cc: ray::ObjectBufferPool::SealChunk(const ray::UniqueID &, unsigned long)

**2. Inferred misuse of enum(MIXED_ENUMS)**

CWE398: An integer expression which was inferred to have an enum type is mixed with a different enum type
This case, "static_cast(ray::object_manager::protocol::MessageType::PushRequest)", implies the effective type of "message_type" is "ray::object_manager::protocol::MessageType".

~/ray/src/ray/object_manager/object_manager.cc: ray::ObjectManager::ProcessClientMessage(std::shared_ptr> &, long, const unsigned char *)
2018-12-28 14:38:59 -08:00
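
The first finding in the entry above (CWE570 in commit 382b138fc7) can be illustrated with a minimal sketch. It is not the code from object_buffer_pool.cc: `BufferState`, `DecrementUnchecked`, and `TrySeal` are hypothetical names introduced here, and only the field name `num_seals_remaining` comes from the commit message. The sketch shows why a `>= 0` check on an unsigned counter can never fail, and one possible way to guard the decrement instead.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for per-object buffer state; only the field name
// num_seals_remaining is taken from the commit message.
struct BufferState {
  std::uint64_t num_seals_remaining = 0;
};

// Decrementing an unsigned zero wraps around to the maximum value instead of
// becoming negative, so a later "num_seals_remaining >= 0" check is always true.
inline std::uint64_t DecrementUnchecked(BufferState &state) {
  return --state.num_seals_remaining;
}

// One possible alternative (an assumption, not the actual fix): refuse to
// decrement when nothing remains, so the counter never wraps.
inline bool TrySeal(BufferState &state) {
  if (state.num_seals_remaining == 0) {
    return false;  // Nothing left to seal; a decrement would wrap around.
  }
  --state.num_seals_remaining;
  return true;
}
```

With a counter that starts at 2, `TrySeal` succeeds twice and then returns false, leaving the counter at 0. `DecrementUnchecked` on a zero counter yields `std::numeric_limits<std::uint64_t>::max()`.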