Commit graph

666 commits

Author SHA1 Message Date
Wang Qing
7d776f35e1 Integrate metrics (#4246) 2019-04-02 21:01:02 -07:00
Yuhong Guo
c2c548bdfd Fix broken pipe callback (#4513) 2019-04-02 17:42:18 +08:00
Ruifang Chen
59d74d5e92 [Java] Build Java code with Bazel (#4284) 2019-03-22 14:30:05 +08:00
Ion
59079a799c Signal actor failure (#4196) 2019-03-21 15:17:42 -07:00
Kai Yang
c36d03874b Redis returns OK when removing a non-existent set entry (#4434) 2019-03-21 11:59:15 -07:00
Hao Chen
d03999d01e Cross-language invocation Part 1: Java calling Python functions and actors (#4166) 2019-03-21 13:34:21 +08:00
Stephanie Wang
4ac9c1ed6e Fix bug in cluster mode where driver exits when there are tasks in the waiting queue (#4251) 2019-03-20 10:18:27 -07:00
Kai Yang
7ff56ce826 Introduce set data structure in GCS (#4199)
* Introduce set data structure in GCS. Change object table to Set instance.

* Fix a logic bug. Update python code.

* lint

* lint again

* Remove CURRENT_VALUE mode

* Remove 'CURRENT_VALUE'

* Add more test cases

* rename has_been_created to subscribed.

* Make `changed` parameter type of `bool *`

* Rename mode to notification_mode

* fix build

* RAY.SET_REMOVE return error if entry doesn't exist

* lint

* Address comments

* lint and fix build
2019-03-11 14:42:58 -07:00
Yuhong Guo
ba3fe04629 Fix message type to string crash (#4308)
* Fix message string crash

* Fix
2019-03-09 13:51:02 -08:00
Stephanie Wang
edc794751f Set TCP_NODELAY on all TCP connections (#4318) 2019-03-09 12:15:29 -08:00
Yuhong Guo
b9ea821d16 Use strongly typed IDs in C++. (#4185)
* Use strongly typed IDs for C++.

* Avoid heap allocation in cython.

* Fix JNI part

* Fix rebase conflict

* Refine

* Remove type check from __init__

* Remove unused constructor declarations.
2019-03-07 21:43:01 +08:00
Stephanie Wang
0ccaf118a2 Disconnect object manager clients if receiving an object fails (#4141)
* Disconnect object manager clients if ReadBuffer fails

* unused

* put back EINTR handling
2019-03-05 22:08:26 -08:00
Stephanie Wang
8b871af555 Fix ray.wait bug for tasks on remote nodes and timeout=0 (#4242)
* Regression test

* Fix

* cleaner code
2019-03-04 11:46:06 -08:00
Yuhong Guo
6f46edca51 Skip dead nodes to avoid connection timeout. (#4154) 2019-03-02 13:11:19 -08:00
Hao Chen
484708d44d Fix JNI throwing exception (#4178) 2019-02-28 15:11:25 +08:00
Philipp Moritz
615d5516d1 Compile valgrind tests with Bazel (#4144) 2019-02-24 00:00:49 -08:00
Philipp Moritz
ba52caff37 Make Bazel the default build system (#3898) 2019-02-23 11:58:59 -08:00
Philipp Moritz
9b3ce3e64b Revert inline objects PR (#4125)
* Revert "Inline objects (#3756)"

This reverts commit f987572795.

* fix rebase problems

* more rebase fixes

* add back debug statement
2019-02-22 18:21:01 -08:00
Tianming Xu
692bb336a1 Fix master branch compilation error and lint error (#4109) 2019-02-21 11:54:30 -08:00
Yuhong Guo
3549cd8195 Add the Delete function in GCS (#4081)
* Add the Delete function in GCS

* Unify BatchDelete and Delete

* Fix comment

* Lint

* Refine according to comments

* Unify test.

* Address comment

* C++ lint

* Update ray_redis_module.cc
2019-02-21 13:33:37 +08:00
Hao Chen
de17443dc2 Propagate backend error to worker (#4039) 2019-02-16 11:39:15 +08:00
Stephanie Wang
3684e5bc0d Fix memory leak in Redis by using auto memory management (#4054)
* Table appends should always succeed

* Use Redis auto memory management

* Remove unneeded namespace
2019-02-14 19:51:18 -08:00
Philipp Moritz
810cc17062 Fix LRU eviction of client notification datastructure (#4021)
* convert notification_key map to C++ datastructure

* fix crash and add debug string

* clean notification map up (this was a bug before)

* remove checks

* add jenkins test

* linting

* fixes

* properly erase

* clean up

* linting

* Update test_wait_hanging.py

* Update run_multi_node_tests.sh

* increase redis_max_memory

* fix dat jenkins

* update

* Update run_multi_node_tests.sh
2019-02-13 22:20:27 -08:00
Stephanie Wang
fd5b58a827 Increase timeout for object manager valgrind tests (#4027)
* Avoid second copy of data for inlined objects

* Increase Wait timeout for valgrind tests

* Run object manager tests with and without inlined objects

* Fix test
2019-02-13 18:29:03 -08:00
Stephanie Wang
4347ab644e Use Redis lists in the GCS instead of zset (#4023)
* Convert zset to list

* Remove object evictions map from the object directory, yay

* comments

* Fix tests
2019-02-13 10:32:57 -08:00
Hao Chen
f31a79f3f7 Implement actor checkpointing (#3839)
* Implement Actor checkpointing

* docs

* fix

* fix

* fix

* move restore-from-checkpoint to HandleActorStateTransition

* Revert "move restore-from-checkpoint to HandleActorStateTransition"

This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.

* resubmit waiting tasks when actor frontier restored

* add doc about num_actor_checkpoints_to_keep=1

* add num_actor_checkpoints_to_keep to Cython

* add checkpoint_expired api

* check if actor class is abstract

* change checkpoint_ids to long string

* implement java

* Refactor to delay actor creation publish until checkpoint is resumed

* debug, lint

* Erase from checkpoints to restore if task fails

* fix lint

* update comments

* avoid duplicated actor notification log

* fix unintended change

* add actor_id to checkpoint_expired

* small java updates

* make checkpoint info per actor

* lint

* Remove logging

* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActionManager

* Replace old actor checkpointing tests

* Fix test and lint

* address comments

* consolidate kill_actor

* Remove __ray_checkpoint__

* fix non-ascii char

* Loosen test checks

* fix java

* fix sphinx-build
2019-02-13 19:39:02 +08:00
Zhijun Fu
7097ba393b protect raylet against bad messages (#4003)
* protect raylet against bad messages

* address comments

* linting and regression test
2019-02-12 00:39:38 +08:00
Yuhong Guo
5fb1efd60d Fix CI test failures (#4007) 2019-02-11 11:01:14 +08:00
Yuhong Guo
3a66d47a3a Remove RAY_CHECK from JNI code (#3978)
* Remove RAY_CHECK in JNI

* Try to add mvn test to test the exception.

* Refine

* Address comments
2019-02-09 18:10:22 +08:00
Robert Nishihara
ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Ion
f987572795 Inline objects (#3756)
* added store_client_ to object_manager and node_manager

* half through...

* all code in, and compiling! Nothing tested though...

* something is working ;-)

* added a few more comments

* now, add only one entry to the in GCS for inlined objects

* more comments

* remove a spurious todo

* some comment updates

* add test

* added support for meta data for inline objects

* avoid some copies

* Initialize plasma client in tests

* Better comments. Enable configuring nline_object_max_size_bytes.

* Update src/ray/object_manager/object_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: istoica <istoica@cs.berkeley.edu>

* fiexed comments

* fixed various typos in comments

* updated comments in object_manager.h and object_manager.cc

* addressed all comments...hopefully ;-)

* Only add eviction entries for objects that are not inlined

* fixed a bunch of comments

* Fix test

* Fix object transfer dump test

* lint

* Comments

* Fix test?

* Fix test?

* lint

* fix build

* Fix build

* lint

* Use const ref

* Fixes, don't let object manager hang

* Increase object transfer retry time for travis?

* Fix test

* Fix test?

* Add internal config to java, fix PlasmaFreeTest
2019-02-07 10:32:39 -08:00
Stephanie Wang
49e9bec988 Fix raylet bug in driver cleanup (#3962)
* Fix task dependency manager cleanup on driver exit

* Add regression test

* Better check, update header
2019-02-06 11:19:10 -08:00
Stephanie Wang
244fd473f4 Only mark tasks as forwarded if they are in the lineage cache (#3958) 2019-02-05 23:01:38 -08:00
Eric Liang
5fb813ff39 Don't check fail on missing lineage cache entry (#3861) 2019-02-04 17:45:41 -08:00
Kai Yang
02766adeca Limit maximum starting workers per language (#3852) 2019-01-29 21:43:12 -08:00
Yuhong Guo
c45b91dcca Make redis module safe without crashing by removing RAY_CHECK (#3855) 2019-01-29 21:06:31 -08:00
Philipp Moritz
0aadf11c10 Fix compilation on macOS by adding virtual destructors (#3878) 2019-01-28 13:22:52 -08:00
Stephanie Wang
eddd60e14e Improve backend debug logging, refactor scheduling queues (#3819) 2019-01-26 16:15:48 +08:00
Philipp Moritz
20162ce159 Compile raylet cython bindings with bazel (#3842) 2019-01-25 00:57:31 -08:00
Si-Yuan
48139cf861 Migrate Python C extension to Cython (#3541) 2019-01-24 09:17:14 -08:00
Yuhong Guo
c1a52b1c86 Remove duplicated code in RayConfig (#3831) 2019-01-24 17:04:10 +08:00
Hao Chen
bfcf254e52 Fix: do not treat actor task as failed if the actor will be reconstructed (#3736) 2019-01-23 23:28:44 -08:00
Robert Nishihara
0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Philipp Moritz
931e6a2fc3 Fix compilation error on ARM. (#3800) 2019-01-18 00:25:16 -08:00
Si-Yuan
16a3b99d8d Get rid of Arrow test utils (#3734)
* convert code to proper C++

* revert changes to "id.h" because #3765 has been merged.

* revert changes to Python bindings because they will be removed in #3541

* remove dependencies of Arrow logging

* revert changes to Arrow logging

* lint
2019-01-17 18:35:41 -08:00
Hao Chen
d1840bc7a9 Simplify RayConfig (#3714) 2019-01-16 16:43:26 -08:00
Tianming Xu
0b8008f41c remove RAY_CHECK around wait_state.remaining.erase (#3745) 2019-01-14 10:32:31 -08:00
Philipp Moritz
02bdaf221d Update arrow to include https://github.com/apache/arrow/pull/3392 (#3765)
* update arrow to include https://github.com/apache/arrow/pull/3392

* add appropriate includes

* update
2019-01-14 19:20:26 +08:00
Wang Qing
8674606e26 Support to auto-generate Java files from flatbuffer (#3749)
* auto gen flatbuffers for Java

* Add auto_gen_tool.py

* Refine

* Add a comment

* address comments.

* Address comments.

* Addressed

* Refine

* Address comments

* Fix typo

* Add exception

* Address comments.

* Refine

* Fix lint

* Fix

* Fix lint and address comment.

* Fix lint error
2019-01-13 11:39:23 -08:00
Yuhong Guo
d2cf8561f2 Refactor code about ray.ObjectID. (#3674)
* Refactor code about ray.ObjectID.

* remove from_random and use nil_id instead of constructor

* remove id() in hash

* Lint and fix

* Change driver id to ObjectID

* Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()
2019-01-13 01:47:29 -08:00