Commit graph

30 commits

Author SHA1 Message Date
Eric Liang
0e00533ed4
Different approach to removing RayGetError (#3471) 2018-12-12 20:30:51 -08:00
Stephanie Wang
48a5935224 Fault tolerance for actor creation (#3422)
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Robert Nishihara
32f0d6b77e Deprecate num_workers argument to ray.init and ray start. (#3114)
* Remove num_workers argument.

* Fix

* Fix
2018-10-28 20:12:49 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Robert Nishihara
3ce8eb2d4c Test dying_worker_get and dying_worker_wait for xray. (#2997)
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.
2018-10-02 00:08:47 -07:00
Hanwei Jin
fbf214e408 update ray cmake build process (#2853)
* use cmake to build ray project, no need to appply build.sh before cmake, fix some abuse of cmake, improve the build performance

* support boost external project, avoid using the system or build.sh boost

* keep compatible with build.sh, remove boost and arrow build from it.

* bugfix: parquet bison version control, plasma_java lib install problem

* bugfix: cmake, do not compile plasma java client if no need

* bugfix: component failures test timeout machenism has problem for plasma manager failed case

* bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master

* revert some fix

* set arrow python executable, fix format error in component_failures_test.py

* make clean arrow python build directory

* update cmake code style, back to support cmake minimum version 3.4
2018-09-12 11:19:33 -07:00
Robert Nishihara
eda6ebb87d Convert some unittests to pytest. (#2779)
* Convert multi_node_test.py to pytest.

* Convert array_test.py to pytest.

* Convert failure_test.py to pytest.

* Convert microbenchmarks to pytest.

* Convert component_failures_test.py to pytest and some minor quotes changes.

* Convert tensorflow_test.py to pytest.

* Convert actor_test.py to pytest.

* Fix.

* Fix
2018-08-31 11:24:15 -07:00
Robert Nishihara
b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Stephanie Wang
806fdf2f05 [xray] Object manager retries Pull requests (#2630)
* Move all ObjectManager members to bottom of class def

* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients

* increase object manager Pull timeout

* Make the component failure test harder.

* note

* Notify SubscribeObjectLocations caller of empty list

* Address melih's comments

* Fix wait...

* Make component failure test easier for legacy ray

* lint
2018-08-13 19:15:55 -07:00
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that was failed to be forwarded as pending

* Raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Philipp Moritz
d8ba667175 Convert asserts in unittest to pytest (#2529) 2018-08-01 22:32:10 -07:00
Robert Nishihara
515da7721a Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.

* Deprecate ray.worker.cleanup() gracefully.

* Fix linting
2018-07-12 12:00:00 -07:00
Philipp Moritz
762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Philipp Moritz
74162d1492 Lint Python files with Yapf (#1872) 2018-04-11 10:11:35 -07:00
Philipp Moritz
a3f8fa426b Start integrating new GCS APIs (#1379)
* Start integrating new GCS calls

* fixes

* tests

* cleanup

* cleanup and valgrind fix

* update tests

* fix valgrind

* fix more valgrind

* fixes

* add separate tests for GCS

* fix linting

* update tests

* cleanup

* fix python linting

* more fixes

* fix linting

* add plasma manager callback

* add some documentation

* fix linting

* fix linting

* fixes

* update

* fix linting

* fix

* add spillback count

* fixes

* linting

* fixes

* fix linting

* fix

* fix

* fix
2018-01-31 11:01:12 -08:00
Stephanie Wang
74ac80631b Local scheduler sends a null heartbeat to global scheduler (#962)
* Local scheduler sends a null heartbeat to global scheduler to notify death

* Add whitespace.

* Speed up component failures test

* Free local scheduler state upon plasma manager disconnection
2017-09-12 10:45:21 -07:00
Stephanie Wang
7496c98010 Fault tolerance race (#894)
* Remove race between local scheduler disconnecting and global scheduler
assigning a task

* Fix number of workers started in component failures test

* Fix race between global scheduler retrying a task assignment and monitor
cleaning up task table. The global scheduler should only retry the task
assignment if the local scheduler is still alive.

* Clean up task_table_update callback if failure

* Look up current local scheduler mapping when retrying actor task submission

* Log warning if no subscribers received a task table update

* Clean up database handle memory in local scheduler
2017-08-30 22:20:50 -07:00
Philipp Moritz
c3b39b4d86 Pull Plasma from Apache Arrow and remove Plasma store from Ray. (#692)
* Rebase Ray on top of Plasma in Apache Arrow

* add thirdparty building scripts

* use rebased arrow

* fix

* fix build

* fix python visibility

* comment out C tests for now

* fix multithreading

* fix

* reduce logging

* fix plasma manager multithreading

* make sure old and new object IDs can coexist peacefully

* more rebasing

* update

* fixes

* fix

* install pyarrow

* install cython

* fix

* install newer cmake

* fix

* rebase on top of latest arrow

* getting runtest.py run locally (needed to comment out a test for that to work)

* work on plasma tests

* more fixes

* fix local scheduler tests

* fix global scheduler test

* more fixes

* fix python 3 bytes vs string

* fix manager tests valgrind

* fix documentation building

* fix linting

* fix c++ linting

* fix linting

* add tests back in

* Install without sudo.

* Set PKG_CONFIG_PATH in build.sh so that Ray can find plasma.

* Install pkg-config

* Link -lpthread, note that find_package(Threads) doesn't seem to work reliably.

* Comment in testGPUIDs in runtest.py.

* Set PKG_CONFIG_PATH when building pyarrow.

* Pull apache/arrow and not pcmoritz/arrow.

* Fix installation in docker image.

* adapt to changes of the plasma api

* Fix installation of pyarrow module.

* Fix linting.

* Use correct python executable to build pyarrow.
2017-07-31 21:04:15 -07:00
Robert Nishihara
e0867c8845 Switch Python indentation from 2 spaces to 4 spaces. (#726)
* 4 space indentation for actor.py.

* 4 space indentation for worker.py.

* 4 space indentation for more files.

* 4 space indentation for some test files.

* Check indentation in Travis.

* 4 space indentation for some rl files.

* Fix failure test.

* Fix multi_node_test.

* 4 space indentation for more files.

* 4 space indentation for remaining files.

* Fixes.
2017-07-13 21:53:57 +00:00
Philipp Moritz
8798f4e690 fix flaky mac os x plasma store component_failure_test (#673)
Fix flaky mac os x plasma store component_failure_test.
2017-06-15 00:31:50 -07:00
Philipp Moritz
b94b4a35e0 Make the Plasma store ready for Arrow integration (#579)
* port plasma to arrow

* fixes

* refactor plasma client

* more modernization

* fix plasma manager tests

* everything compiles

* fix plasma client tests

* update plasma serialization tests

* fix plasma manager tests

* fix bug

* updates

* fix bug

* fix tests

* fix rebase

* address comments

* fix travis valgrind build

* fix linting

* fix include order again

* fix linting

* address comments
2017-05-31 16:24:23 -07:00
Robert Nishihara
0eac3ccdd0 Reduce verbosity of component_failures_test.py. (#440) 2017-04-07 23:05:29 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Robert Nishihara
303d0fed3e Prevent plasma store and manager from dying when a client dies. (#203)
* Prevent plasma store and manager from dying when a worker dies.

* Check errno inside of warn_if_sigpipe. Passing in errno doesn't work because the arguments to warn_if_sigpipe can be evaluated out of order.
2017-01-17 20:34:31 -08:00