Commit graph

1715 commits

Author SHA1 Message Date
Edward Oakes
c73fdb7425
Ignore errors in ObjectID.__dealloc__ (#5997) 2019-10-24 16:48:47 -07:00
Philipp Moritz
09d05bb3fa
Reduce actor submission python overhead (#5949) 2019-10-23 00:11:32 -07:00
Edward Oakes
02931e08f3
[core worker] Python core worker task execution (#5783)
Executes tasks via the event loop in the C++ core worker. Also properly handles signals (including KeyboardInterrupt), so ctrl-C in a python interactive shell works now (if connecting to an existing cluster).
2019-10-22 20:15:59 -07:00
Siyuan (Ryans) Zhuang
95241f6686
Fix the incorrect serialization behavior with pickle (#5960) 2019-10-22 18:08:36 -07:00
Richard Liaw
81dd0dfb0a
[tune] fix conditional identifier (#5971)
* fix conditional identifier

* fix

* doc
2019-10-22 02:00:49 -07:00
Richard Liaw
252a5d13ed
[sgd/tune][minor] more tf ports (#5953) 2019-10-21 16:46:16 -07:00
Mitchell Stern
235dec8aa3
[Dashboard] Remove token authentication from dashboard (#5888) 2019-10-21 12:48:48 -07:00
Richard Liaw
26a724c5e6
[core] Support kwargs and positionals in Ray remote calls (#5606) 2019-10-20 22:40:54 -07:00
Edward Oakes
fc56872012
Send active object IDs to the raylet (#5803)
* Send active object IDs to the raylet

* comment

* comments

* dedup

* signed int in config

* comments

* Remove object ID from monitor

* Fix test

* re-add check

* fix cast

* check if core worker

* Add comment

* Reservoir sampling

* Fix lint

* Pointer return

* tmp

* Fix merge

* Initialize object ids properly

* Fix lint
2019-10-20 22:05:28 -07:00
Simon Mo
6b36ef1138
[Serve] Ensure strict traffic splitting (#5929)
* [Serve] Ensure strict traffic splitting

* Fix test
2019-10-20 20:18:14 -07:00
Stephanie Wang
bc4a0de4da
Fix multiple drivers for named actors and add test (#5956) 2019-10-20 16:04:21 -07:00
Richard Liaw
74852c80cb
[docs] Improve more serialization Errors (#5658) 2019-10-20 14:06:00 -07:00
Richard Liaw
91acecc9f9
[tune][minor] gpu warning (#5948)
* gpu

* formaat

* defaults

* format_and_check

* better registration

* fix

* fix

* trial

* foramt

* tune
2019-10-19 17:09:48 -07:00
Philipp Moritz
d23696de17
Introduce flag to use pickle for serialization (#5805) 2019-10-18 22:29:36 -07:00
Philipp Moritz
29eee7f970
Forward multiple ports for autoscaler (#5893) 2019-10-18 16:50:46 -07:00
Richard Liaw
48ba484640
[tune] Test TF2.0, TF1.14, TF1.12 Tensorboard support (#5931) 2019-10-18 13:50:42 -07:00
Stephanie Wang
697f765efc
Refactor CoreWorker to remove TaskInterface (#5924)
* Remove TaskInterface

* Remove Status return value

* Remove CActorHandle, some return values, TaskSubmitter

* lint

* doc

* doc

* fix build

* lint

* Return Status, guarded by annotation, fail tasks for RECONSTRUCTING actors

* fix

* move annotation

* revert

* Fix core worker test

* nits
2019-10-18 00:03:57 -04:00
Stephanie Wang
3ac8592dcf
Remove actor handle IDs (#5889)
* Remove actor handle ID from main ActorHandle constructor

* Set the actor caller ID when calling submit task instead of in the actor handle

* Remove ActorHandle::Fork, remove actor handle ID from protobuf

* Make inner actor handle const, remove new_actor_handles

* Move caller ID into the common task spec, start refactoring raylet

* Some fixes for forking actor handles

* Store ActorHandle state in CoreWorker, only expose actor ID to Python

* Remove some unused fields

* lint

* doc

* fix merge

* Remove ActorHandleID from python/cpp

* doc

* Fix core worker test

* Move actor table subscription to CoreWorker, reset actor handles on actor failure

* lint

* Remove GCS client from direct actor

* fix tests

* Fix

* Fix tests for raylet codepath

* Fix local mode

* Fix multithreaded test

* Fix AsyncSubscribe issue...

* doc

* fix serve

* Revert bazel
2019-10-17 12:36:34 -04:00
Philipp Moritz
32b2907457
Update max resource label and give better error message (#5916) 2019-10-16 22:37:01 -07:00
Peter Schafhalter
6c11b534c8
[Autoscaler] Update AWS Deep Learning AMI to version 24.3 (#5932) 2019-10-16 16:50:54 -07:00
Richard Liaw
9f23620412
[tune] tf2.0 mnist example (#5898)
* tfmnistexample

* tfmnist

* add_to_ci

* format

* exampledownlaod

* fix
2019-10-15 22:25:01 -07:00
Eric Liang
6843a01a7f
Automatically create custom node id resource (#5882)
* node id

* comment

* comments

* fix tests
2019-10-15 21:31:11 -07:00
Richard Liaw
c52bb0621d
[tune] Support TF2.0 on Keras Callback (#5912) 2019-10-15 10:49:50 -07:00
Eric Liang
69d5c1b53a
remove evil redirects (#5919) 2019-10-14 19:41:04 -07:00
Camille Couturier
320cba313f
[tune] Explicitly set scheduler in run() (#5871)
* Explicitly set scheduler in run()

* Better formatting/indentation (after running format.sh)

* Remove accidental paste in parameters definitions.

* format
2019-10-14 15:44:59 -07:00
Philipp Moritz
8fd23c0c3f
Add back TensorFlow test (#5885) 2019-10-14 11:26:02 -07:00
Richard Liaw
20c0cdee4f
[autoscaler] Worker-Head termination + Better Scale-up message (#5909) 2019-10-14 10:37:50 -07:00
Edward Oakes
abbfe7392f
Bump dev version to 0.8.0.dev6 (#5906) 2019-10-14 11:36:13 +01:00
Richard Liaw
1650f7b174
[tune] Remove TF MNIST example + add TrialRunner hook to execut… (#5868)
* remove test

* add trial runner

* remvoerestore

* Remove other mnist examples

* tunetest

* revert

* v1

* Revert "v1"

This reverts commit c8bddaf2db7a8270c43c02021cac0e75df15ed20.

* Revert "revert"

This reverts commit b58f56884a0c288d3a6f997d149ab4d496ddd7a3.

* errors

* format
2019-10-13 20:33:56 -07:00
Richard Liaw
52e5c9b22d
[tune] CPU-Only Head Node support (#5900)
* trialqueue

* add tests
2019-10-13 20:31:42 -07:00
Eric Liang
2cbc67f3d5
Fix test_dying_worker_get (#5908) 2019-10-13 18:06:28 -07:00
Richard Liaw
0f24509c30
[autoscaler] uptime redirect fix (#5907)
* small change

* comment
2019-10-13 23:25:15 +01:00
Edward Oakes
6eaa8e31fa
[autoscaler] Revert to double-spawning updater threads (#5903)
* [autoscaler] Revert to double-spawning threads

* Use log prefix

* add comment
2019-10-13 20:00:06 +01:00
Simon Mo
97a786cf11
[Serve] Remove handle passing in tail recursion (#5894)
* Remove handle pass in tail recursion

* Quick fix

* Fix worker timeout issue
2019-10-12 20:13:20 -07:00
Eric Liang
0e8c3c0346
Don't wrap RayError with RayTaskError (#5870) 2019-10-11 11:00:08 -07:00
Edward Oakes
779f91523b
[autoscaler] Fix quoting (#5891) 2019-10-11 00:40:26 -07:00
Simon Mo
4b99cb429e
[Serve] Hotfix: Fix actor handle hashing in metric monitoring (#5886) 2019-10-11 00:31:42 -07:00
Robert Nishihara
523c764c25
Python 2 compatibility. (#5887) 2019-10-10 19:09:25 -07:00
Eric Liang
c3b2ae26c5
Fix str of RayTaskError (#5878)
* fix key error

* fix
2019-10-10 16:53:18 -07:00
Mitchell Stern
195ca43e9c
[Dashboard] Improve handling of logs and errors in dashboard backend (#5857)
* Improve handling of logs and errors in dashboard backend

* Update nested dict comprehension for clarity
2019-10-10 11:59:54 -07:00
Eric Liang
1a8ac3db46
Implement fair task queueing to prevent task starvation (#5851)
* initial commit

* lint

* clarify

* add feature flag

* comment

* add timeout to test

* fix print

* comment

* use id for scheduling class

* lint

* dad warn

* flake
2019-10-08 21:04:25 -07:00
Richard Liaw
1181924077
[tune][minor] formatting examples, fix travis (#5869)
* formatting

* formatting
2019-10-08 17:58:43 -07:00
Ujval Misra
a851d7eb87
[tune] Readable trial progress output (#5822)
* Cleaner, tabulated progress output.

* Minor HTML changes, trial ID instead of name

* Revert basic variant changes

* Cleanup, address richard's comments, add progress_reporter.py

* Add tabulate dependency

* Added more info to table, auto-hide columns with no data.

* lint

* Address comments

* Replace experiment tag w/ trial ID

* Fixed tests.

* Fixed test

* Added requirement

* Fix formatting
2019-10-08 16:38:39 -07:00
Philipp Moritz
24b79fd0a6
temporarily remove tensorflow test (#5866) 2019-10-08 14:13:54 -07:00
Edward Oakes
42dd0fae96
Fix actor ID collision in local mode (#5863)
* Fixed local mode actor id

* Update python/ray/actor.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Added hyphen to match comments

* Added tests to test_local_mode

* Helloworld

* Better test naming

* lint
2019-10-08 13:07:42 -07:00
Ujval Misra
375852af23
[tune] Check node liveness before result fetch (#5844)
* Check if trial's node is alive before trying to fetch result

* Added function for failed trials to trial_executor interface

* Address comments, add test.
2019-10-08 11:41:01 -07:00
waldroje
054583ffe6
[tune] MedianStopping on result (#5402)
* added class median_stopping_result to schedulers and updated __init__

* Dicts flatten and combine schedulers.

MedianStoppingRule is now combined with MedianStoppingResult; I think
the functionality is essentially the same so there's no need to
duplicate.

Dict flattening was already taken care of in a separate PR, so I've
reverted that.

* lint

* revert

* remove time sharing and simplify state

* fix

* fixtests

* added class median_stopping_result to schedulers and updated __init__

* update property names and types to reflect suggestions by ray developers, merged get_median_result and get_best_result into a single method to eliminate duplicate steps, added resource check on PAUSE condition, modified utility function to use updated properties

* updated tests for median_stopping_result in separate file

* remove stray characters from previous merge conflict

* reformatted and cleaned up dependencies from running code format and linting

* update scheduler to coordinate eval interval

* modify median_stopping_result to synchronize result evaluation at regular intervals, driven by least common interval

* add some logging info to median_result

* add new scheduler, SyncMedianStoppingResult, which evaluates and stops trials in a synchronous fashion

* Cleanup median_stopping_rule

- remove eval_interval
- pause trials with insufficient samples if there are other waiting trials
- compute score only for trials that have reached result_time

* Remove extraneous classes

* Fix median stopping rule tests

* Added min_time_slice flag to reduce potential checkpointing cost

* Only compute mean after grace

* Relegate logging to debug mode
2019-10-08 11:40:41 -07:00
Philipp Moritz
785670bc18
Fix class attributes and methods for actor classes (#5802) 2019-10-07 23:56:07 -07:00
Edward Oakes
08e4e3a153
[core worker] Submit Python actor tasks through core worker (#5750)
* Submit actor tasks through core worker

* Fix java

* add comment

* Remove task builder

* Check negative

* Increase -> Increment

* pass by reference

* fix signal

* Clean up c++ actor handle

* more cleanup

* Clean up headers

* Fix unique_ptr construction

* Fix java

* Move profiling to c++

* dedup

* fix error

* comments

* fix java

* Fix tests

* wait for actor to exit

* Start after constructor

* ignore java build

* fix comment

* always init logging

* Fix logging

* fix logging issue

* shared_ptr for profiler

* DEBUG -> WARNING

* fix killed_ init

* Fix flaky checkpointing tests

* -v flag for tune tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix test exception info

* Fix import

* fix build

* Fix test

* shared_ptr
2019-10-07 15:42:19 -07:00
Simon Mo
9bb3633cd9
[Serve] Implement metric interface (#5852)
* Implement metric interface

* Address comment: made actor_handles a dict

* Fix iteration

* Lint

* Mark lightweight actors as num_cpus=0 to prevent resource starvation

* Be more explicit about the readiness condition

* Make task_runner non-blocking

* Lint
2019-10-07 09:29:26 -07:00