2221 commits

Author SHA1 Message Date
Maksim Smolin
e95455b7d7
[RaySGD] Add tqdm logging to TorchTrainer (#7588)
* Update issue templates

* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* Checkpoint the basics

* End of day checkpoint

* Checkpoint log-to-head implementation

* Checkpoint

* Add actor-based batch log reporting, currently segfaults

* Work around progress segfault

* Fix some stuff in quicktorch

* Make things more customizable

* Quality of life fixes

* More quality of life

* Move tqdm logic to training_operator

* Update examples

* Fix some minor bugs

* Fix merge

* Fix small things, add pbar to dcgan

* Run format.sh

* Fix missing epoch number for batch pbar

* Address PR comments

* Fix float is not subscriptable

* Add train_loss to pbar by default

* Isolate tqdm code into a handler system

* Format

* Remove the batch_logs_reporter from distributed runner as well

* Check if the train_loss is available before using it

* Enable tqdm in the dcgan example

* Fix a crash in no-handler trainers

* Fix

* Allow not calling set_reporters for tests

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-24 23:43:56 -07:00
Richard Liaw
54a892bb84
[tune] Cancel Experiment via Client (#7719)
* init cancel

* testing

* Update python/ray/tune/tests/test_tune_server.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Apply suggestions from code review

* finished

* set_finished

Co-authored-by: ijrsvt <ian.rodney@gmail.com>
2020-03-24 20:30:12 -07:00
Simon Mo
a519b4f2a9
[Serve] Enhancement in HTTP Methods and Multi-route support (#7709) 2020-03-24 20:25:05 -07:00
Xianyang Liu
cc0490b55b
Several small fixes for function_manager (#7685) 2020-03-24 14:28:15 -07:00
fangfengbin
bf866de6fd
Enable GCS Service by default (#7541) 2020-03-24 14:20:23 +08:00
mehrdadn
b4030cdbbe
File HANDLE/descriptor translation layer for Windows (#7657)
* Use TCP sockets on Windows with custom HANDLE <-> FD translation layer

* Get Plasma working on Windows

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-23 21:08:25 -07:00
Robert Nishihara
2b80310e6f
Remove setup.py dependence on packaging. (#7714) 2020-03-23 16:21:17 -07:00
Edward Oakes
9318b29f5e
Remove is_direct logic from the raylet (#7698) 2020-03-23 17:09:35 -05:00
Stephanie Wang
7f38cc1d03
Debug statements and increase timeout for test array (#7713) 2020-03-23 13:02:14 -07:00
aannadi
8adc84ccb9
[Dashboard] Add sorted columns and TensorBoard to Tune tab (#7140) 2020-03-23 12:30:51 -07:00
Sven Mika
1138f2ebed
[RLlib] Issue 7046 cannot restore keras model from h5 file. (#7482) 2020-03-23 12:19:30 -07:00
Robert Nishihara
ee8c9ff732
Remove six and cloudpickle from setup.py. (#7694) 2020-03-23 11:42:05 -07:00
Robert Nishihara
1a0c9228d0
Remove pytest from setup.py and other minor changes. (#7700) 2020-03-23 08:46:56 -07:00
Simon Mo
afad0ed085
[Serve] Add async, multi methods support for serve actors (#7682) 2020-03-23 00:45:26 -07:00
Robert Nishihara
8b4c2b7e88
Remove unnecessary handling of setproctitle and psutil. (#7702) 2020-03-22 22:06:42 -07:00
Robert Nishihara
4d722bf003
Remove dependence on funcsigs. (#7701) 2020-03-22 21:37:24 -07:00
Edward Oakes
8b4f5a9431
Remove non-direct-call code from core worker (#7625) 2020-03-22 19:20:08 -05:00
Richard Liaw
81d311031b
[tune] Update API Reference Page (#7671)
* widerdocs

* init

* docs

* fix

* moveit

* mix

* better_docs

* remove

* Apply suggestions from code review

Co-Authored-By: Sven Mika <sven@anyscale.io>

Co-authored-by: Sven Mika <sven@anyscale.io>
2020-03-22 16:42:20 -07:00
Eric Liang
288933ec6b
[rllib] Fix shared metrics context in parallel iterators (#7666)
* debug

* build

* update

* wip

* wip

* update

* recursive sync

* comment

* stream

* fix

* Update .travis.yml
2020-03-22 14:15:01 -07:00
Eric Liang
86f89fc3b3
[tune] Higher timeout for progress reporter test (#7679)
* wip

* medium size
2020-03-22 13:47:08 -07:00
Stephanie Wang
ba86a02b37
[core] Revert lineage pinning (#7499) (#7692)
* Revert "fix (#7681)"

This reverts commit 6a12a31b2e.

* Revert "[core] Pin lineage of plasma objects that are still in scope (#7499)"

This reverts commit 014929e658.
2020-03-21 18:35:43 -07:00
Simon Mo
89d959fd6a
Stop gap solution for cython functions breaking in memory monitor (#7687) 2020-03-21 15:16:12 -07:00
Zhijun Fu
a7a5d172b1
[core] fix bug that actor tasks from reconstructed actor are ignored by scheduling queue (#7637) 2020-03-21 13:05:24 +08:00
Edward Oakes
58dc70f90e
[minor] Remove get_global_worker(), RuntimeContext (#7638) 2020-03-20 15:45:29 -05:00
Stephanie Wang
014929e658
[core] Pin lineage of plasma objects that are still in scope (#7499)
* Add a lineage_ref_count to References

* Refactor TaskManager to store TaskEntry as a struct

* Refactor to fix deadlock between TaskManager and ReferenceCounter
Add references to task specs

* Pin TaskEntries and References in the lineage of any ObjectIDs in scope

* Fix deadlock, convert num_plasma_returns to a set of object IDs

* fix unit tests

* Feature flag

* Do not release lineage for objects that were promoted to plasma

* fix build

* fix build

* Remove num executions

* Simplify num return values

* Remove unused

* doc

* Set num returns

* Move lineage pinning flag to ReferenceCounter

* comments

* Fixes

* Remove irrelevant test (replaced by ref counting tests)
2020-03-20 10:56:43 -07:00
ZhuSenlin
7d08b418fc
fix test_worker_stats (#7655)
* fix test_worker_stats

* fix lint error

* fix lint error

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-03-20 14:53:40 +08:00
mehrdadn
e69664b74b
Miscellaneous Windows compatibility bugfixes (#7658)
* Windows compatibility bug fixes

* Use WSASend/WSARecv as WSASendMsg/WSARecvMsg do not work with TCP sockets

* Clean up some TODOs

* Fix duplicate compilations

* RedisAsioClient boost::asio::error::connection_reset

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-19 19:32:53 -07:00
Eric Liang
5a112ab212
Remove object store memory cap (#7654) 2020-03-19 16:00:30 -07:00
Clark Zinzow
c37f6e745a
Remove duplicate jsonschema from setup.py (#7665) 2020-03-19 13:12:47 -07:00
Stephanie Wang
b499100a88
Enable distributed ref counting by default (#7628)
* enable

* Turn on eager eviction

* Shorten tests and drain ReferenceCounter

* Don't force kill actor handles that have gone out of scope, lint

* Fix locks

* Cleanup Plasma Async Callback (#7452)

* [rllib][tune] fix some nans (#7611)

* Change /tmp to platform-specific temporary directory (#7529)

* [Serve] UI Improvements (#7569)

* bugfix about test_dynres.py (#7615)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* Java call Python actor method use actor.call (#7614)

* bug fix about usage of absl::flat_hash_map::erase and absl::flat_hash_set::erase (#7633)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [Java] Make both `RayActor` and `RayPyActor` inheriting from `BaseActor` (#7462)

* [Java] Fix the issue that the cached value in `RayObject` is serialized (#7613)

* Add failure tests to test_reference_counting (#7400)

* Fix typo in asyncio documentation (#7602)

* Fix segfault

* debug

* Force kill actor

* Fix test
2020-03-18 22:39:21 -07:00
fangfengbin
fca9dc73e1
Fix test_raylet_pending_tasks test case failed (#7636) 2020-03-19 11:09:38 +08:00
Seung Hyeon, Kim
ee49f4a875
[tune] Fix an example for _Brackets of async hyperband scheduler (#7538) 2020-03-18 19:06:32 -07:00
Richard Liaw
ea10cd212c
[tune] add accessible trial_info (#7378)
* add accessible trial_info

* trial name and info

* doc

* fix
gp

* Update doc/source/tune-package-ref.rst

* Apply suggestions from code review

* fix

* trial

* fixtest

* testfix
2020-03-17 23:44:18 -07:00
Eric Liang
745b9d643d
First pass at ray memory command for memory debugging (#7589) 2020-03-17 20:45:07 -07:00
Edward Oakes
c1b0f9ccdf
Add failure tests to test_reference_counting (#7400) 2020-03-17 10:30:21 -05:00
fyrestone
7697ea2be2
Java call Python actor method use actor.call (#7614) 2020-03-17 14:52:43 +08:00
Simon Mo
ce0885a897
[Serve] UI Improvements (#7569) 2020-03-16 22:23:16 -07:00
mehrdadn
a0700e2f86
Change /tmp to platform-specific temporary directory (#7529) 2020-03-16 18:10:14 -07:00
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans (#7611) 2020-03-16 11:19:58 -07:00
ijrsvt
46953c53b1
Cleanup Plasma Async Callback (#7452) 2020-03-16 10:12:44 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support (#7080)
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Simon Mo
3f1fcaa024
Blocking ray.get/wait inside async context will warn instead of error (#7262) 2020-03-14 22:02:30 -07:00
Kai Yang
630e48967d
[Java] Allow passing internal config from raylet to Java worker (#7532) 2020-03-15 12:03:38 +08:00
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory (#7410)
* Add a test for LRU fallback

* Update error message

* Upgrade arrow to master

* Integrate with arrow

* Revert "Bazel mirrors (#7385)"

This reverts commit 44aded5272.

* Don't LRU evict

* Revert "Revert "Bazel mirrors (#7385)""

This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.

* Add lru_evict flag

* fix internal config

* Fix

* upgrade arrow

* debug

* Set free period in config for lru_evict, override max retries to fix test

* Fix test?

* fix test

* Revert "debug"

This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.

* fix exception str

* Fix ref count test

* Shorten travis test?
2020-03-14 11:28:43 -07:00
Anthony Yu
094125cf03
[tune] Dragonfly integration ask tell nit (#7593)
* Add sample example

* Copy relevant lines of ask from inherited Optimizer

* Ignore strategy

* Additional changes

* Add DragonflySearch for tune connector for Dragonfly

* Add example and fix small errors

* lint

* Remove skopt references

* Update example based off of Dragonfly changes

* Edit example for final Dragonfly edits

* Formatting and documentation edits

* Add documentation and add to test pipeline

* Address PR comments

* Fix Jenkins test

* Adjust Dragonfly to PR#7366

* Lint

* fix_tests

* Minor changes to ordering

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-13 15:27:03 -07:00
Kai Yang
d6e8f47065
Add a flag to disable reconstruction for a killed actor (#7346) 2020-03-13 19:10:21 +08:00
Ujval Misra
6022eb53c4
[tune] Use newest checkpoint in normal operation (#7563)
* Use persistent checkpoint for failures

* Fix test

* Add unpause test

* move test

* Fix tests

* remove debug statement

* Mark test as flaky
2020-03-12 22:21:42 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166) 2020-03-12 22:13:56 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API (#7497) 2020-03-12 00:54:08 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter (#7537)
* fix Java CI

* Minor fix

* move json.loads out of build_java_worker_command

* lint

* fix cross language test
2020-03-12 10:44:23 +08:00