hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Maksim Smolin	e95455b7d7	[RaySGD] Add tqdm logging to TorchTrainer (#7588 ) * Update issue templates * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * Checkpoint the basics * End of day checkpoint * Checkpoint log-to-head implementation * Checkpoint * Add actor-based batch log reporting, currently segfaults * Work around progress segfault * Fix some stuff in quicktorch * Make things more customizable * Quality of life fixes * More quality of life * Move tqdm logic to training_operator * Update examples * Fix some minor bugs * Fix merge * Fix small things, add pbar to dcgan * Run format.sh * Fix missing epoch number for batch pbar * Address PR comments * Fix float is not subscriptable * Add train_loss to pbar by default * Isolate tqdm code into a handler system * Format * Remove the batch_logs_reporter from distributed runner as well * Check if the train_loss is avaialbale before using it * Enable tqdm in the dcgan example * Fix a crash in no-handler trainers * Fix * Allow not calling set_reporters for tests Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2020-03-24 23:43:56 -07:00
Richard Liaw	54a892bb84	[tune] Cancel Experiment via Client (#7719 ) * init cancel * testing * Update python/ray/tune/tests/test_tune_server.py Co-Authored-By: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Apply suggestions from code review * finished * set_finished Co-authored-by: ijrsvt <ian.rodney@gmail.com>	2020-03-24 20:30:12 -07:00
Simon Mo	a519b4f2a9	[Serve] Enhancement in HTTP Methods and Multi-route support (#7709 )	2020-03-24 20:25:05 -07:00
Xianyang Liu	cc0490b55b	Several small fixes for function_manager (#7685 )	2020-03-24 14:28:15 -07:00
fangfengbin	bf866de6fd	Enable GCS Service by default (#7541 )	2020-03-24 14:20:23 +08:00
mehrdadn	b4030cdbbe	File HANDLE/descriptor translation layer for Windows (#7657 ) * Use TCP sockets on Windows with custom HANDLE <-> FD translation layer * Get Plasma working on Windows Co-authored-by: Mehrdad <noreply@github.com>	2020-03-23 21:08:25 -07:00
Robert Nishihara	2b80310e6f	Remove setup.py dependence on packaging. (#7714 )	2020-03-23 16:21:17 -07:00
Edward Oakes	9318b29f5e	Remove is_direct logic from the raylet (#7698 )	2020-03-23 17:09:35 -05:00
Stephanie Wang	7f38cc1d03	Debug statements and increase timeout for test array (#7713 )	2020-03-23 13:02:14 -07:00
aannadi	8adc84ccb9	[Dashboard] Add sorted columns and TensorBoard to Tune tab (#7140 )	2020-03-23 12:30:51 -07:00
Sven Mika	1138f2ebed	[RLlib] Issue 7046 cannot restore keras model from h5 file. (#7482 )	2020-03-23 12:19:30 -07:00
Robert Nishihara	ee8c9ff732	Remove six and cloudpickle from setup.py. (#7694 )	2020-03-23 11:42:05 -07:00
Robert Nishihara	1a0c9228d0	Remove pytest from setup.py and other minor changes. (#7700 )	2020-03-23 08:46:56 -07:00
Simon Mo	afad0ed085	[Serve] Add async, multi methods support for serve actors (#7682 )	2020-03-23 00:45:26 -07:00
Robert Nishihara	8b4c2b7e88	Remove unnecessary handling of setproctitle and psutil. (#7702 )	2020-03-22 22:06:42 -07:00
Robert Nishihara	4d722bf003	Remove dependence on funcsigs. (#7701 )	2020-03-22 21:37:24 -07:00
Edward Oakes	8b4f5a9431	Remove non-direct-call code from core worker (#7625 )	2020-03-22 19:20:08 -05:00
Richard Liaw	81d311031b	[tune] Update API Reference Page (#7671 ) * widerdocs * init * docs * fix * moveit * mix * better_docs * remove * Apply suggestions from code review Co-Authored-By: Sven Mika <sven@anyscale.io> Co-authored-by: Sven Mika <sven@anyscale.io>	2020-03-22 16:42:20 -07:00
Eric Liang	288933ec6b	[rllib] Fix shared metrics context in parallel iterators (#7666 ) * debug * build * update * wip * wpi * update * recurisve sync * comment * stream * fix * Update .travis.yml	2020-03-22 14:15:01 -07:00
Eric Liang	86f89fc3b3	[tune] Higher timeout for progress reporter test (#7679 ) * wip * medium size	2020-03-22 13:47:08 -07:00
Stephanie Wang	ba86a02b37	[core] Revert lineage pinning (#7499 ) (#7692 ) * Revert "fix (#7681)" This reverts commit `6a12a31b2e`. * Revert "[core] Pin lineage of plasma objects that are still in scope (#7499)" This reverts commit `014929e658`.	2020-03-21 18:35:43 -07:00
Simon Mo	89d959fd6a	Stop gap solution for cython functions breaking in memory monitor (#7687 )	2020-03-21 15:16:12 -07:00
Zhijun Fu	a7a5d172b1	[core] fix bug that actor tasks from reconstructed actor is ignored by scheduling queue (#7637 )	2020-03-21 13:05:24 +08:00
Edward Oakes	58dc70f90e	[minor] Remove get_global_worker(), RuntimeContext (#7638 )	2020-03-20 15:45:29 -05:00
Stephanie Wang	014929e658	[core] Pin lineage of plasma objects that are still in scope (#7499 ) * Add a lineage_ref_count to References * Refactor TaskManager to store TaskEntry as a struct * Refactor to fix deadlock between TaskManager and ReferenceCounter Add references to task specs * Pin TaskEntries and References in the lineage of any ObjectIDs in scope * Fix deadlock, convert num_plasma_returns to a set of object IDs * fix unit tests * Feature flag * Do not release lineage for objects that were promoted to plasma * fix build * fix build * Remove num executions * Simplify num return values * Remove unused * doc * Set num returns * Move lineage pinning flag to ReferenceCounter * comments * Fixes * Remove irrelevant test (replaced by ref counting tests)	2020-03-20 10:56:43 -07:00
ZhuSenlin	7d08b418fc	fix test_worker_stats (#7655 ) * fix test_worker_stats * fix lint error * fix lint error Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>	2020-03-20 14:53:40 +08:00
mehrdadn	e69664b74b	Miscellaneous Windows compatibility bugfixes (#7658 ) * Windows compatibility bug fixes * Use WSASend/WSARecv as WSASendMsg/WSARecvMsg do not work with TCP sockets * Clean up some TODOs * Fix duplicate compilations * RedisAsioClient boost::asio::error::connection_reset Co-authored-by: Mehrdad <noreply@github.com>	2020-03-19 19:32:53 -07:00
Eric Liang	5a112ab212	Remove object store memory cap (#7654 )	2020-03-19 16:00:30 -07:00
Clark Zinzow	c37f6e745a	Remove duplicate jsonschema from setup.py (#7665 )	2020-03-19 13:12:47 -07:00
Stephanie Wang	b499100a88	Enable distributed ref counting by default (#7628 ) * enable * Turn on eager eviction * Shorten tests and drain ReferenceCounter * Don't force kill actor handles that have gone out of scope, lint * Fix locks * Cleanup Plasma Async Callback (#7452) * [rllib][tune] fix some nans (#7611) * Change /tmp to platform-specific temporary directory (#7529) * [Serve] UI Improvements (#7569) * bugfix about test_dynres.py (#7615) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * Java call Python actor method use actor.call (#7614) * bug fix about useage of absl::flat_hash_map::erase and absl::flat_hash_set::erase (#7633) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * [Java] Make both `RayActor` and `RayPyActor` inheriting from `BaseActor` (#7462) * [Java] Fix the issue that the cached value in `RayObject` is serialized (#7613) * Add failure tests to test_reference_counting (#7400) * Fix typo in asyncio documentation (#7602) * Fix segfault * debug * Force kill actor * Fix test	2020-03-18 22:39:21 -07:00
fangfengbin	fca9dc73e1	Fix test_raylet_pending_tasks test case failed (#7636 )	2020-03-19 11:09:38 +08:00
Seung Hyeon, Kim	ee49f4a875	[tune] Fix an example for _Brackets of async hyperband scheduler (#7538 )	2020-03-18 19:06:32 -07:00
Richard Liaw	ea10cd212c	[tune] add accessible trial_info (#7378 ) * add accessible trial_info * trial name and info * doc * fix gp * Update doc/source/tune-package-ref.rst * Apply suggestions from code review * fix * trial * fixtest * testfix	2020-03-17 23:44:18 -07:00
Eric Liang	745b9d643d	First pass at `ray memory` command for memory debugging (#7589 )	2020-03-17 20:45:07 -07:00
Edward Oakes	c1b0f9ccdf	Add failure tests to test_reference_counting (#7400 )	2020-03-17 10:30:21 -05:00
fyrestone	7697ea2be2	Java call Python actor method use actor.call (#7614 )	2020-03-17 14:52:43 +08:00
Simon Mo	ce0885a897	[Serve] UI Improvements (#7569 )	2020-03-16 22:23:16 -07:00
mehrdadn	a0700e2f86	Change /tmp to platform-specific temporary directory (#7529 )	2020-03-16 18:10:14 -07:00
Eric Liang	797e6cfc2a	[rllib][tune] fix some nans (#7611 )	2020-03-16 11:19:58 -07:00
ijrsvt	46953c53b1	Cleanup Plasma Async Callback (#7452 )	2020-03-16 10:12:44 -07:00
Scott Graham	37e4d29f87	[autoscaler] Adding Azure Support (#7080 ) * adding directory and node_provider entry for azure autoscaler * adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating * adding todos and switching to auth file for service principal authentication * adding role / scope to service principal * resolving issues with app credentials * adding retry for setting service principal role * typo and adding retry to nic creation * adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing * linting * updating cleanup and fixing bugs * adding directory and node_provider entry for azure autoscaler * adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating * adding todos and switching to auth file for service principal authentication * adding role / scope to service principal * resolving issues with app credentials * adding retry for setting service principal role * typo and adding retry to nic creation * adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing * linting * updating cleanup and fixing bugs * minor fixes * first working version :) * added tag support * added msi identity intermediate * enable MSI through user managed identity * updated schema * extend yaml schema remove service principal code add re-use of managed user identity * fix rg_id * fix logging * replace manual cluster yaml validation with json schema - improved error message - support for intellisense in VSCode (or other IDEs) * run linting * updating yaml configs and formatting * updating yaml configs and formatting * typo in example config * pulling default config from example-full * resetting min, init worker prop * adding docs for azure autoscaler and fixing status * add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment * fix for default subscription in azure node provider * vm dev image build * minor change * keeping example-full.yaml in autoscaler/azure, updating azure example config * linting azure config * extending retries on azure config * lint * support for internal ips, fix to azure docs, and new azure gpu example config * linting * Update python/ray/autoscaler/azure/node_provider.py Co-Authored-By: Richard Liaw <rliaw@berkeley.edu> * revert_this * remove_schema * updating configs and removing ssh keygen, tweak azure node provider terminate * minor tweaks Co-authored-by: Markus Cozowicz <marcozo@microsoft.com> Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-15 14:48:27 -07:00
Simon Mo	3f1fcaa024	Blocking ray.get/wait inside async context will warn instead of error (#7262 )	2020-03-14 22:02:30 -07:00
Kai Yang	630e48967d	[Java] Allow passing internal config from raylet to Java worker (#7532 )	2020-03-15 12:03:38 +08:00
Stephanie Wang	53549314c5	[core] Option to fallback to LRU on OutOfMemory (#7410 ) * Add a test for LRU fallback * Update error message * Upgrade arrow to master * Integrate with arrow * Revert "Bazel mirrors (#7385)" This reverts commit `44aded5272`. * Don't LRU evict * Revert "Revert "Bazel mirrors (#7385)"" This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3. * Add lru_evict flag * fix internal config * Fix * upgrade arrow * debug * Set free period in config for lru_evict, override max retries to fix test * Fix test? * fix test * Revert "debug" This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017. * fix exception str * Fix ref count test * Shorten travis test?	2020-03-14 11:28:43 -07:00
Anthony Yu	094125cf03	[tune] Dragonfly integration ask tell nit (#7593 ) * Add sample example * Copy relevant lines of ask from inherited Optimizer * Ignore strategy * Additional changes * Add DragonflySearch for tune connector for Dragonfly * Add example and fix small errors * lint * Remove skopt references * Update example based off of Dragonfly changes * Edit example for final Dragonfly edits * Formatting and documentation edits * Add documentation and add to test pipeline * Address PR comments * Fix Jenkins test * Adjust Dragonfly to PR#7366 * Lint * fix_tests * Minor changes to ordering Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-13 15:27:03 -07:00
Kai Yang	d6e8f47065	Add a flag to disable reconstruction for a killed actor (#7346 )	2020-03-13 19:10:21 +08:00
Ujval Misra	6022eb53c4	[tune] Use newest checkpoint in normal operation (#7563 ) * Use persistent checkpoint for failures * Fix test * Add unpause test * move test * Fix tests * remove debug statement * Mark test as flaky	2020-03-12 22:21:42 -07:00
ZhuSenlin	b663bc6d67	Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166 )	2020-03-12 22:13:56 +08:00
Eric Liang	f5d12a958b	[rllib] Port Ape-X to distributed execution API (#7497 )	2020-03-12 00:54:08 -07:00
Kai Yang	932a749fa9	Fix the `java_worker_options` parameter (#7537 ) * fix Java CI * Minor fix * move json.loads out of build_java_worker_command * lint * fix cross language test	2020-03-12 10:44:23 +08:00

2221 commits