hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Simon Mo	ce0885a897	[Serve] UI Improvements (#7569 )	2020-03-16 22:23:16 -07:00
mehrdadn	a0700e2f86	Change /tmp to platform-specific temporary directory (#7529 )	2020-03-16 18:10:14 -07:00
Eric Liang	797e6cfc2a	[rllib][tune] fix some nans (#7611 )	2020-03-16 11:19:58 -07:00
ijrsvt	46953c53b1	Cleanup Plasma Async Callback (#7452 )	2020-03-16 10:12:44 -07:00
Simon Mo	45ce40e5d4	Disable Travis Disk Cache (#7612 ) There are some file sizes and memory issue with bazel disk cache we will disable the cache and use remote cache exclusively for now	2020-03-16 00:18:01 -07:00
Scott Graham	37e4d29f87	[autoscaler] Adding Azure Support (#7080 ) * adding directory and node_provider entry for azure autoscaler * adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating * adding todos and switching to auth file for service principal authentication * adding role / scope to service principal * resolving issues with app credentials * adding retry for setting service principal role * typo and adding retry to nic creation * adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing * linting * updating cleanup and fixing bugs * adding directory and node_provider entry for azure autoscaler * adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating * adding todos and switching to auth file for service principal authentication * adding role / scope to service principal * resolving issues with app credentials * adding retry for setting service principal role * typo and adding retry to nic creation * adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing * linting * updating cleanup and fixing bugs * minor fixes * first working version :) * added tag support * added msi identity intermediate * enable MSI through user managed identity * updated schema * extend yaml schema remove service principal code add re-use of managed user identity * fix rg_id * fix logging * replace manual cluster yaml validation with json schema - improved error message - support for intellisense in VSCode (or other IDEs) * run linting * updating yaml configs and formatting * updating yaml configs and formatting * typo in example config * pulling default config from example-full * resetting min, init worker prop * adding docs for azure autoscaler and fixing status * add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment * fix for default subscription in azure node provider * vm dev image build * minor change * keeping example-full.yaml in autoscaler/azure, updating azure example config * linting azure config * extending retries on azure config * lint * support for internal ips, fix to azure docs, and new azure gpu example config * linting * Update python/ray/autoscaler/azure/node_provider.py Co-Authored-By: Richard Liaw <rliaw@berkeley.edu> * revert_this * remove_schema * updating configs and removing ssh keygen, tweak azure node provider terminate * minor tweaks Co-authored-by: Markus Cozowicz <marcozo@microsoft.com> Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-15 14:48:27 -07:00
Simon Mo	3f1fcaa024	Blocking ray.get/wait inside async context will warn instead of error (#7262 )	2020-03-14 22:02:30 -07:00
fangfengbin	6b37be9677	[GCS]Add job id when operating gcs table (#7592 )	2020-03-15 12:04:04 +08:00
Kai Yang	630e48967d	[Java] Allow passing internal config from raylet to Java worker (#7532 )	2020-03-15 12:03:38 +08:00
mehrdadn	a87199d240	Fix cyclic dependency between ray/util and ray/common (#7581 ) * Fix cyclic dependency Headers in ray/util should not depend on those in ray/common * Move random generations to ray/common/test_util.h * Add license header Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2020-03-14 12:44:53 -07:00
tison	ffeab5d2bf	Support configurable python executable in format.sh (#7513 )	2020-03-14 12:27:41 -07:00
Eric Liang	dd70720578	[rllib] Rename sample_batch_size => rollout_fragment_length (#7503 ) * bulk rename * deprecation warn * update doc * update fig * line length * rename * make pytest comptaible * fix test * fi sys * rename * wip * fix more * lint * update svg * comments * lint * fix use of batch steps	2020-03-14 12:05:04 -07:00
Stephanie Wang	53549314c5	[core] Option to fallback to LRU on OutOfMemory (#7410 ) * Add a test for LRU fallback * Update error message * Upgrade arrow to master * Integrate with arrow * Revert "Bazel mirrors (#7385)" This reverts commit `44aded5272`. * Don't LRU evict * Revert "Revert "Bazel mirrors (#7385)"" This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3. * Add lru_evict flag * fix internal config * Fix * upgrade arrow * debug * Set free period in config for lru_evict, override max retries to fix test * Fix test? * fix test * Revert "debug" This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017. * fix exception str * Fix ref count test * Shorten travis test?	2020-03-14 11:28:43 -07:00
Eric Liang	52cf77f5a9	[rllib] SAC no_done_at_end should default to False (#7594 ) * update * update doc * stochastic * cleanu	2020-03-14 11:16:54 -07:00
Eric Liang	c3a8ba399f	[rllib] Enable distributed exec api for A2C, A3C, PG by default (#7580 )	2020-03-13 18:48:41 -07:00
Anthony Yu	094125cf03	[tune] Dragonfly integration ask tell nit (#7593 ) * Add sample example * Copy relevant lines of ask from inherited Optimizer * Ignore strategy * Additional changes * Add DragonflySearch for tune connector for Dragonfly * Add example and fix small errors * lint * Remove skopt references * Update example based off of Dragonfly changes * Edit example for final Dragonfly edits * Formatting and documentation edits * Add documentation and add to test pipeline * Address PR comments * Fix Jenkins test * Adjust Dragonfly to PR#7366 * Lint * fix_tests * Minor changes to ordering Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-13 15:27:03 -07:00
Qing Wang	d6365c2586	[Java] Enable stress test. (#7596 )	2020-03-13 21:02:13 +08:00
Kai Yang	d6e8f47065	Add a flag to disable reconstruction for a killed actor (#7346 )	2020-03-13 19:10:21 +08:00
Qing Wang	575c89cf47	[Java] Pass large object by reference (#7595 )	2020-03-13 18:38:03 +08:00
Sven Mika	552cfb37ea	[RLlib] Fix bugs and speed up SegmentTree	2020-03-13 01:03:07 -07:00
Ujval Misra	6022eb53c4	[tune] Use newest checkpoint in normal operation (#7563 ) * Use persistent checkpoint for failures * Fix test * Add unpause test * move test * Fix tests * remove debug statement * Mark test as flaky	2020-03-12 22:21:42 -07:00
Qing Wang	f4656d8cc3	[Java] Enable direct call by default. (#7408 ) * WIP * Address comments. * Linting * Fix * Fix * Fix test * Fix * Fix single process ci * Fix ut * Update java/test/src/main/java/org/ray/api/test/PlasmaFreeTest.java * Address comments * Fix linting * Minor update comments. * Fix streaming CI	2020-03-13 12:25:30 +08:00
Tianyi Chen	6993a471f1	[Streaming] Move resource-manager and scheduler to master package. (#7582 )	2020-03-13 12:24:37 +08:00
micafan	cc91ed57dc	[core] Fix losing task state when giving up forward task. (#7525 ) * fix NodeManager::Forward task bug on error * fix lint * revert spillback task forward	2020-03-13 11:49:44 +08:00
Edward Oakes	768d0b3b3f	Allocate a buffer of 100 calls for each RPC handler (#7573 )	2020-03-12 12:05:30 -07:00
Sven Mika	f165766813	[RLlib] Bug: If trainer config `horizon` is provided, should try to increase env steps to that value. (#7531 )	2020-03-12 11:03:37 -07:00
Sven Mika	80d314ae5e	[RLlib] Add all agents to `rllib rollout` tests. (#7534 )	2020-03-12 11:02:51 -07:00
ZhuSenlin	b663bc6d67	Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166 )	2020-03-12 22:13:56 +08:00
fangfengbin	428fb79b27	Fix streaming compile bug (#7577 ) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>	2020-03-12 17:26:45 +08:00
Eric Liang	f5d12a958b	[rllib] Port Ape-X to distributed execution API (#7497 )	2020-03-12 00:54:08 -07:00
fangfengbin	4c834b9d68	Fix the issue that gcs service client ignores error status code (#7539 ) * add gcs reply status * rebase master * use macro to simplify * convert status in gcs rpc client * define a Status message in probobuf Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>	2020-03-12 15:08:29 +08:00
Sven Mika	20ef4a8603	[RLlib] Cleanup/unify all test cases. (#7533 )	2020-03-11 20:39:47 -07:00
Sven Mika	dded5b6d22	[RLlib] ES `env_config` is not a EnvContext object (e.g. does not contain `worker_index`). (#7560 )	2020-03-11 20:33:20 -07:00
Sven Mika	bc120730e5	[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning (#7556 )	2020-03-11 20:31:27 -07:00
Kai Yang	932a749fa9	Fix the `java_worker_options` parameter (#7537 ) * fix Java CI * Minor fix * move json.loads out of build_java_worker_command * lint * fix cross language test	2020-03-12 10:44:23 +08:00
Markus Cozowicz	ba1b081477	Azure Portal cluster deployment | Support spot instances (#7558 ) * added priority option * added head node priority * upgrade api version	2020-03-11 18:46:11 -07:00
Simon Mo	31d63d3ca7	Fix global state actors() call (#7567 )	2020-03-11 16:59:50 -07:00
Richard Liaw	b38ed4be71	[raysgd] Fix More Docs (#7565 )	2020-03-11 14:17:47 -07:00
Richard Liaw	d046faeb9c	[sgd] Readme fix (#7564 ) * readme fix * replicas	2020-03-11 13:40:18 -07:00
Richard Liaw	b70f31339c	[sgd] Benchmark Fixes (#7553 ) * fix * fix	2020-03-11 13:08:27 -07:00
Markus Cozowicz	ea99063c10	added json schema to setup.py (#7554 )	2020-03-11 09:53:21 -07:00
mehrdadn	3b9caa98ba	Fix fate-sharing warning (#7545 ) * Fix kernel_fate_sharing being None instead of False * Remove fate-sharing warning Co-authored-by: Mehrdad <noreply@github.com>	2020-03-11 08:27:54 -07:00
Richard Liaw	fbac256982	[sgd] Add benchmarks (#7454 ) * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * save * failures * fixes * trainer * run test * operator * code * op * ok done * operator * sgd test fixes * ok * trainer * format * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update doc/source/raysgd/raysgd_pytorch.rst * docstring * dcgan * doc * commits * nit * testing * revert * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * benchmarks * rename * remove some args * better metrics output * fix up the benchmark * benchmark-yaml * horovod-benchmark * benchmarks * Remove benchmark code for cleanups * benchmark-code * nits * benchmark yamls * benchmark yaml * ok * ok * ok * benchmark * nit * finish_bench * makedatacreator * relax * metrics * autosetsampler * profile * movements * OK * smoothen * fix * nitdocs * loss * envflag * comments * nit * format * visible * images * move_images * fix * rernder * rrender * rest * multgpu * fix * nit * finish * extrra * setup * revert Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Maksim Smolin <maximsmol@gmail.com>	2020-03-11 01:09:08 -07:00
Markus Cozowicz	49439611f1	[autoscaler] Replace cluster yaml validation with json schema v… (#7261 ) * replace manual cluster yaml validation with json schema - improved error message - support for intellisense in VSCode (or other IDEs) - run linting - moved schema to ray/autoscaler - fixed typo - remove importlib dependency * Update python/ray/autoscaler/autoscaler.py * read * restrict allowed properties * added unit test for invalid yaml added ray[test] package (remove pytest from default dependencies) * updated autoscaler test to use ValidationError exception * add missing dependency * added pytest * replace manual cluster yaml validation with json schema - improved error message - support for intellisense in VSCode (or other IDEs) - run linting - moved schema to ray/autoscaler - fixed typo - remove importlib dependency * Update python/ray/autoscaler/autoscaler.py * read * restrict allowed properties * added unit test for invalid yaml added ray[test] package (remove pytest from default dependencies) * updated autoscaler test to use ValidationError exception * add missing dependency * added pytest * removed parameterized dependency reverted ray[test] intro * removed parameterized * fix_tests * format Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-10 18:58:55 -07:00
Richard Liaw	6163b21458	[raysgd] Better user errors! (#7546 ) * format * callable * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * data * torchtrainer * num_rep Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2020-03-10 18:58:19 -07:00
Edward Oakes	7b609ca211	Remove instances of 'raise Exception' (#7523 )	2020-03-10 17:51:22 -07:00
Stephanie Wang	fdb528514b	[core] Ref counting for actor handles (#7434 ) * tmp * Move Exit handler into CoreWorker, exit once owner's ref count goes to 0 * fix build * Remove __ray_terminate__ and add test case for distributed ref counting * lint * Remove unused * Fixes for detached actor, duplicate actor handles * Remove unused * Remove creation return ID * Remove ObjectIDs from python, set references in CoreWorker * Fix crash * Fix memory crash * Fix tests * fix * fixes * fix tests * fix java build * fix build * fix * check status * check status	2020-03-10 17:45:07 -07:00
Edward Oakes	119a303ea0	Remove static concurrency limit from gRPC server (#7544 )	2020-03-10 16:27:02 -07:00
Edward Oakes	dbbf0c0e70	Add Apache 2 license to C++ files (#7520 )	2020-03-10 16:07:17 -07:00
Eric Liang	be48e1964b	[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety (#7504 ) * fix sched * lintc * lint * fix * add unit test * fix * format * fix test * fix test	2020-03-10 11:14:14 -07:00

4269 commits