## What do these changes do?
This option complements `min_workers` and `max_workers`. When the
cluster is first brought up (or refreshed with a subsequent
`ray up`), this number of nodes will be started.
It is a workaround for scaling issues (see the related issues) where it
can take a long time (or forever, when the head node has
`--num-cpus 0`) to scale a cluster up in response to increasing demand.
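As a rough sketch of the intended behavior (the helper and argument names below are assumptions for illustration, not Ray's actual autoscaler code), the cluster starts at the requested initial size, clamped to the existing bounds:

```python
# Illustrative only: pick the initial target node count from hypothetical
# config values, never violating the min/max bounds used for later scaling.
def initial_target_workers(min_workers: int, max_workers: int,
                           initial_workers: int) -> int:
    return max(min_workers, min(initial_workers, max_workers))

# e.g. min=2, max=20, initial=10 -> start 10 worker nodes immediately
print(initial_target_workers(min_workers=2, max_workers=20, initial_workers=10))
```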
## Related issue number
Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106
* Push a warning to all users when a large number of workers has been started.
* Add test.
* Fix bug.
* Give warning when worker starts instead of when worker registers.
* Fix
* Fix tests
* Fix warning text in pbt logger
* Allow nested mutations in pbt by recursing explore function
* Add test for nested pbt mutation
* Update pbt explore to only call custom explore on top level
* fix test
* Limit Redis max memory to 10GB/shard by default.
* Update stress tests.
* Reorganize
* Update
* Add minimum cap size for object store and redis (see the sketch after this list).
* Small test update.
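A minimal sketch of the capping logic described above, assuming a 10 GB/shard upper limit; the minimum cap constant and helper name are illustrative, not the exact values Ray uses:

```python
MAX_REDIS_SHARD_BYTES = 10 * 1024**3   # 10 GB per shard, per the change above
MIN_REDIS_SHARD_BYTES = 100 * 1024**2  # assumed minimum cap for this example

def redis_shard_memory(total_bytes: int, num_shards: int) -> int:
    # Split the budget evenly across shards, then clamp each shard so it is
    # neither unboundedly large nor uselessly small.
    per_shard = total_bytes // num_shards
    return max(MIN_REDIS_SHARD_BYTES, min(per_shard, MAX_REDIS_SHARD_BYTES))
```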
This change ensures that Ray sets up fault handlers only if they have not already been enabled by another application. Otherwise some applications could run into strange issues when using Ray, and some unit tests that use XML runners would fail.
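A minimal sketch of the guard described above, using the standard-library `faulthandler` module (the exact arguments Ray passes may differ):

```python
import faulthandler

# Only install a fault handler if no other application (e.g. a test runner
# using XML reporting) has already enabled one.
if not faulthandler.is_enabled():
    faulthandler.enable(all_threads=True)
```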
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.
Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
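A hypothetical sketch of the checkpoint-and-resume pattern this enables; the file path and function names below are illustrative, not Tune's actual API:

```python
import json
import os

CHECKPOINT_PATH = "/tmp/tune_experiment_state.json"  # assumed location

def save_experiment_state(state: dict) -> None:
    # Periodically persist global experiment state so a crashed cluster can
    # be resumed instead of restarting every trial from scratch.
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(state, f)

def maybe_resume() -> dict:
    if os.path.exists(CHECKPOINT_PATH):
        # In the real workflow the user is prompted whether to resume here,
        # which is the auto-prompting that may affect automated workflows.
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {}
```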
## What do these changes do?
1. Fix the Jenkins test failure by adding the driver ID to the Actor GCS key.
2. Move `object_manager_test.py` from Jenkins to Travis.
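As a purely hypothetical illustration of why the key needs the driver ID (the function and key layout below are not Ray's actual GCS format), namespacing by driver prevents actors from different drivers colliding under the same key:

```python
def actor_gcs_key(driver_id: str, actor_id: str) -> bytes:
    # Including the driver ID keeps actor entries from different drivers
    # distinct even if actor IDs alone were to collide.
    return f"Actor:{driver_id}:{actor_id}".encode()
```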
* Allowing multiple users to access the /tmp/ray directory at the same time
Previous sequence that caused this issue:
* User A starts ray with `ray.init` when /tmp/ray does not exist
* User B starts ray with `ray.init` and /tmp/ray now exists
User B will get a permissions error
Checking the permissions shows that /tmp/ray is mode 700
I have identified a race condition in `try_to_create_directory`
* Multiple processes try to create /tmp/ray at the same time
* chmod either silently errors or succeeds, depending on which process wins the race
Resolution: Move chmod outside of the check for whether the directory exists or not.
* Adding a try/except for users who do not own the directory
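A sketch of the resulting helper, assuming the shape of the fix described above rather than the verbatim Ray code:

```python
import os

def try_to_create_directory(directory_path: str) -> None:
    # Several processes may race to create /tmp/ray, so tolerate the
    # "already exists" case instead of treating it as an error.
    os.makedirs(directory_path, exist_ok=True)
    try:
        # The chmod now happens outside the existence check, so every process
        # that reaches this point attempts to open up the permissions.
        os.chmod(directory_path, 0o0777)
    except OSError:
        # A user who does not own the directory cannot chmod it; that is fine
        # as long as the owner has already made it world-writable.
        pass
```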
## What do these changes do?
Previously we logged a warning if the PPO configuration would waste many samples. However, this check didn't apply to long episodes in `complete_episodes` batch mode, and the amount of wasted sampling could be up to 2x in common cases.
This PR:
- Estimates the number of sampling tasks needed to avoid over-sampling (see the sketch after this list).
- Collects all sample results and never discards any. In principle this can degrade performance at large scale if certain machines are slower, so a config flag is added to re-enable the legacy behavior.
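A rough illustration of the task-count estimate; the names mirror common RLlib config keys, but the helper itself is an assumption, not the code added in this PR:

```python
import math

def estimate_num_sample_tasks(train_batch_size: int,
                              sample_batch_size: int,
                              num_workers: int) -> int:
    # Each round of sampling launches one task per worker, producing
    # sample_batch_size steps each.
    workers = max(num_workers, 1)
    samples_per_round = sample_batch_size * workers
    # Round up so we never under-sample; previously whole extra rounds could
    # be requested, pushing the overshoot toward 2x.
    rounds = math.ceil(train_batch_size / samples_per_round)
    return rounds * workers
```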
## Related issue number
Closes: https://github.com/ray-project/ray/issues/3549
This PR provides a better error message when the generate_variants code
breaks. Also removes a comment about nesting dependencies.
This comes mainly as a hotfix for #3466. We should leave that issue open for future contributions 🙂
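A minimal sketch of the idea, assuming a wrapper around variant generation; the function name and message below are illustrative, not the exact code added to Tune:

```python
def generate_variants_with_context(unresolved_spec, generate_variants):
    # Re-raise failures with the offending spec attached so users can see
    # which part of their search space could not be resolved.
    try:
        yield from generate_variants(unresolved_spec)
    except Exception as e:
        raise ValueError(
            f"Failed to generate trial variants from spec {unresolved_spec!r}. "
            "Check that the values in the config are well-formed."
        ) from e
```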
* mb impala
* fix
* paropt
* update
* cpu warn
* on cpu
* fix mb
* doc
* docs
* comment
* larger num
* early release
* remove grad clip
* only check loader count in multi gpu mode
* revert bad multigpu changes
* num sgd iter
* comment
* reuse optimizer
* add test
* par load test
* loosen test
* Update run_multi_node_tests.sh
* fix local mode
* Update agent.py
* Modify: add interface for model
* Modify: remove single quotes and build; add metrics
* Modify: flatten into list of dict
* Update distributed_sgd.rst
* Modify: update format with scripts/format.sh
* Update sgd_worker.py