* Implement Actor checkpointing
* docs
* fix
* fix
* fix
* move restore-from-checkpoint to HandleActorStateTransition
* Revert "move restore-from-checkpoint to HandleActorStateTransition"
This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.
* resubmit waiting tasks when actor frontier restored
* add doc about num_actor_checkpoints_to_keep=1
* add num_actor_checkpoints_to_keep to Cython
* add checkpoint_expired api
* check if actor class is abstract
* change checkpoint_ids to long string
* implement java
* Refactor to delay actor creation publish until checkpoint is resumed
* debug, lint
* Erase from checkpoints to restore if task fails
* fix lint
* update comments
* avoid duplicated actor notification log
* fix unintended change
* add actor_id to checkpoint_expired
* small java updates
* make checkpoint info per actor
* lint
* Remove logging
* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActorManager
* Replace old actor checkpointing tests
* Fix test and lint
* address comments
* consolidate kill_actor
* Remove __ray_checkpoint__
* fix non-ascii char
* Loosen test checks
* fix java
* fix sphinx-build
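A minimal Python sketch of the checkpointable-actor hooks the commits above describe. Only `checkpoint_expired` and `num_actor_checkpoints_to_keep` are named in the commits; the other hook names, signatures, and the in-memory storage are assumptions for illustration only.

```python
import ray


@ray.remote
class CheckpointableCounter:
    def __init__(self):
        self.value = 0
        # A real implementation would persist checkpoints externally so they
        # survive a restart; a dict keeps this sketch short.
        self._checkpoints = {}

    def increment(self):
        self.value += 1
        return self.value

    def save_checkpoint(self, actor_id, checkpoint_id):
        # Assumed hook: persist enough state to resume after a failure.
        self._checkpoints[checkpoint_id] = self.value

    def load_checkpoint(self, actor_id, available_checkpoints):
        # Assumed hook: restore from the newest checkpoint the backend still
        # tracks and return the id of the checkpoint that was used.
        checkpoint_id = available_checkpoints[0]
        self.value = self._checkpoints[checkpoint_id]
        return checkpoint_id

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # Called once a checkpoint falls outside the retention window set by
        # num_actor_checkpoints_to_keep; drop any data kept for it.
        self._checkpoints.pop(checkpoint_id, None)
```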
* Stream logs to driver by default.
* Fix from rebase
* Redirect raylet output independently of worker output.
* Fix.
* Create redis client with services.create_redis_client.
* Suppress Redis connection error at exit.
* Remove thread_safe_client from redis.
* Shutdown driver threads in ray.shutdown().
* Add warning for too many log messages.
* Only stop threads if worker is connected.
* Only stop threads if they exist.
* Remove unnecessary try/excepts.
* Fix
* Only add new logging handler once.
* Increase timeout.
* Fix tempfile test.
* Fix logging in cluster_utils.
* Revert "Increase timeout."
This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.
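With worker logs now streamed to the driver by default, opting out looks roughly like the snippet below; the exact `ray.init` keyword (`log_to_driver` is assumed here) may differ by version.

```python
import ray

# Worker stdout/stderr is forwarded to the driver by default after this change;
# pass log_to_driver=False (assumed keyword) to silence it.
ray.init(log_to_driver=False)
```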
* Retry longer when connecting to plasma store from node manager and object manager.
* Close pubsub channels to avoid leaking file descriptors.
* Limit log monitor open files to 200.
* Increase plasma connect retries.
* Add comment.
* added store_client_ to object_manager and node_manager
* halfway through...
* all code in, and compiling! Nothing tested though...
* something is working ;-)
* added a few more comments
* now, add only one entry in the GCS for inlined objects
* more comments
* remove a spurious todo
* some comment updates
* add test
* added metadata support for inline objects
* avoid some copies
* Initialize plasma client in tests
* Better comments. Enable configuring inline_object_max_size_bytes.
* Update src/ray/object_manager/object_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* fixed comments
* fixed various typos in comments
* updated comments in object_manager.h and object_manager.cc
* addressed all comments...hopefully ;-)
* Only add eviction entries for objects that are not inlined
* fixed a bunch of comments
* Fix test
* Fix object transfer dump test
* lint
* Comments
* Fix test?
* Fix test?
* lint
* fix build
* Fix build
* lint
* Use const ref
* Fixes, don't let object manager hang
* Increase object transfer retry time for travis?
* Fix test
* Fix test?
* Add internal config to java, fix PlasmaFreeTest
## What do these changes do?
* Improved --no-cuda handling
* Removed deprecated Variable usage
## Related issue number
Fixes #3873
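For context on the deprecated `Variable` usage removed above: since PyTorch 0.4, tensors track autograd state directly, so the wrapper is unnecessary. A small illustrative snippet (not taken from the changed files):

```python
import torch

# Old style: x = torch.autograd.Variable(torch.ones(3), requires_grad=True)
# Current style: tensors carry autograd state themselves.
x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()
y.backward()
print(x.grad)  # tensor([2., 2., 2.])
```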
- NodeUpdater gets its IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)
## Related issue number
Issue #3599
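A hypothetical sketch of the batch-terminate shim mentioned above; the actual `NodeProvider` interface in `ray.autoscaler` may differ.

```python
class NodeProvider:
    def terminate_node(self, node_id):
        raise NotImplementedError

    def terminate_nodes(self, node_ids):
        # Default shim: terminate nodes one by one. Providers such as
        # AWSNodeProvider can override this with a single batched cloud call,
        # and the autoscaler invokes the batched form.
        for node_id in node_ids:
            self.terminate_node(node_id)
```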
* Factor out starting Ray processes.
* Detect flags through environment variables.
* Return ProcessInfo from start_ray_process.
* Print valgrind errors at exit.
* Test valgrind in travis.
* Some valgrind fixes.
* Undo raylet monitor change.
* Only test plasma store in valgrind.
* add marwil policy graph
* fix typo
* add offline optimizer and enable running marwil
* fix loss function
* maintain a moving average of the advantage norm
* use sync replay optimizer for unifying
* remove offline optimizer and use sync replay optimizer
* format by yapf
* add imitation learning objective
* fix according to eric's review
* format by yapf
* revise
* add test data
* marwil
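A hedged sketch of launching the new MARWIL algorithm through Tune on previously collected offline data; the registered trainer name comes from the commits above, while the config keys (`input`, `beta`) and the data path are assumptions.

```python
import ray
from ray import tune

ray.init()
tune.run(
    "MARWIL",
    config={
        "env": "CartPole-v0",
        # Assumed keys: path to offline experience data and the advantage
        # weighting coefficient (beta=0 reduces to plain imitation learning).
        "input": "/tmp/cartpole-out",
        "beta": 1.0,
    },
)
```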
* Refactor code about ray.ObjectID.
* remove from_random and use nil_id instead of constructor
* remove id() in hash
* Lint and fix
* Change driver id to ObjectID
* Replace binary_to_hex(ObjectID.id()) with ObjectID.hex()
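In practical terms, the refactor replaces the helper-based conversion with a method on the ID itself; a small before/after sketch (the `ray.utils` import path is assumed):

```python
import ray
from ray.utils import binary_to_hex  # helper used by the old call sites

ray.init()
object_id = ray.put("hello")

# Before the refactor:
hex_old = binary_to_hex(object_id.id())
# After the refactor:
hex_new = object_id.hex()
assert hex_old == hex_new
```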
Rename `xray_test.py` to `mini_test.py` and use that in the documentation. Right now we suggest that people run `runtest.py`, but that often doesn't succeed and takes too long.
* Implement Node class and move most of services.py into it.
* Wait for nodes as they are added to the cluster.
* Fix Redis authentication bug.
* Fix bug in client table ordering.
* Address comments.
* Kill raylet before plasma store in test.
* Minor
* Convert UniqueID::nil() to a constructor
* Cleanup actor handle pickling code
* Add new actor handles to the task spec
* Pass in new actor handles
* Add new handles to the actor registration
* Regression test for actor handle forking and GC
* lint and doc
* Handle pickled actor handles in the backend and some refactoring
* Add regression test for dummy object GC and pickled actor handles
* Check for duplicate actor tasks on submission
* Regression test for forking twice, fix failed named actor leak
* Fix bug for forking twice
* lint
* Revert "Fix bug for forking twice"
This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.
* Add new actor handles when task is assigned, not finished
* Remove comment
* remove UniqueID()
* Updates
* update
* fix
* fix java
* fixes
* fix
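The handle-forking behavior covered by the regression tests above can be illustrated with a small driver-side example: passing an actor handle into a task creates a new ("forked") handle to the same actor, and tasks submitted through either handle reach the same instance.

```python
import ray

ray.init()


@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


@ray.remote
def fork(counter):
    # `counter` is a forked handle to the same actor as the driver's handle.
    return ray.get(counter.increment.remote())


counter = Counter.remote()
# All tasks, whether submitted via the original or a forked handle, hit the
# same actor instance.
results = ray.get([fork.remote(counter) for _ in range(3)])
print(sorted(results))  # [1, 2, 3]
```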
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.
* Linting
* Add back logging statements.
* info -> debug
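A rough sketch of the test-cluster flow that `cluster.wait_for_nodes()` supports; the module path and constructor arguments are assumptions and vary across Ray versions.

```python
import ray
from ray.tests.cluster_utils import Cluster  # assumed module path

cluster = Cluster()
for _ in range(3):
    cluster.add_node(num_cpus=1)

ray.init(redis_address=cluster.redis_address)
# Block until every added node appears in the client table.
cluster.wait_for_nodes()
```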