hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Author	SHA1	Message	Date
Hao Chen	1bb20badec	[Java] Fix bug when actor creation task fails (#3740) * [Java] Fix bug when actor creation task fails * remove imports	2019-01-14 11:09:15 +08:00
Robert Nishihara	27c20a41a9	Update stress tests. (#3614) Starts clusters for testing and has a fallback to kill the cluster if the command fails. The results are then printed at the end of test.	2019-01-13 17:08:51 -08:00
Eugene Vinitsky	a5d1f03515	[rllib] fix for rollout of lstm policies (#3643) * fix for lstm policies * added call to local evaluator * Update python/ray/rllib/rollout.py Co-Authored-By: eugenevinitsky <eugenevinitsky@users.noreply.github.com> * Update rollout.py * Update rollout.py	2019-01-13 15:54:23 -08:00
Philipp Moritz	00e9f8d870	Fix pyarrow version (#3760)	2019-01-13 14:28:23 -08:00
jhpenger	3adffe6a4e	[docs] Add example showing how to use Ray on Kubernetes. (#3126) Closes #1353.	2019-01-13 13:56:47 -08:00
Wang Qing	8674606e26	Support to auto-generate Java files from flatbuffer (#3749) * auto gen flatbuffers for Java * Add auto_gen_tool.py * Refine * Add a comment * address comments. * Address comments. * Addressed * Refine * Address comments * Fix typo * Add exception * Address comments. * Refine * Fix lint * Fix * Fix lint and address comment. * Fix lint error	2019-01-13 11:39:23 -08:00
Yuhong Guo	d2cf8561f2	Refactor code about ray.ObjectID. (#3674) * Refactor code about ray.ObjectID. * remove from_random and use nil_id instead of constructor * remove id() in hash * Lint and fix * Change driver id to ObjectID * Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()	2019-01-13 01:47:29 -08:00
Eric Liang	c4b058739b	Remove redundant error message (#3761)	2019-01-12 22:22:41 -08:00
Richard Liaw	bdeeacc70f	[autoscaler] RecoverUnhealthyWorker mitigation (#3699) Increases number of retries for RecoverUnhealthyWorkers Closes #3435.	2019-01-12 14:06:53 -08:00
Robert Nishihara	1480f309c3	[doc] Replace runtest.py with mini_test.py in documentation. (#3750) Rename `xray_test.py` to `mini_test.py` and use that in the documentation. Right now we suggest that people run `runtest.py`, but that often doesn't succeed and takes too long.	2019-01-12 14:05:28 -08:00
James Casbon	528bb3afd9	gcp allow manual network configuration (#3748)	2019-01-12 14:02:20 -08:00
Robert Nishihara	fbea1ece2e	Clear new actor handle list after submitting task. (#3755)	2019-01-12 23:25:40 +08:00
Wang Qing	0a556dc0b5	Refine redis client (#3758)	2019-01-12 23:01:48 +08:00
Wang Qing	a0cf8ee5a8	Refine Java worker code (#3735)	2019-01-12 22:45:33 +08:00
Robert Nishihara	8723d6b061	Define a Node class to manage Ray processes. (#3733) * Implement Node class and move most of services.py into it. * Wait for nodes as they are added to the cluster. * Fix Redis authentication bug. * Fix bug in client table ordering. * Address comments. * Kill raylet before plasma store in test. * Minor	2019-01-11 22:30:38 -08:00
Wang Qing	fa2bfa6d76	Fix some small code quality issues. (#3719)	2019-01-11 15:24:49 +08:00
Stephanie Wang	cc5ecd71c5	[autoscaler] Add kill and get IP commands to CLI for testing (#3731) ## What do these changes do? Adds 2 commands to the CLI that take in an autoscaler config: 1. Kill a random ray node in the cluster. 2. Get all the worker node IP addresses. These commands are both for testing and are not recommended for normal use. ## Related issue number Closes #3685.	2019-01-10 22:06:57 -08:00
Richard Liaw	574f0b73bc	[tune] Fix Trial Serialization (#3743)	2019-01-10 19:26:10 -08:00
Hao Chen	597abb24ea	Refine multi-threading support (#3672) * [Python] refine multi-threading support fix * [java] refine multithreading code fix java * format	2019-01-10 13:58:11 -08:00
Eric Liang	71243203a4	[rllib] Fix KeyError: 'kl' in multiagent ppo training	2019-01-09 19:33:07 -08:00
Hao Chen	6fc3fc4120	Cap task lease timeout (#3707)	2019-01-09 17:19:48 -08:00
Richard Liaw	edb7aaf7c7	[tune] Better Serialization for Server (#3708) * Add cloudpickle for serialization * Fix tests	2019-01-09 11:55:32 -08:00
Stephanie Wang	04f31db54d	Actor dummy object garbage collection (#3593) * Convert UniqueID::nil() to a constructor * Cleanup actor handle pickling code * Add new actor handles to the task spec * Pass in new actor handles * Add new handles to the actor registration * Regression test for actor handle forking and GC * lint and doc * Handle pickled actor handles in the backend and some refactoring * Add regression test for dummy object GC and pickled actor handles * Check for duplicate actor tasks on submission * Regression test for forking twice, fix failed named actor leak * Fix bug for forking twice * lint * Revert "Fix bug for forking twice" This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac. * Add new actor handles when task is assigned, not finished * Remove comment * remove UniqueID() * Updates * update * fix * fix java * fixes * fix	2019-01-09 10:37:11 -08:00
Wenting Shen	3027dde303	Fix some storage problems of RayLog (#3595) 1. Fix the problem of duplicated stored logs. 2. Save log whose level is higher than severity_threshold, not only with severity_threshold. 3. Fix a `log_dir` bug: storing logs in a wrong path.	2019-01-09 13:54:21 +08:00
Robert Nishihara	d1e21b702e	Change timeout from milliseconds to seconds in ray.wait. (#3706) * Change timeout from milliseconds to seconds in ray.wait. * Suppress warning. * Suppress warning. * Add prominent warning in API documentation.	2019-01-08 21:32:08 -08:00
Si-Yuan	59d861281e	Bug fixing: Redis password should be used when reporting errors. (#3724)	2019-01-08 21:23:55 -08:00
Robert Nishihara	6bbc667f93	Remove unused code path in services.py. (#3722)	2019-01-08 19:57:16 -08:00
Peter Schafhalter	5945b92fd3	[sgd] Add checkpointing (#3638)	2019-01-08 15:29:30 -08:00
Robert Nishihara	5e76d52868	Improve cluster.wait_for_nodes() API. (#3712) * Separate out functionality for querying client table and improve cluster.wait_for_nodes() API. * Linting * Add back logging statements. * info -> debug	2019-01-07 21:26:58 -08:00
Richard Liaw	33319502b6	[tune] Add a callable check for converting to trainable (#3711)	2019-01-07 16:18:29 -08:00
Robert Nishihara	5dadac148c	Remove unused file. (#3695)	2019-01-07 12:45:48 -08:00
Robert Nishihara	c9d70f0dda	Remove num_local_schedulers argument from ray.worker._init. (#3704) * Remove num_local_schedulers argument from ray.worker._init. * Fix * Fix tests.	2019-01-07 12:44:49 -08:00
Eric Liang	e78562b2e8	[rllib] Misc fixes: set lr for PG, better error message for LSTM/PPO, fix multi-agent/APEX (#3697) * fix * update test * better error * compute * eps fix * add get_policy() api * Update agent.py * better err msg * fix * pass in rew	2019-01-06 19:37:35 -08:00
Hao Chen	df0733cafb	Skip test_multiple_recursive (#3683) This test often hangs or fails in CI. Skip it for now to unblock other PRs.	2019-01-06 13:24:29 -08:00
Richard Liaw	8934e37a78	[tune] Change log handling for Tune (#3661) Also provides a small retry mechanism for a transient error as reported by #3340. Closes #3653.	2019-01-06 13:20:10 -08:00
mattearllongshot	681e8cd3fd	[autoscaler] Add an initial_workers option (#3530) ## What do these changes do? This option goes along with `min_workers`, and `max_workers`. When the cluster is first brought up (or when it is refreshed with a subsequent `ray up`) this number of nodes will be started. It's a workaround for issues of scaling (see related issues) where it can take a long time (or forever in the case where the head node has `--num-cpus 0`) to scale up a cluster in response to increasing demand. ## Related issue number Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106	2019-01-05 17:58:42 -08:00
Robert Nishihara	067976ad3d	Push a warning to all users when large number of workers have been started. (#3645) * Push a warning to all users when large number of workers have been started. * Add test. * Fix bug. * Give warning when worker starts instead of when worker registers. * Fix * Fix tests	2019-01-05 13:27:32 -08:00
Wang Qing	692fdc6bc3	[Java] Allow actor handle to be serialized without forking (#3686)	2019-01-06 00:29:08 +08:00
Eric Liang	03fe760616	[rllib] Model self loss isn't included in all algorithms (#3679)	2019-01-04 22:30:35 -08:00
Richard Liaw	960a943503	[tune] Fault Tolerance: handle lost checkpoints by restart (#3657) Checks that node failure with lost checkpoints does not crash. Also adds test.	2019-01-04 22:05:27 -08:00
Eric Liang	7db1f3be2a	[tune] resume=False by default but print a tip to set resume="prompt" + jenkins fix (#3681)	2019-01-04 17:23:19 -08:00
Kristian Hartikainen	747b117929	[tune] Tweak/allow nested pbt mutations (#3455) * Fix warning text in pbt logger * Allow nested mutations in pbt by recursing explore function * Add test for nested pbt mutation * Update pbt explore to only call custom explore on top level * fix test	2019-01-04 13:51:11 -08:00
Robert Nishihara	cd80891ddb	Try to figure out the memory limit in a docker container. (#3605) * Try to figure out the memory limit in a docker container. * Update comment * Fix * Fix	2019-01-03 23:07:24 -08:00
Robert Nishihara	586a5c9ffa	Limit default redis max memory to 10GB. (#3630) * Limit Redis max memory to 10GB/shard by default. * Update stress tests. * Reorganize * Update * Add minimum cap size for object store and redis. * Small test update.	2019-01-03 13:23:54 -08:00
Yuhong Guo	4b23a34c93	Fix multi-thread problem of function manager and Jenkins test (#3648)	2019-01-03 17:05:13 +08:00
Yuhong Guo	ad2287ebe9	Fix new boost libs failure in cache-lib mode and add test to cover collect_dependent_libs.sh (#3627) * Fix building breaks and add lib collection to Travis. * Fix arrow build * Fix version mismatch problem	2019-01-02 23:51:11 -08:00
Eric Liang	ca864faece	[rllib] Documentation for I/O API and multi-agent support / cleanup (#3650)	2019-01-03 15:15:36 +08:00
opherlieber	2177e2f410	[rllib] Agent: Allow unknown subkeys for custom_resources_per_worker (#3639) * RLLib Agent: Allow unknown subkeys for custom_resources_per_worker * Update agent.py	2019-01-03 14:19:59 +08:00
Eric Liang	47d36d7bd6	[rllib] Refactor pytorch custom model support (#3634)	2019-01-03 13:48:33 +08:00
Robert Nishihara	b6bcd18d65	Split profile table among many keys in the GCS. (#3676) * Divide profile table among many keys in GCS. * Fix, and remove --collect-profiling-data arg. * Remove reference in doc.	2019-01-02 21:33:01 -08:00

2593 commits