hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Author	SHA1	Message	Date
Eric Liang	079c4e482a	ray exec and ray attach commands (#2560). Usage: `ray exec CLUSTER CMD [--screen] [--start] [--stop]` and `ray attach CLUSTER [--start]`. Example: `ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop`. In one command, this creates a cluster and runs the command on it in a screen session. The screen can later be attached to via `ray attach`. After the command finishes, the cluster workers are terminated and the head node is stopped.	2018-08-15 14:31:50 -07:00
Eric Liang	53f9755594	[rllib] Fix support for mixed discrete and continuous action spaces, add to regression test (#2655) * fix * lint * fix	2018-08-15 10:19:41 -07:00
tianyapiaozi	98fed67b45	fix offset by one issue in the local scheduler (#2652)	2018-08-15 10:10:30 -07:00
Hao Chen	3c75e71afc	reduce noisy log messages from wget (#2656)	2018-08-15 09:10:28 -07:00
Yuhong Guo	eeb15771ba	Add `ray.internal.free` (#2542)	2018-08-14 22:01:23 -07:00
Philipp Moritz	f13e3e22f2	Upgrade arrow to include tensorflow op fix (#2607)	2018-08-14 21:47:01 -07:00
Stephanie Wang	62649715ca	[xray] Cache a task's object dependencies (#2623) * Cache a Task's object dependencies * Cache the parent task IDs for lineage cache entries * Cache the parent task IDs in lineage cache entries * revert * Fix test * remove unused line * Fix test	2018-08-14 20:25:41 -07:00
Stephanie Wang	dede80f3df	[xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642) * Loosen checks in the lineage cache and log appropriate warnings in the node manager * revert test	2018-08-14 15:25:32 -07:00
Yuhong Guo	4bd98eed45	Support building Java and Python version at the same time. (#2640) * Support building Java and Python version at the same time. * Remove duplicated definition. * Refine the building process of local_scheduler * Refine * Add comment for languages * Modify instruction and add python, java building to CI. * change according to comment	2018-08-14 11:33:51 -07:00
Mitar	493585574a	Updating documentation. (#2643)	2018-08-13 19:18:12 -07:00
Stephanie Wang	806fdf2f05	[xray] Object manager retries Pull requests (#2630) * Move all ObjectManager members to bottom of class def * Better Pull requests - suppress duplicate Pulls - retry the Pull at the next client after a timeout - cancel a Pull if the object no longer appears on any clients * increase object manager Pull timeout * Make the component failure test harder. * note * Notify SubscribeObjectLocations caller of empty list * Address melih's comments * Fix wait... * Make component failure test easier for legacy ray * lint	2018-08-13 19:15:55 -07:00
efang96	baba624373	updated agent.compute_action to return rnn state (#2581) * updated agent.compute_action to return rnn state * updated compute_action method, added case for state=None * fixing lint	2018-08-13 18:04:42 -07:00
Mitar	8769b8ac32	Fixing docstring. (#2638)	2018-08-13 16:19:32 -07:00
Eric Liang	9559873d13	[rllib] tuple space shouldn't assume elements are all the same size (#2637) * fix * lint	2018-08-11 10:57:40 -07:00
Peter Schafhalter	230b9ab33b	[asv] Add benchmark for ray.wait (#2625) * Add benchmarks for ray.wait * Fix bug	2018-08-10 17:52:36 -07:00
Wang Qing	244337d381	[java] Support resources management in raylet mode. (#2606)	2018-08-10 12:44:18 -07:00
Stephanie Wang	4a7be6f46d	[xray] Make sure raylet does not crash if remote raylet dies (#2619) * Log a warning on remote object manager failures * Mark a task that failed to be forwarded as pending * Raylet component failure test and make it harder * Turn on component failure test for xray * Remove return status from ReleaseSender * lint	2018-08-09 20:36:30 -07:00
Jones Wong	007208d2bb	Support older TF versions and RMSProp in Impala (#2590). Changes: support TF versions < 1.5; support the RMSProp optimizer in Impala. Before TF 1.5, tf.reduce_sum() and tf.reduce_max() took an argument keep_dims, which was renamed to keepdims in later versions. The original Impala paper uses the RMSProp algorithm to optimize the model, so supporting it lets users reproduce those experiments. Without any tuning, i.e. using the same hyperparameters as with AdamOptimizer, it reaches "episode_reward_mean": 19.083333333333332 in Pong after consuming 3,610,350 samples.	2018-08-09 19:51:32 -07:00
Hao Chen	170e08cf02	fix a bug in killing unregistered workers (#2613)	2018-08-09 17:57:25 -07:00
Philipp Moritz	143a118fbf	[xray] Fix valgrind crash when memory profiling raylet (#2583) * use different random number generator to be compatible with older valgrind versions * seed from time * style * fix * remove more random devices * also remove random_device from global scheduler * rename mutex * linting	2018-08-09 15:37:17 -07:00
Stephanie Wang	f093ed1fc6	[xray] Fix crash in case of spurious reconstruction (#2609) * Exit if task already queued * address comments	2018-08-09 14:46:46 -07:00
Stephanie Wang	2de9bfc7e3	[xray] Log warnings for asio handlers that take too long (#2601) * Add fatal check for heartbeat drift * Log warning messages for handlers that take too long * Add debug labels to all ClientConnections	2018-08-09 14:39:23 -07:00
Stephanie Wang	d49b4bef0a	[xray] Basic task reconstruction mechanism (#2526). This implements basic task reconstruction in raylet. There are two parts to this PR: (1) Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary. (2) Task resubmission once a raylet becomes responsible for reconstructing a task. Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this: (1) Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR. (2) Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted). Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.	2018-08-09 07:24:37 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Alexey Tumanov	df7ee7ff1e	raylet memory corruption fixes (#2591) * raylet memory corruption fixes * add util function to translate boost error to ray status * tcp client connection now using ray status utility function * lint	2018-08-08 19:50:43 -07:00
Stephanie Wang	6ab01a2cad	[xray] Fix bug when counting a task's lineage size (#2600)	2018-08-08 00:00:17 -07:00
Ujval Misra	a0691ee49b	[xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534) * Add set to lineage cache entry to track nodes already forwarded to. * Uncommitted lineage function naming, documentation. * Simple test for uncommitted lineage with a marked task. * Rebased, changed tests to use ClientID::nil. * Bug fix, change MergeLineageHelper function type. * Formatting. * Checks and test changes based on PR comments. * GetUncommittedLineage now always returns at least the requested task ID. * Bug fix (return at least requested task ID) * Formatting	2018-08-07 21:10:23 -07:00
Eric Liang	64053278aa	[tune] Support lambda functions in hyperparameters / tune rllib multiagent support (#2568) * update * func * Update registry.py * revert	2018-08-07 16:29:21 -07:00
Philipp Moritz	e7f76d7914	[xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586)	2018-08-07 13:45:51 -07:00
Richard Liaw	bb44456f6f	[rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565)	2018-08-07 12:17:44 -07:00
Philipp Moritz	a3202f581c	[xray] Add flag to start raylet in valgrind (#2582)	2018-08-07 11:25:21 -07:00
Philipp Moritz	25f0094ee4	Fix copying the plasma fbs directory from arrow (#2579)	2018-08-07 00:04:37 -07:00
Yuhong Guo	d35ce7fa63	Use real callback index in subscribe_callback_index_ (#2473)	2018-08-06 15:29:56 -07:00
Yuhong Guo	9825da7233	Change training tasks to xray for Jenkins tests (#2567)	2018-08-06 13:35:26 -07:00
Alexey Tumanov	85b8b2a395	mark all remaining placeable tasks pending with task dependency manager (#2528)	2018-08-06 13:08:11 -07:00
Eric Liang	981d9818c1	[rllib] Support the timesteps_per_batch in simple optimizer PPO mode (#2558) * support ts * doc * Update sync_samples_optimizer.py	2018-08-06 12:10:59 -07:00
Mitar	9015e742c4	Update installation instructions with psmisc to enable 'ray stop' (#2550)	2018-08-05 23:58:58 -07:00
Wang Qing	3845c294c3	[java] Fix java raylet wait (#2553)	2018-08-05 23:49:54 -07:00
Melih Elibol	34d3a46f48	[xray] Revert dynamic chunk size optimization for ObjectManager. (#2557) * Revert dynamic chunk size optimization. * fix mac build issues.	2018-08-05 02:09:37 -07:00
Richard Liaw	914a433e3f	[tune] Split Search from Scheduling (#2452) Introduces SearchAlgorithm concept, separate from schedulers in Tune. Moves HyperOpt under this concept.	2018-08-04 21:27:39 -07:00
Eric Liang	9449d07eca	[rllib] Fix crash when setting horizon in multiagent. If a horizon is set, an env terminates without done=True.	2018-08-03 16:37:56 -07:00
Philipp Moritz	d5dda1ebf2	copy all files when installing pyarrow (#2547)	2018-08-02 17:06:37 -07:00
Philipp Moritz	5e59cc6a20	Update arrow to include plasma memory footprint reduction (#2545)	2018-08-02 14:37:37 -07:00
Peter Schafhalter	7a5f25248e	[rllib] Improve conv_filters documentation (#2540) * Improve conv_filters documentation * Update catalog.py * Update catalog.py	2018-08-02 14:29:40 -07:00
Eric Liang	f7ec292360	[rllib] Support agent.get_action in multiagent (#2543) * support get action on policy id * comment * grammar fixes * Update rllib-algorithms.rst	2018-08-02 13:35:53 -07:00
Yuhong Guo	d2ebe4d9a3	Fix frequent failure of Jenkins CI. (#2490)	2018-08-02 10:28:28 -07:00
Philipp Moritz	d8ba667175	Convert asserts in unittest to pytest (#2529)	2018-08-01 22:32:10 -07:00
Eric Liang	9ea57c2a93	[rllib] Basic IMPALA implementation (using deepmind's reference vtrace.py) (#2504). Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer; add AsyncSamplesOptimizer that implements the IMPALA architecture; integrate V-trace with a3c policy graph; audit V-trace integration; benchmark compare vs A3C and with V-trace on/off. PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.	2018-08-01 20:53:53 -07:00
Wang Qing	e4f68ff8cf	[Java Worker] Support raylet on Java (#2479)	2018-08-01 17:52:49 -07:00
Eric Liang	9a479b3a63	[rllib] Document creating an ensemble of envs; also add vector_index attribute to env config (#2513). This also removes the async resetting code in VectorEnv. While that improves benchmark performance slightly, it substantially complicates env configuration and probably isn't worth it for most envs. This makes it easy to efficiently support setups like Joint PPO (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/retro-contest/gotta_learn_fast_report.pdf). For example, for 188 envs, you could do something like num_envs: 10, num_envs_per_worker: 19.	2018-08-01 16:29:27 -07:00

2105 commits