Commit graph

1889 commits

Author SHA1 Message Date
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that failed to be forwarded as pending

* Add raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Jones Wong
007208d2bb Support older version TF and Support RMSProp in Impala (#2590)
to support TF version < 1.5
to support rmsprop optimizer in Impala

Before TF 1.5, tf.reduce_sum() and tf.reduce_max() take an argument keep_dims, which was renamed to keepdims in later versions.
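
A minimal sketch of how the rename could be handled, assuming the version string is available (for example from tf.__version__). The helper name `reduce_kwargs` is hypothetical and not taken from this commit; it only picks the keyword name, so it runs without TensorFlow installed.

```python
def reduce_kwargs(tf_version, keep=True):
    """Return the keep-dims keyword argument matching a TF version string."""
    # Only (major, minor) are compared; suffixes such as "1.5.0-rc1" are ignored.
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    name = "keepdims" if (major, minor) >= (1, 5) else "keep_dims"
    return {name: keep}


# Intended use, assuming TensorFlow is imported as tf:
#   total = tf.reduce_sum(x, axis=1, **reduce_kwargs(tf.__version__))
```

Building the keyword dict once keeps the version check out of every call site.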

The original IMPALA paper uses the RMSProp algorithm to optimize the model, so we should support it too so that users can reproduce its experiments. Without any tuning, i.e. using the same hyperparameters as AdamOptimizer, it reaches "episode_reward_mean": 19.083333333333332 in Pong after consuming 3,610,350 samples.
2018-08-09 19:51:32 -07:00
Hao Chen
170e08cf02 fix a bug in killing unregistered workers (#2613) 2018-08-09 17:57:25 -07:00
Philipp Moritz
143a118fbf [xray] Fix valgrind crash when memory profiling raylet (#2583)
* use different random number generator to be compatible with older valgrind versions

* seed from time

* style

* fix

* remove more random devices

* also remove random_device from global scheduler

* rename mutex

* linting
2018-08-09 15:37:17 -07:00
Stephanie Wang
f093ed1fc6 [xray] Fix crash in case of spurious reconstruction (#2609)
* Exit if task already queued

* address comments
2018-08-09 14:46:46 -07:00
Stephanie Wang
2de9bfc7e3 [xray] Log warnings for asio handlers that take too long (#2601)
* Add fatal check for heartbeat drift

* Log warning messages for handlers that take too long

* Add debug labels to all ClientConnections
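
The idea of warning about slow handlers can be sketched in Python as a timing wrapper. This is an illustration only, not the raylet's C++ asio code; `warn_if_slow`, the logger name, and the threshold parameter are all hypothetical.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("handler_timing")  # hypothetical logger name


def warn_if_slow(label, threshold_ms, clock=time.monotonic):
    """Wrap a handler so a warning is logged when it runs longer than threshold_ms."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return handler(*args, **kwargs)
            finally:
                # Measure even if the handler raised, then warn on slow runs.
                elapsed_ms = (clock() - start) * 1000.0
                if elapsed_ms > threshold_ms:
                    logger.warning("Handler %s took %.1f ms", label, elapsed_ms)
        return wrapper
    return decorator
```

The `label` argument plays the role of the debug labels mentioned above: it identifies which connection or handler was slow.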
2018-08-09 14:39:23 -07:00
Stephanie Wang
d49b4bef0a [xray] Basic task reconstruction mechanism (#2526)
## What do these changes do?

This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
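
The two parts above can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the actual `TaskReconstructionLog` implementation: the class `ReconstructionLog`, its `append_at` method, and `try_reconstruct` are hypothetical names, and the "append only at the expected index" rule is assumed as the suppression mechanism.

```python
class ReconstructionLog:
    """Hypothetical in-memory stand-in for a shared reconstruction log."""

    def __init__(self):
        self._entries = {}  # task_id -> list of node ids, one per attempt

    def append_at(self, task_id, index, node_id):
        """Append node_id only if the log for task_id has exactly `index` entries."""
        entries = self._entries.setdefault(task_id, [])
        if len(entries) != index:
            return False  # another node already claimed this attempt
        entries.append(node_id)
        return True


def try_reconstruct(log, task_id, attempt, node_id, resubmit):
    """Resubmit the task only if this node wins the claim for `attempt`."""
    if log.append_at(task_id, attempt, node_id):
        resubmit(task_id)
        return True
    return False
```

If two nodes decide at the same time that attempt 0 is needed, only the first append succeeds, so the task is resubmitted once.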

Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
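
As an illustration of the first option, here is a minimal in-memory sketch of a lease table whose entries expire after a TTL in milliseconds, mimicking what `PEXPIRE` would provide. The class `LeaseTable` and its methods are hypothetical and the clock is injected so the behavior is deterministic; the real lease table lives in the GCS.

```python
class LeaseTable:
    """Hypothetical in-memory lease table with millisecond expiry."""

    def __init__(self, clock_ms):
        self._clock_ms = clock_ms
        self._leases = {}  # task_id -> (holder, expiry time in ms)

    def acquire(self, task_id, holder, ttl_ms):
        """Take or renew the lease unless another holder's lease is still live."""
        now = self._clock_ms()
        current = self._leases.get(task_id)
        if current is not None and current[1] > now and current[0] != holder:
            return False
        self._leases[task_id] = (holder, now + ttl_ms)
        return True

    def is_expired(self, task_id):
        """A missing or timed-out lease means reconstruction may be needed."""
        current = self._leases.get(task_id)
        return current is None or current[1] <= self._clock_ms()
```

With expiring entries, a node can treat an absent lease as the signal to check for reconstruction instead of waiting on its own timer.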

Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
2018-08-09 07:24:37 -07:00
Melih Elibol
8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00
Alexey Tumanov
df7ee7ff1e raylet memory corruption fixes (#2591)
* raylet memory corruption fixes

* add util function to translate boost error to ray status

* tcp client connection now using ray status utility function

* lint
2018-08-08 19:50:43 -07:00
Stephanie Wang
6ab01a2cad [xray] Fix bug when counting a task's lineage size (#2600) 2018-08-08 00:00:17 -07:00
Ujval Misra
a0691ee49b [xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534)
* Add set to lineage cache entry to track nodes already forwarded to.

* Uncommitted lineage function naming, documentation.

* Simple test for uncommitted lineage with a marked task.

* Rebased, changed tests to use ClientID::nil.

* Bug fix, change MergeLineageHelper function type.

* Formatting.

* Checks and test changes based on PR comments.

* GetUncommittedLineage now always returns at least the requested task ID.

* Bug fix (return at least requested task ID)

* Formatting
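
A rough Python sketch of the idea, assuming each lineage cache entry keeps a set of nodes it was already forwarded to and that traversal stops at entries the target node already has. `LineageEntry`, `uncommitted_lineage`, and `mark_forwarded` are hypothetical names, not the C++ API touched by this PR.

```python
class LineageEntry:
    def __init__(self, task_id, parents):
        self.task_id = task_id
        self.parents = list(parents)
        self.forwarded_to = set()  # node ids that already received this entry


def uncommitted_lineage(cache, task_id, node_id):
    """Collect task ids to send to node_id, always including task_id itself."""
    result = []
    seen = set()
    stack = [task_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in cache:
            continue
        seen.add(current)
        entry = cache[current]
        # Skip ancestors the node already has, but never skip the requested task.
        if current != task_id and node_id in entry.forwarded_to:
            continue
        result.append(current)
        stack.extend(entry.parents)
    return result


def mark_forwarded(cache, task_ids, node_id):
    """Record that node_id has now received these lineage entries."""
    for tid in task_ids:
        cache[tid].forwarded_to.add(node_id)
```

After forwarding task b (with parent a) to a node, forwarding its child c to the same node sends only c.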
2018-08-07 21:10:23 -07:00
Eric Liang
64053278aa [tune] Support lambda functions in hyperparameters / tune rllib multiagent support (#2568)
* update

* func

* Update registry.py

* revert
2018-08-07 16:29:21 -07:00
Philipp Moritz
e7f76d7914 [xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586) 2018-08-07 13:45:51 -07:00
Richard Liaw
bb44456f6f [rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565) 2018-08-07 12:17:44 -07:00
Philipp Moritz
a3202f581c [xray] Add flag to start raylet in valgrind (#2582) 2018-08-07 11:25:21 -07:00
Philipp Moritz
25f0094ee4 Fix copying the plasma fbs directory from arrow (#2579) 2018-08-07 00:04:37 -07:00
Yuhong Guo
d35ce7fa63 Use real callback index in subscribe_callback_index_ (#2473) 2018-08-06 15:29:56 -07:00
Yuhong Guo
9825da7233 Change training tasks to xray for Jenkins tests (#2567) 2018-08-06 13:35:26 -07:00
Alexey Tumanov
85b8b2a395 mark all remaining placeable tasks pending with task dependency manager (#2528) 2018-08-06 13:08:11 -07:00
Eric Liang
981d9818c1 [rllib] Support the timesteps_per_batch in simple optimizer PPO mode (#2558)
* support ts

* doc

* Update sync_samples_optimizer.py
2018-08-06 12:10:59 -07:00
Mitar
9015e742c4 Update installation instructions with psmisc to enable 'ray stop' (#2550) 2018-08-05 23:58:58 -07:00
Wang Qing
3845c294c3 [java] Fix java raylet wait (#2553) 2018-08-05 23:49:54 -07:00
Melih Elibol
34d3a46f48 [xray] Revert dynamic chunk size optimization for ObjectManager. (#2557)
* Revert dynamic chunk size optimization.

* fix mac build issues.
2018-08-05 02:09:37 -07:00
Richard Liaw
914a433e3f [tune] Split Search from Scheduling (#2452)
Introduces SearchAlgorithm concept, separate from schedulers in Tune. Moves HyperOpt under this concept.
2018-08-04 21:27:39 -07:00
Eric Liang
9449d07eca [rllib] Fix crash when setting horizon in multiagent
If a horizon is set, an env terminates without done=True.
2018-08-03 16:37:56 -07:00
Philipp Moritz
d5dda1ebf2 copy all files when installing pyarrow (#2547) 2018-08-02 17:06:37 -07:00
Philipp Moritz
5e59cc6a20 Update arrow to include plasma memory footprint reduction (#2545) 2018-08-02 14:37:37 -07:00
Peter Schafhalter
7a5f25248e [rllib] Improve conv_filters documentation (#2540)
* Improve conv_filters documentation

* Update catalog.py

* Update catalog.py
2018-08-02 14:29:40 -07:00
Eric Liang
f7ec292360 [rllib] Support agent.get_action in multiagent (#2543)
* support get action on policy id

* comment

* grammar fixes

* Update rllib-algorithms.rst
2018-08-02 13:35:53 -07:00
Yuhong Guo
d2ebe4d9a3 Fix frequent failure of Jenkins CI. (#2490) 2018-08-02 10:28:28 -07:00
Philipp Moritz
d8ba667175 Convert asserts in unittest to pytest (#2529) 2018-08-01 22:32:10 -07:00
Eric Liang
9ea57c2a93 [rllib] Basic IMPALA implementation (using deepmind's reference vtrace.py) (#2504)
* Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
* Add AsyncSamplesOptimizer that implements the IMPALA architecture
* Integrate V-trace with a3c policy graph
* Audit V-trace integration
* Benchmark against A3C and with V-trace on/off

PongNoFrameskip-v4 on IMPALA scales from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
2018-08-01 20:53:53 -07:00
Wang Qing
e4f68ff8cf [Java Worker] Support raylet on Java (#2479) 2018-08-01 17:52:49 -07:00
Eric Liang
9a479b3a63 [rllib] Document creating an ensemble of envs; also add vector_index attribute to env config (#2513)
This also removes the async resetting code in VectorEnv. While that improves benchmark performance slightly, it substantially complicates env configuration and probably isn't worth it for most envs.

This makes it easy to efficiently support setups like Joint PPO: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/retro-contest/gotta_learn_fast_report.pdf

For example, for 188 envs, you could do something like num_envs: 10, num_envs_per_worker: 19.
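
A small sketch of how an env creator might map its position to one member of the ensemble. It assumes ten workers with 19 envs each (190 slots for 188 levels) and a 0-based worker index; `pick_level` and the level names are hypothetical, and the function takes plain integers rather than any RLlib config object.

```python
def pick_level(levels, worker_index, vector_index, num_envs_per_worker):
    """Map a (worker, vector slot) pair to one level of the ensemble."""
    # Flatten the two indices into one global slot number, then wrap around
    # so that surplus slots reuse levels from the start of the list.
    global_index = worker_index * num_envs_per_worker + vector_index
    return levels[global_index % len(levels)]


levels = ["level-%d" % i for i in range(188)]  # hypothetical level names
slots = [pick_level(levels, w, v, 19) for w in range(10) for v in range(19)]
```

Under these assumptions every level is covered at least once, and the two surplus slots wrap around to the first two levels.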
2018-08-01 16:29:27 -07:00
Eric Liang
a630e332f3 [rllib] Don't use get_gpu_ids() in ppo
This lets the num_gpus config work properly even when not using tune, since the gpu ids won't be set by ray in that case.
2018-08-01 16:25:11 -07:00
Eric Liang
d9a36c4e39 [rllib] Document auto-concat in a3c (#2533)
* docs

* update hyperparm docs
2018-08-01 15:11:30 -07:00
Zhijun Fu
ca36827f01 [Issues 2403][xray] Fix raylet performance issues on scheduling queue (#2438)
* merge from ray
* Revert "merge from ray"
This reverts commit 32b181ebbb1fa184026631e1a7368112c4c3118d.
* fix raylet performance regression
* address comments
* Update code after merging latest changes
* fix lint
* address comments
2018-08-01 14:41:20 -07:00
Melih Elibol
89f60e39f3 Override user-specified name tag. (#2480)
2018-08-01 14:16:57 -04:00
Stephanie Wang
e90ecef297 [xray] Try to flush children of a task that is evicted from the lineage cache (#2531) 2018-08-01 00:23:02 -07:00
Robert Nishihara
909d7172b1 Introduce constant for ID_SIZE in python code. (#2517) 2018-07-31 12:40:53 -07:00
mehrdadn
64d00ff39e Remove Visual Studio projects (#2525) 2018-07-31 10:22:24 -07:00
Philipp Moritz
d9a019b8e5 Upgrade arrow to include pytorch fix (#2522)
This fixes https://github.com/ray-project/ray/issues/2520
2018-07-30 20:20:18 -07:00
Stephanie Wang
a45f9cfafc [xray] Implement task lease table, logic for deciding when to reconstruct a task (#2497) 2018-07-30 14:42:28 -07:00
Eric Liang
38d00986a5 [rllib] Cleanups: deep merge configs properly; enforce min iter time on APEX (#2500)
The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.
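
A minimal sketch of a deep dict merge, to show why overriding a single subkey should not discard its siblings. The function `deep_merge` and the config keys in the usage note are illustrative, not the exact code or defaults from this commit.

```python
def deep_merge(base, override):
    """Return a new dict with override merged into base, recursing into dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, merging {"optimizer": {"debug": True}} into defaults that also define other optimizer keys keeps those keys, whereas a shallow dict.update() would drop them.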
2018-07-30 13:25:35 -07:00
Eric Liang
62a52ee989 [rllib] Fix corner case in rnn episode handling
We should use episode ids instead of the timestep to determine when sequences should be cut, since when batches are concatenated, increasing t does not guarantee we are part of the same episode.
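
A sketch of cutting sequences on episode id rather than on the timestep, assuming one episode id per row of the batch. `chop_into_sequences` here is a simplified, hypothetical version that only returns sequence lengths.

```python
def chop_into_sequences(episode_ids, max_seq_len):
    """Return sequence lengths, cutting whenever the episode id changes."""
    seq_lens = []
    previous = None
    length = 0
    for eps_id in episode_ids:
        # Close the running sequence on an episode boundary or when it is full.
        if length > 0 and (eps_id != previous or length >= max_seq_len):
            seq_lens.append(length)
            length = 0
        previous = eps_id
        length += 1
    if length > 0:
        seq_lens.append(length)
    return seq_lens
```

With concatenated batches the timesteps may read 0, 1, 2, 3, 4 while the episode ids read 7, 7, 7, 9, 9; cutting on the ids yields sequences of length 3 and 2 instead of one sequence spanning two episodes.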
2018-07-30 13:24:43 -07:00
Philipp Moritz
696a229ece Fix text verbosity in python 2.7 by running tests with pytest (#2470) 2018-07-30 11:04:06 -07:00
Hao Chen
fe65f9fbbc improve java api doc (#2508) 2018-07-29 20:41:11 -07:00
Robert Nishihara
3f3514c2b3 Deprecate PYTHON_MODE more gracefully. (#2487) 2018-07-29 16:25:46 -07:00
Steve Severance
f1b4ea69a3 Prevent hasher from running out of memory on large files (#2451)
* Prevent hasher from running out of memory on large files

* dump out keys

* only print if failed

* remove debugging

* Fix lint error. Reverse adding newline.
2018-07-28 23:29:09 -07:00
Ion
80db69d245 State transition diagram documentation. (#2502)
* Added description of transition diagram and a few name changes for improved clarity.

* rename some methods and update task_states.rst
2018-07-28 22:28:45 -07:00