Commit graph

1931 commits

Author SHA1 Message Date
Philipp Moritz
b4c47a5861 Upgrade arrow to include more detailed flushing message (#2706) 2018-08-24 11:44:04 -07:00
Robert Nishihara
e467f546b5 Upgrade version of anaconda. (#2730) 2018-08-23 19:14:39 -07:00
Eric Liang
aa014af85b [rllib] Fix atari reward calculations, add LR annealing, explained var stat for A2C / impala (#2700)
Changes needed to reproduce Atari plots in IMPALA / A2C: https://github.com/ray-project/rl-experiments
2018-08-23 17:49:10 -07:00
Stephanie Wang
1b3de31ff1 [xray] Fix bug where driver task ID is assumed to be nil (#2725)
## What do these changes do?

#2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.
2018-08-23 14:44:47 -07:00
Yuhong Guo
344a83f327 Fix build failure of Arrow and Parquet when the folder is empty. (#2720) 2018-08-23 09:44:26 -07:00
Yuhong Guo
eec1a3eb89 Support pluggable backend log lib with glog (#2695)
* [WIP] Support different backend log lib

* Refine code, unify level, address comment

* Address comment and change formatter

* Fix linux building failure.

* Fix lint

* Remove log4cplus.

* Add log init to raylet main and add test to travis.

* Address comment and refine.

* Update logging_test.cc
2018-08-23 09:43:38 -07:00
old-bear
4be324efc3 [tune] Support infinity value in report result (#2693)
* + Compatibility fix under py2 on ray.tune

* + Revert changes on master branch

* + Use default JsonEncoder in ray.tune.logger

* + Add UT for infinity support
2018-08-22 13:09:14 -07:00
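As background for the infinity support in #2693: Python's standard `json` module already serializes infinite floats with its default settings, emitting the non-standard tokens `Infinity` and `-Infinity`. The following is a minimal sketch of that behavior only, not the code in `ray.tune.logger`; the `result` dict and its keys are made up for illustration.

```python
import json
import math

# Hypothetical result dict; the key names are illustrative only.
result = {"episode_reward_mean": float("inf"), "loss": float("-inf")}

# With the default allow_nan=True, infinities are written as the
# JavaScript-style tokens Infinity / -Infinity (not strict JSON).
encoded = json.dumps(result, sort_keys=True)

# The standard decoder accepts those tokens and restores the floats.
decoded = json.loads(encoded)

print(encoded)
print(math.isinf(decoded["episode_reward_mean"]))
```

Strict JSON parsers may reject these tokens, so this only round-trips safely between tools that accept them.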
joyyoj
38867eea4e [tune] Cross-Framework Compatibility (#2646)
This commit is a first pass at restructuring the Trial execution logic to support running on multiple frameworks.
2018-08-22 10:55:45 -07:00
Eric Liang
fbe6c59f72 [rllib] Misc fixes, A2C (#2679)
A bunch of minor rllib fixes:

* pull in latest baselines atari wrapper changes (and use deepmind wrapper by default)
* move reward clipping to policy evaluator
* add a2c variant of a3c
* reduce vision network fc layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in tune status
2018-08-20 15:28:03 -07:00
Yucong He
880ef1bd21 doc fix (#2696) 2018-08-20 14:11:32 -07:00
Robert Nishihara
89d4a6df93 Start Redis in protected mode when started via ray.init(). (#2697)
This PR makes it so that when Ray is started via ray.init() (as opposed to via ray start) the Redis servers will be started in "protected mode" (which means that clients can only connect by connecting to localhost).

In practice, we actually connect redis clients by passing in the node IP address (not localhost), so I need to create a redis config file on the fly to allow both localhost and the node's actual IP address (it would have been nice to find a way to do this from the Python redis client, but I couldn't find one).
2018-08-20 14:08:01 -07:00
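The "config file on the fly" idea from #2697 can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: the helper name `write_redis_config` is hypothetical, and the only Redis feature relied on is the `bind` configuration directive, which lists the addresses the server listens on.

```python
import os
import tempfile


def write_redis_config(node_ip_address):
    """Write a temporary Redis config file and return its path.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    # Listen on loopback and on the node's own address, so that clients
    # can connect through either one.
    contents = "bind 127.0.0.1 {}\n".format(node_ip_address)

    # mkstemp creates the file securely and returns an open descriptor.
    fd, path = tempfile.mkstemp(prefix="redis-", suffix=".conf")
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    return path


config_path = write_redis_config("10.0.0.5")  # example address
```

The resulting path would then be passed to `redis-server` as its config-file argument; the caller is responsible for deleting the file afterwards.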
Stephanie Wang
8fd5757aaa [xray] Don't process any more messages from dead node managers (#2688) 2018-08-19 21:11:40 -07:00
old-bear
230ac7aa80 [tune] Compatibility fix under py2 on str condition (#2673)
* * Compatibility fix under py2 on ray.tune

* + Fix compatibility

* + Use package six to achieve str compatibility
2018-08-19 20:43:03 -07:00
Eric Liang
9473da69bd [autoscaler] Experimental support for local / on-prem clusters (#2678)
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head IP and the list of worker IPs.

It also adds a couple of utilities for rsyncing files and port forwarding.
2018-08-19 12:43:04 -07:00
Richard Liaw
62d0698097 [tune] Tune Facelift (#2472)
This PR introduces the following changes:

 * Ray Tune -> Tune 
 * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule
 * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather than through initialization
 * Support `"run": (function | class | str)` with automatic registering of trainable
 * Documentation Changes
2018-08-19 11:00:55 -07:00
Hao Chen
78b6bfb7f9 [Java] Change log dir to /tmp/raylogs (#2677)
Currently, the log directory in Java is a relative path. This PR changes it to `/tmp/raylogs` (with the same format as Python, e.g., `local_scheduler-2018-51-17_17-8-6-05164.err`). It also cleans up some related code.
2018-08-18 23:46:36 -07:00
Eric Liang
e56eb354eb [tune] Remove hack to serve pin requests off thread (#2680)
* nopin

* fix
2018-08-18 13:19:52 -07:00
Robert Nishihara
aaf5456b3d Add test that tasks sent to actor on dead node raise exceptions. (#2626)
* Add actor failure test.

* Minor change.

* Make test harder.

* Change numbers a bit.

* Skip test for non xray.
2018-08-16 22:48:31 -07:00
Wang Qing
06a58016d8 [multi-language part 2] Change the command line arguments to start raylet (#2670) 2018-08-16 21:59:44 -07:00
Hao Chen
a719e089b0 [multi-language part 1] add a 'language' field to task specification (#2639) 2018-08-16 21:26:42 -07:00
Eric Liang
6670880f03 [rllib] Workaround actor creation hang edge case for ape-X (#2661)
* apex hang

* fix

* move pyt to end
2018-08-16 18:03:50 -07:00
Eric Liang
5f430da180 [rllib] Provide internal access to episode state in compute_actions() and allow returning extra batches (#2559)
The goal of this PR is to allow custom policies to perform model-based rollouts. In the multi-agent setting, this requires access to not only policies of other agents, but also their current observations.
Also, you might want to return the model-based trajectories as part of the rollout for efficiency.

* compute_actions() now takes a new keyword arg, episodes
* pull out internal episode class into a top-level file
* add function to return extra trajectories from an episode that will be appended to the sample batch
* documentation
2018-08-16 14:37:21 -07:00
Eric Liang
127cf291a3 Delete __init__.py (#2668) 2018-08-16 02:01:21 -07:00
Stephanie Wang
e3e0cfce87 [xray] Resubmit tasks that fail to be forwarded (#2645) 2018-08-16 00:12:56 -07:00
Hao Chen
dd924a388b silence progress log from 'git clone' and 'pip install' (#2667) 2018-08-15 22:54:35 -07:00
Philipp Moritz
6cb6dd30d1 silence shutdown callback (#2662) 2018-08-15 22:48:00 -07:00
Eric Liang
079c4e482a ray exec and ray attach commands (#2560)
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]

Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop

In one command, this creates a cluster and runs the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers are terminated and the head node is stopped.
2018-08-15 14:31:50 -07:00
Eric Liang
53f9755594 [rllib] Fix support for mixed discrete and continuous action spaces, add to regression test (#2655)
* fix

* lint

* fix
2018-08-15 10:19:41 -07:00
tianyapiaozi
98fed67b45 fix off-by-one issue in the local scheduler (#2652) 2018-08-15 10:10:30 -07:00
Hao Chen
3c75e71afc reduce noisy log messages from wget (#2656) 2018-08-15 09:10:28 -07:00
Yuhong Guo
eeb15771ba Add ray.internal.free (#2542) 2018-08-14 22:01:23 -07:00
Philipp Moritz
f13e3e22f2 Upgrade arrow to include tensorflow op fix (#2607) 2018-08-14 21:47:01 -07:00
Stephanie Wang
62649715ca [xray] Cache a task's object dependencies (#2623)
* Cache a Task's object dependencies

* Cache the parent task IDs for lineage cache entries

* Cache the parent task IDs in lineage cache entries

* revert

* Fix test

* remove unused line

* Fix test
2018-08-14 20:25:41 -07:00
Stephanie Wang
dede80f3df [xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642)
* Loosen checks in the lineage cache and log appropriate warnings in the node manager

* revert test
2018-08-14 15:25:32 -07:00
Yuhong Guo
4bd98eed45 Support building Java and Python version at the same time. (#2640)
* Support building Java and Python version at the same time.

* Remove duplicated definition.

* Refine the building process of local_scheduler

* Refine

* Add comment for languages

* Modify instruction and add python, java building to CI.

* change according to comment
2018-08-14 11:33:51 -07:00
Mitar
493585574a Updating documentation. (#2643) 2018-08-13 19:18:12 -07:00
Stephanie Wang
806fdf2f05 [xray] Object manager retries Pull requests (#2630)
* Move all ObjectManager members to bottom of class def

* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients

* increase object manager Pull timeout

* Make the component failure test harder.

* note

* Notify SubscribeObjectLocations caller of empty list

* Address melih's comments

* Fix wait...

* Make component failure test easier for legacy ray

* lint
2018-08-13 19:15:55 -07:00
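The three Pull behaviors listed in #2630 (suppress duplicates, retry at the next client after a timeout, cancel when no client holds the object) can be modeled with a small amount of bookkeeping. The sketch below is a toy Python model under stated assumptions: the class and method names are hypothetical, timers and networking are left out, and it is not a port of the C++ ObjectManager.

```python
class PullManager:
    """Toy model of Pull bookkeeping; all names here are hypothetical."""

    def __init__(self):
        # object_id -> {"clients": [...], "next": index of the next client}
        self._pulls = {}

    def pull(self, object_id, clients):
        """Start a Pull. Returns the client to ask, or None if nothing is sent."""
        if object_id in self._pulls:
            return None  # A Pull is already in flight: suppress the duplicate.
        if not clients:
            return None  # No known location, so there is nobody to ask.
        self._pulls[object_id] = {"clients": list(clients), "next": 1}
        return clients[0]

    def on_timeout(self, object_id):
        """Called when a Pull times out: retry at the next client, cycling."""
        state = self._pulls.get(object_id)
        if state is None:
            return None  # The Pull was cancelled or never started.
        clients = state["clients"]
        client = clients[state["next"] % len(clients)]
        state["next"] += 1
        return client

    def on_locations_changed(self, object_id, clients):
        """Called when the set of clients holding the object changes."""
        if object_id not in self._pulls:
            return
        if not clients:
            del self._pulls[object_id]  # Object is on no client: cancel.
        else:
            self._pulls[object_id]["clients"] = list(clients)
```

In this model the caller owns the timer: it would call `on_timeout` when a Pull has not completed in time and send the request to whichever client is returned.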
efang96
baba624373 updated agent.compute_action to return rnn state (#2581)
* updated agent.compute_action to return rnn state

* updated compute_action method, added case for state=None

* fixing lint
2018-08-13 18:04:42 -07:00
Mitar
8769b8ac32 Fixing docstring. (#2638) 2018-08-13 16:19:32 -07:00
Eric Liang
9559873d13 [rllib] tuple space shouldn't assume elements are all the same size (#2637)
* fix

* lint
2018-08-11 10:57:40 -07:00
Peter Schafhalter
230b9ab33b [asv] Add benchmark for ray.wait (#2625)
* Add benchmarks for ray.wait

* Fix bug
2018-08-10 17:52:36 -07:00
Wang Qing
244337d381 [java] Support resources management in raylet mode. (#2606) 2018-08-10 12:44:18 -07:00
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that failed to be forwarded as pending

* Raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Jones Wong
007208d2bb Support older version TF and Support RMSProp in Impala (#2590)
* Support TF versions < 1.5
* Support the RMSProp optimizer in IMPALA

Before TF 1.5, tf.reduce_sum() and tf.reduce_max() took an argument keep_dims, which was renamed to keepdims in later versions.

The original IMPALA paper optimizes the model with RMSProp, so we should support it too, letting users reproduce those experiments. Without any tuning (that is, using the same hyperparameters as for AdamOptimizer), it reaches "episode_reward_mean": 19.083333333333332 in Pong after consuming 3,610,350 samples.
2018-08-09 19:51:32 -07:00
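One generic way to cope with a keyword rename such as keep_dims to keepdims (#2590) is to inspect the callee's signature and pass whichever name it accepts. The sketch below shows that technique with stand-in functions; it does not import TensorFlow, the helper `keepdims_kwargs` is hypothetical, and it has not been checked against TensorFlow's decorated functions or against how the PR itself solved the problem.

```python
import inspect


def keepdims_kwargs(reduce_fn, keep=True):
    """Return the keyword dict that `reduce_fn` understands.

    Hypothetical helper: prefers the newer name `keepdims` and falls back
    to the older `keep_dims`.
    """
    params = inspect.signature(reduce_fn).parameters
    if "keepdims" in params:
        return {"keepdims": keep}
    return {"keep_dims": keep}


# Stand-ins for the old and new signatures (not the real TensorFlow ops).
def old_reduce_sum(values, axis=None, keep_dims=False):
    return ("old", keep_dims)


def new_reduce_sum(values, axis=None, keepdims=False):
    return ("new", keepdims)


old_result = old_reduce_sum([1, 2], axis=0, **keepdims_kwargs(old_reduce_sum))
new_result = new_reduce_sum([1, 2], axis=0, **keepdims_kwargs(new_reduce_sum))
```

An alternative is to branch on the library's version string; signature inspection avoids parsing versions but depends on the wrapper exposing the real parameter names.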
Hao Chen
170e08cf02 fix a bug in killing unregistered workers (#2613) 2018-08-09 17:57:25 -07:00
Philipp Moritz
143a118fbf [xray] Fix valgrind crash when memory profiling raylet (#2583)
* use different random number generator to be compatible with older valgrind versions

* seed from time

* style

* fix

* remove more random devices

* also remove random_device from global scheduler

* rename mutex

* linting
2018-08-09 15:37:17 -07:00
Stephanie Wang
f093ed1fc6 [xray] Fix crash in case of spurious reconstruction (#2609)
* Exit if task already queued

* address comments
2018-08-09 14:46:46 -07:00
Stephanie Wang
2de9bfc7e3 [xray] Log warnings for asio handlers that take too long (#2601)
* Add fatal check for heartbeat drift

* Log warning messages for handlers that take too long

* Add debug labels to all ClientConnections
2018-08-09 14:39:23 -07:00
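The idea in #2601 of warning when a handler runs too long can be illustrated with a timing wrapper. The raylet code is C++ and uses asio; the following is only a Python sketch of the general pattern, with a hypothetical `warn_if_slow` context manager, a made-up logger name, and an injectable clock so the behavior can be exercised without sleeping.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("handler_timing")  # illustrative logger name


@contextmanager
def warn_if_slow(label, threshold_ms, clock=time.monotonic):
    """Log a warning if the wrapped block takes longer than threshold_ms.

    `label` plays the role of a debug label identifying the handler.
    """
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > threshold_ms:
            logger.warning(
                "Handler %s took %.1f ms (threshold %.1f ms)",
                label, elapsed_ms, threshold_ms,
            )


# Usage: wrap the body of a handler.
with warn_if_slow("heartbeat", threshold_ms=100):
    pass  # handler work would go here
```

Using a monotonic clock keeps the measurement unaffected by wall-clock adjustments.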
Stephanie Wang
d49b4bef0a [xray] Basic task reconstruction mechanism (#2526)
## What do these changes do?

This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.

Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).

Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
2018-08-09 07:24:37 -07:00
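The reconstruction suppression described in #2526 can be pictured as a race on an append-only log: each node tries to record its reconstruction attempt at a specific position, and only one writer can win that position. The sketch below is a single-process Python model built on that assumption; the PR text does not spell out the mechanism, the class `ReconstructionLog` and its `append_at` method are hypothetical, and the real `TaskReconstructionLog` lives in the GCS.

```python
class ReconstructionLog:
    """In-memory stand-in for a per-task append-only log (hypothetical)."""

    def __init__(self):
        self._entries = {}  # task_id -> list of node ids, one per attempt

    def append_at(self, task_id, index, node_id):
        """Append node_id at position `index`.

        Succeeds only if the log currently has exactly `index` entries,
        so two nodes racing for the same position cannot both win.
        """
        log = self._entries.setdefault(task_id, [])
        if len(log) != index:
            return False
        log.append(node_id)
        return True


log = ReconstructionLog()

# Two nodes both saw zero earlier attempts and decide to reconstruct.
node_a_wins = log.append_at("task-1", 0, "node-a")
node_b_wins = log.append_at("task-1", 0, "node-b")

# Only the winner would go on to resubmit the task.
winners = [n for n, won in (("node-a", node_a_wins), ("node-b", node_b_wins)) if won]
```

A later reconstruction of the same task would target index 1, so the log also records how many times the task has been re-executed.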
Melih Elibol
8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00