Commit graph

1294 commits

Author SHA1 Message Date
Ion
3c32343c63 Ray signal (#3624) 2019-02-11 10:14:48 -08:00
Zhijun Fu
7097ba393b protect raylet against bad messages (#4003)
* protect raylet against bad messages

* address comments

* linting and regression test
2019-02-12 00:39:38 +08:00
Philipp Moritz
ab809bd927 update ray version to 0.7.0dev (#3995) 2019-02-10 19:56:42 -08:00
Eric Liang
8e9f2c923f [autoscaler] Use RLock in addition to FileLock 2019-02-10 19:16:43 -08:00
Yuhong Guo
5fb1efd60d Fix CI test failures (#4007) 2019-02-11 11:01:14 +08:00
bjg2
e703b9f49d [wingman -> rllib] Improved stats changes in AsyncSamplesOptimizer (#3966)
* added stats changes to optimizer

* changes timers

* fix python 2 compat

* improved optimizer throughput stats

* Update async_samples_optimizer.py

* fix python2 compat
2019-02-10 01:25:22 -08:00
Eric Liang
29322c7389 [rllib] Replay buffer for IMPALA should default to 0 slots. (#3971)
* disable replay

* make lq configurable

* leak test

* Update run_multi_node_tests.sh
2019-02-08 10:03:11 -08:00
Robert Nishihara
6a32b410bb Update versions from 0.6.2 -> 0.6.3 in the documentation. (#3981) 2019-02-07 20:57:37 -08:00
Robert Nishihara
ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Philipp Moritz
0aa74fb1fd Update cloudpickle to 0.8.0.dev0 (#3964) 2019-02-07 15:24:06 -08:00
Eric Liang
ae4bc7d6e8 [revert] [rllib] Add copy() in async samples optimizer 2019-02-07 14:14:39 -08:00
markgoodhead
5ce670cb36 [tune] Add Initial Parameter Suggestion for HyperOpt (#3944)
Allows users of the HyperOptSearch suggestion algorithm to specify initial experiment values to run (typically already known good baseline parameters within the domain specified)
2019-02-07 10:57:51 -08:00
Richard Liaw
5db1afef07 [tune] Support Custom Resources (#2979)
Support arbitrary resource declarations in Tune.

Fixes https://github.com/ray-project/ray/issues/2875
2019-02-07 00:29:19 -08:00
Stephanie Wang
d2b6db3db1 Bump version from 0.6.2 to 0.6.3 (#3972) 2019-02-06 19:11:16 -08:00
Eric Liang
04fc145a44 [autoscaler] Autoscaler hangs forever on non-zero exit code command (#3969) 2019-02-06 17:25:24 -08:00
Robert Nishihara
fa4eb8313d Suppress warning for serializing different unique ID types in Python. (#3872)
* Suppress warning for serializing different unique ID types in Python.

* Add _ID_TYPES variable.
2019-02-05 11:38:33 -08:00
vfdev
b2b8417790 [tune] Improve mnist_pytorch.py example (#3894)
## What do these changes do?

* Improved --no-cuda handling
* Removed deprecated Variable usage


## Related issue number

Fixes #3873 
2019-02-04 17:59:54 -08:00
William Ma
f067223c4a Allow Ray processes to be started inside of gdb and tmux. (#3847) 2019-02-04 15:23:39 -08:00
Wang Qing
e1c68a0881 Enable including Java worker for ray start command (#3838) 2019-02-04 16:23:43 +08:00
Eric Liang
7ef830bef1 [rllib] Add copy() in async samples optimizer to fix memory leak (#3938)
Fixes #3884.
2019-02-03 18:34:37 -08:00
Andrew Tan
8323419a6d [tune] Add SigOpt Integration (#3844) 2019-02-03 18:23:57 -08:00
Kristian Hartikainen
85294fb503 [autoscaler] node caching changes (#3937)
Breaks the node provider node getter into cached and non-cached versions.

Fixes #3930 by updating the node label fingerprint before updating labels.
Fixes #3935 by refreshing node cache if node ip is not found.
2019-02-03 17:48:07 -08:00
James Casbon
976f018dab [autoscaler] GCP: only call setIamPolicy if necessary (#3782) 2019-02-03 16:16:00 -08:00
James Casbon
b8cc176b4d [autoscaler] Document gcp subnet config (#3783)
Adds info to the gcp example yaml on using shared subnets.
2019-02-03 16:14:44 -08:00
Si-Yuan
9295ab8f60 Various Python code cleanups. (#3837) 2019-02-03 10:16:24 -08:00
Michael Luo
1a015e420b Optimal PPO Configs (10k reward in 1 hr) + PPO grad clipping implemented (#3934) 2019-02-02 22:10:58 -08:00
Richard Liaw
eab6dd72b5 [tune] logging fixes, better warnings, better cluster support (#3906) 2019-02-02 19:14:03 -08:00
Yuhong Guo
54cbb4396f Prepare socket file when start ray (#3925) 2019-02-02 12:53:36 +08:00
Eric Liang
0f81bc9a33 [rllib] on_train_result results do not get logged (#3865) 2019-02-01 20:32:07 -08:00
Robert Nishihara
e0f82fd260 Fix building python 3.7 wheel by installing newer numpy. (#3927) 2019-02-01 18:06:48 -08:00
Daniel Edgecumbe
315edab085 [autoscaler] Speedups (#3720)
- NodeUpdater gets its IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)

## Related issue number
Issue #3599
2019-02-01 02:46:32 -08:00
Daniel Edgecumbe
ff3c6af1d6 [autoscaler]: Remove assertion in info string (#3916)
Fixes #3903
2019-02-01 00:32:24 -08:00
Tianming Xu
1302fafc0b [Tune] Add export_formats option to export policy graphs (#3868)
In earlier PRs, PR#3585 and PR#3637, export_policy_model and export_policy_checkpoint were introduced for users to export TensorFlow models and checkpoints.

For Ray Tune users, these APIs are not accessible through YAML configurations.

In this pull request, the export_formats option is provided to enable users to choose the desired export format.
2019-01-31 17:07:27 -08:00
Kristian Hartikainen
b9eed2e86c [autoscaler] Move attach helper text under exec_cluster (#3920)
## What do these changes do?
Moves the attach command helper from cli commands to the actual `exec_cluster` function.
2019-01-31 17:01:24 -08:00
Peter Schafhalter
62a0a7bdc7 [tune] Add BayesOpt (#3864)
Adds BayesOpt as a Tune suggestion algorithm.
2019-01-31 16:54:17 -08:00
Jimpachnet
d3551dd8df [tune] Added possibility to execute infinite recovery retries for a trial (#3901)
Allows a trial to retry recovery indefinitely by setting _max_failures_ to a negative number.
2019-01-31 02:21:16 -08:00
Richard Liaw
d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* memory

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigial TODO
2019-01-30 21:01:12 -08:00
Robert Nishihara
d06d9fc5d7 Fix Python linting errors. (#3905) 2019-01-30 13:43:18 -08:00
Eric Liang
152375aa8a [rllib] Add evaluation option to DQN agent (#3835)
* add eval

* interval

* multiagent minor fix

* Update rllib.rst

* Update ddpg.py

* Update qmix.py
2019-01-29 21:19:53 -08:00
Eric Liang
fb73cedf70 [rllib] Add examples page, add hierarchical training example, delete SC2 examples (#3815)
* wip

* lint

* wip

* up

* wip

* update examples

* wip

* remove carla

* update

* improve envspec

* link to custom

* Update rllib-env.rst

* update

* fix

* fn

* lint

* ds

* ssd games

* desc

* fix up docs

* fix
2019-01-29 21:06:09 -08:00
Bruno Morier
c9819a721d Update tempfile_services.py (#3896)
Fix an invalid reference to os.errno. The errno attribute was removed from os in Python 3.7. The fix simply replaces it with the already imported errno module.
2019-01-29 19:33:02 -08:00
Eric Liang
c75038b945 [autoscaler] Updating a file in file mounts causes all worker nodes to get restarted 2019-01-27 17:41:37 -08:00
Stephanie Wang
ad9f1721d1 Fix object_manager_test.py::object_transfer_retry test (#3863) 2019-01-27 13:55:38 -08:00
Yuhong Guo
066fa8abf3 Fix monitor_test.py by waiting for monitor.py to start working (#3840)
* Wait for monitor.py to start working

* Checkout None result in state.py
2019-01-25 18:07:15 +08:00
Philipp Moritz
20162ce159 Compile raylet cython bindings with bazel (#3842) 2019-01-25 00:57:31 -08:00
Si-Yuan
48139cf861 Migrate Python C extension to Cython (#3541) 2019-01-24 09:17:14 -08:00
Eric Liang
04ec47cbd4 [rllib] annotate public vs developer vs private APIs (#3808) 2019-01-23 21:27:26 -08:00
Wang Qing
816406ea3d [Java] Fix setCurrentTask() in multi threading (#3821) 2019-01-23 20:45:30 +08:00
Robert Nishihara
0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Eric Liang
f0e6523323 [rllib] Don't call reset() unless necessary for multi-agent envs 2019-01-20 15:00:18 -08:00