2339 commits

Author SHA1 Message Date
Zhijun Fu
753ba76141 [Issue 2809][xray] Cleanup on driver detach (#2826)
This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass.

The following should happen when a driver exits (either gracefully or ungracefully).

#2797 should be enabled and pass.
Any actors created by the driver that are still running should be killed.
Any workers running tasks for the driver should be killed.
Any tasks for the driver in any node_manager queues should be removed.
Any future tasks received by a node manager for the driver should be ignored.
The driver death notification should only be received once.
2018-09-07 16:11:32 +08:00
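The cleanup steps listed in this commit message can be sketched roughly as follows. This is only an illustrative toy model, not the raylet's actual C++ implementation: `ToyNodeManager`, its fields, and its method names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Hypothetical stand-in for a node manager; not Ray's real class."""
    queued_tasks: list = field(default_factory=list)  # (driver_id, task_name)
    workers: dict = field(default_factory=dict)       # worker_id -> driver_id
    actors: dict = field(default_factory=dict)        # actor_id -> driver_id
    dead_drivers: set = field(default_factory=set)

    def handle_driver_death(self, driver_id):
        """Clean up after a driver. Returns False if already handled."""
        if driver_id in self.dead_drivers:
            # The death notification should only take effect once.
            return False
        self.dead_drivers.add(driver_id)
        # "Kill" actors and workers that belong to the driver.
        self.actors = {a: d for a, d in self.actors.items() if d != driver_id}
        self.workers = {w: d for w, d in self.workers.items() if d != driver_id}
        # Remove the driver's tasks from the queue.
        self.queued_tasks = [t for t in self.queued_tasks if t[0] != driver_id]
        return True

    def submit_task(self, driver_id, task_name):
        """Queue a task unless its driver is known to be dead."""
        if driver_id in self.dead_drivers:
            return False  # future tasks for a dead driver are ignored
        self.queued_tasks.append((driver_id, task_name))
        return True
```

In this sketch a second call to `handle_driver_death` for the same driver is a no-op, which mirrors the requirement that the notification be handled only once.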
Robert Nishihara
3f6ed537a4 Add ray.is_initialized() function. (#2818)
* Add ray.is_initialized() function.

* Add assert.
2018-09-06 21:20:59 -07:00
Eric Liang
e7db54bdb0 Log at INFO level by default (including in autoscaler). (#2824)
Before this change, the autoscaler `up` and related commands didn't print any info messages to the console at all. This was a regression from 0.5. @richardliaw @robertnishihara https://github.com/ray-project/ray/issues/2812
2018-09-06 13:31:19 -07:00
Wang Qing
7e13e1fd49 [Java] Remove non-raylet code in Java. (#2828) 2018-09-06 14:54:13 +08:00
Eric Liang
d81605e9e7 [tune] Add a time/timesteps since last restore metric (#2819)
* rsm

* always log to avoid changing schema for csv writer

* add iter since restore

* update

* criteria warn
2018-09-05 17:45:09 -07:00
Eric Liang
995ac24a2c [rllib] clarify train batch size for PPO (#2793)
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
2018-09-05 12:06:13 -07:00
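The kind of check described in this commit might look like the sketch below. It is not RLlib's actual code: the function name, the way waste is computed, and both thresholds (`warn_at`, `fail_at`) are assumptions chosen for illustration.

```python
import warnings


def check_ppo_batch_waste(samples_collected, train_batch_size,
                          warn_at=0.1, fail_at=0.5):
    """Return the fraction of collected samples that would be discarded.

    Hypothetical helper: warns on moderate waste, raises on severe waste.
    """
    if samples_collected <= 0:
        raise ValueError("samples_collected must be positive")
    used = min(samples_collected, train_batch_size)
    waste = 1.0 - used / samples_collected
    if waste > fail_at:
        raise ValueError(
            "{:.0%} of samples would be discarded".format(waste))
    if waste > warn_at:
        warnings.warn(
            "{:.0%} of samples would be discarded".format(waste))
    return waste
```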
Wang Qing
c87a9114cd Change the version number of Miniconda3. (#2829)
Change version number of Miniconda.

Change the version of Miniconda.
2018-09-05 12:05:04 -07:00
kary
4c0e2c3f58 [rllib] multi agent judge bug (#2821)
* fix multi agent judge bug

* Update policy_evaluator.py
2018-09-04 21:02:06 -07:00
Richard Liaw
72542c9016 [tune] Fix Pausing and Error Propagation (#2815)
* add new tests

* Try-catch errors from ray get

* longer pbt run

* Update pbt_example.py

* Split trial and result and fix tests
2018-09-04 15:22:11 -07:00
Yuhong Guo
dfb7c2be1e [Java] Add Plasma Free to Java code path (#2802) 2018-09-04 15:28:23 +08:00
Eric Liang
25ffe57a5c [rllib] Auto-synchronize filters for all agents (#2791)
This makes sure we always update the local filter, and adds an option to synchronize the remote filters as well. In APEX_DDPG we previously didn't do either. The first is needed for checkpoint correctness, the second might help performance.
2018-09-03 20:01:53 -07:00
Philipp Moritz
a34a7172b4 Remove gflags (#2813)
Seems like gflags is not needed. This *might* remove writing spurious files into the home directory on the RISE infrastructure.
2018-09-03 16:10:47 -07:00
Eric Liang
01b030bd57 [rllib] throw an error for continuous action spaces in IMPALA
We currently don't support this since the reference vtrace.py does not, though it could be an interesting extension.
2018-09-03 11:12:55 -07:00
Eric Liang
df4788e501 [rllib/tune] Add test for fractional gpu support in xray mode; add rllib support for fractional gpu (#2768)
* frac gpu

* doc

* Update rllib-training.rst

* yapf

* remove xray
2018-09-03 11:12:23 -07:00
Hao Chen
9d655721e5 [java] support creating an actor with parameters (#2817)
Previously `Ray.createActor` only supported creating an actor without any parameters. This PR adds support for creating an actor with parameters. Moreover, besides using a constructor, it's now also allowed to create an actor with a factory method. For more usage, please refer to `ActorTest.java`.
2018-09-03 09:53:03 -07:00
Eric Liang
b37a283053 [rllib] support local mode (#2795) 2018-09-02 23:02:19 -07:00
Robert Nishihara
0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00
Robert Nishihara
c71bbbc3af Add test (currently skipped) that drivers release resources when exiting. (#2797)
* Add test (currently skipped) that drivers release resources when exiting.

* Add test for ungraceful driver exit.

* Small fix.

* Small fix
2018-09-02 17:34:48 -07:00
Robert Nishihara
e5fd1d55a1 Ignore failing global scheduler valgrind test. (#2805) 2018-09-02 15:23:32 -07:00
Hao Chen
3b0a2c4197 [Java] improve Java API module (#2783)
The API module (`ray/java/api` dir) includes all public APIs provided by Ray; it should be the only module that normal Ray users need to face.

The purpose of this PR is to first improve the code quality of the API module. Subsequent PRs will improve other modules later. The changes of this PR include the following aspects:
1) Only keep interfaces in api module, to hide implementation details from users and fix circular dependencies among modules.
2) Document everything in the api module. 
3) Improve naming.
4) Add more tests for API. 
5) Also fix/improve related code in other modules.
6) Remove some unused code.

(Apologies for posting such a large PR. The Java worker code has lacked maintenance for a while, and there are a lot of code quality issues that need to be fixed. We plan to use a couple of large PRs to address them. After that, future changes will come in small PRs.)
2018-09-02 11:51:16 -07:00
Yuhong Guo
2691b3a11a Add signal handlers to improve debuggability (#2757)
* Add signal handlers to improve debuggability.

* Fix Linux compiling

* Fix Lint

* Change SIGILL case that happens in both Linux and macOS

* Add signal handler to main functions.

* Change handler name.

* Address comment

* Address comment.

* Fix Linux building failure

* Introduce RAII mechanism to SignalHandlers.

* Add InitShutdownWrapper to handle all RAII requirements

* Change util_test to signal_test

* Make sure shutdown is not nullptr.

* Using google::InstallFailureSignalHandler() instead of our own signal handler

* Refine code according to comment

* Fix valgrind test failure.

* remove Shutdown template

* consistency

* linting
2018-09-01 21:58:23 -07:00
Robert Nishihara
1c50082498 Re-enable sharded monitor test for xray, convert to pytest. (#2804) 2018-09-01 19:53:40 -07:00
Philipp Moritz
869ee8e25d Integrate plasma store list facility (#2752) 2018-09-01 16:53:51 -07:00
Alexey Tumanov
fdc9688226 [xray] push warning to driver for infeasible tasks (#2784)
This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.
2018-09-01 13:21:27 -07:00
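One way to picture an infeasibility check is sketched below. It assumes that "infeasible" means no node's total resource capacity can satisfy the task's demand; the commit message does not spell out the exact rule, and the function name and data layout are hypothetical.

```python
def is_feasible(task_demand, node_totals):
    """Return True if at least one node could ever run the task.

    task_demand: dict of resource name -> quantity required.
    node_totals: list of dicts of resource name -> total capacity per node.
    """
    return any(
        all(total.get(name, 0) >= qty for name, qty in task_demand.items())
        for total in node_totals
    )


def infeasible_warning(task_name, task_demand, node_totals):
    """Return a warning string for an infeasible task, else None."""
    if is_feasible(task_demand, node_totals):
        return None
    return "Task {} requires {} and cannot currently be scheduled.".format(
        task_name, task_demand)
```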
Philipp Moritz
4db196438b fix 'from ray.rllib import ppo' in doc (#2794) 2018-08-31 23:34:47 -07:00
Yucong He
5b45f0bdff [xray] Implementing Gcs sharding (#2409)
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
2018-08-31 15:54:30 -07:00
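The table-to-shard routing described above could be sketched as follows. This is a simplified illustration rather than the AsyncGcsClient code: the CRC32 hash, the string return values, and the `"ObjectTable"` example table are assumptions; only the list of tables pinned to the primary shard comes from the commit message.

```python
import zlib

# Tables that the commit message says stay on the primary shard.
PRIMARY_ONLY_TABLES = {"ClientTable", "ErrorTable", "TaskTable"}


def pick_shard(table_name, key, num_shards):
    """Return a label for the shard that should hold `key`.

    key must be bytes. CRC32 is used here only because it is stable across
    processes; the real hashing scheme may differ.
    """
    if table_name in PRIMARY_ONLY_TABLES or num_shards <= 0:
        return "primary"
    return "shard-{}".format(zlib.crc32(key) % num_shards)
```

For example, `pick_shard("ClientTable", b"abc", 4)` returns `"primary"`, while a key in any other table is mapped to one of the four shard labels.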
Robert Nishihara
eda6ebb87d Convert some unittests to pytest. (#2779)
* Convert multi_node_test.py to pytest.

* Convert array_test.py to pytest.

* Convert failure_test.py to pytest.

* Convert microbenchmarks to pytest.

* Convert component_failures_test.py to pytest and some minor quotes changes.

* Convert tensorflow_test.py to pytest.

* Convert actor_test.py to pytest.

* Fix.

* Fix
2018-08-31 11:24:15 -07:00
wangyiguang
3813ae34b3 [tune] Add AutoMLBoard: Monitoring UI (experimental) (#2574) 2018-08-31 00:26:44 -07:00
Ryan Sepassi
b6260003cb Some small changes (#2782)
* Add some imports that make it easier to build with Bazel
* Use "/tmp" paths for sockets in tests
* Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.
2018-08-30 22:42:49 -07:00
Richard Liaw
0347e6418b [tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708) 2018-08-30 16:18:56 -07:00
Robert Nishihara
224d38cbb2 Name Python threads. (#2767) 2018-08-30 11:08:24 -07:00
Robert Nishihara
32f7d6fcf5 Add back some tests for xray. (#2772) 2018-08-30 11:07:23 -07:00
Yuhong Guo
9f06c19edd Fix glog wheel failure on MacOS (#2775) 2018-08-30 09:06:19 -07:00
Wang Qing
514633456b [Java] Fix out-dated signatures of JNI methods (#2756)
1) Renamed the native JNI methods and some parameters of JNI methods. 
2) Fixed native JNI methods' signatures by `javah` tool.
3) Removed some useless native methods.
2018-08-30 17:59:29 +08:00
Robert Nishihara
ba7efafa67 Remove force_start argument from StartWorkerProcess. (#2762)
This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency workers are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.
2018-08-30 13:43:47 +08:00
Robert Nishihara
5021795190 Update documents to replace 0.5.0 with 0.5.2. (#2761)
* Update documents to replace 0.5.0 with 0.5.1.

* Update documentation from 0.5.1 -> 0.5.2.
2018-08-29 21:05:09 -07:00
Robert Nishihara
f4f3478b45 Bump version number to 0.5.2. (#2765) 2018-08-29 13:39:25 -07:00
Praveen Palanisamy
357c0d6156 [tune] Adds option to checkpoint at end of trials (#2754)
* Added checkpoint_at_end option. To fix #2740

* Added ability to checkpoint at the end of trials if the option is set to True

* checkpoint_at_end option added; Consistent with Experience and Trial runner

* checkpoint_at_end option mentioned in the tune usage guide

* Moved the redundant checkpoint criteria check out of the if-elif

* Added note that checkpoint_at_end is enabled only when checkpoint_freq is not 0

* Added test case for checkpoint_at_end

* Made checkpoint_at_end have an effect regardless of checkpoint_freq

* Removed comment from the test case

* Fixed the indentation

* Fixed pep8 E231

* Handled cases when trainable does not have _save implemented

* Constrained test case to a particular exp using the MockAgent

* Revert "Constrained test case to a particular exp using the MockAgent"

This reverts commit e965a9358ec7859b99a3aabb681286d6ba3c3906.

* Revert "Handled cases when trainable does not have _save implemented"

This reverts commit 0f5382f996ff0cbf3d054742db866c33494d173a.

* Simpler test case for checkpoint_at_end

* Preserved bools from losing their actual value

* Revert "Moved the redundant checkpoint criteria check out of the if-elif"

This reverts commit 783005122902240b0ee177e9e206e397356af9c5.

* Fix linting error.
2018-08-29 13:14:17 -07:00
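The interaction between `checkpoint_freq` and `checkpoint_at_end` described in this commit can be sketched as a single predicate. The function name and signature are hypothetical, and Tune's real trial runner is structured differently; the sketch only reflects the stated behavior that `checkpoint_at_end` takes effect regardless of `checkpoint_freq`.

```python
def should_checkpoint(iteration, done, checkpoint_freq=0,
                      checkpoint_at_end=False):
    """Decide whether to save a checkpoint after this training iteration."""
    if done and checkpoint_at_end:
        # Final checkpoint, independent of checkpoint_freq.
        return True
    # Periodic checkpoints are disabled when checkpoint_freq is 0.
    return checkpoint_freq > 0 and iteration % checkpoint_freq == 0
```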
Robert Nishihara
6edbbf4fbf Document the release process. (#2760) 2018-08-29 00:06:33 -07:00
Robert Nishihara
132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
2018-08-29 14:53:40 +08:00
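A rough Python analogue of the startup throttling in this commit is shown below. The real logic lives in the raylet's C++ worker pool; the function names here are hypothetical, and `os.cpu_count()` stands in for `std::thread::hardware_concurrency()`.

```python
import os


def maximum_startup_concurrency(requested=None):
    """Pick the startup concurrency limit, requiring it to be at least 1."""
    # os.cpu_count() may return None, much as the C++ call may return 0.
    detected = os.cpu_count() or 1
    value = detected if requested is None else requested
    if value < 1:
        raise ValueError("maximum_startup_concurrency must be at least 1")
    return value


def workers_to_start_now(num_workers, currently_starting, max_concurrency):
    """How many more worker processes may be launched right now."""
    return max(0, min(num_workers, max_concurrency - currently_starting))
```

With a limit of 4, a request for 10 workers launches only 4 immediately; the rest have to wait until some of the starting workers have registered.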
Mitar
3850e3ba64 Added extra logging related arguments to "ray start" (#2664) 2018-08-28 23:00:37 -07:00
Eric Liang
69d1354016 [rllib] Document ARS & rainbow (#2744)
* wip

* rainbow doc too

* e not used

* fix ppo doc

* clean list

* use same title
2018-08-28 18:13:36 -07:00
Robert Nishihara
6e1de19cc2 Bump version to 0.5.1. (#2755) 2018-08-28 16:52:17 -07:00
Robert Nishihara
b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing  heterogeneity-aware late-binding/work-stealing.
2018-08-28 00:03:34 -07:00
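The notion of load introduced in this PR (the sum of resource demand of queued ready tasks) can be illustrated with a few lines of Python. This is a sketch of the definition only, not the raylet's scheduling code; the function name and the dict-based task representation are assumptions.

```python
from collections import Counter


def local_load(ready_queue):
    """Sum the resource demand of all queued ready tasks.

    ready_queue: iterable of dicts mapping resource name -> quantity.
    """
    load = Counter()
    for demand in ready_queue:
        load.update(demand)  # adds quantities per resource name
    return dict(load)
```

Two queues with the same number of tasks can therefore report very different loads: three 1-CPU tasks give `{"CPU": 3}`, while three 4-CPU tasks give `{"CPU": 12}`, a difference that a plain task count would hide.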
adoda
90ae8f11df The function get_node_ip_address while catch an exception and return … (#2722)
…'127.0.0.1',

when we forbid the external network. Instead, we can get the IP address from the hostname.

The function get_node_ip_address catches an exception and returns '127.0.0.1' when external network access is forbidden. Instead, we can get the IP address from the hostname.

https://github.com/ray-project/ray/issues/2721
2018-08-27 22:24:49 -07:00
Wang Qing
b4cba9a49f [java] Fix the logic of generating TaskID (#2747)
## What do these changes do?
Because the logic for generating `TaskID` in Java is different from Python's, many tests fail when we change the `Ray Core` code.
In this change, I rewrote the logic for generating `TaskID` in Java so that it is the same as Python's.

In Java, we call the native method `_generateTaskId()` to generate a `TaskID`, which is also used in Python. We change `computePutId()`'s logic too.

## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)
2018-08-27 13:11:33 -07:00
Hao Chen
f37c260bdb [multi-language part 3] support multiple languages in raylet backend (#2672)
This PR enables multi-language support in the raylet backend.
- `Worker` class now has a `language` label;
- `WorkerPool`:
	- It now maintains one set of states for each language.
	- `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id.
	- `Size` and `StartWorkerProcess` functions now have an extra `language` parameter.
- `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.
2018-08-26 22:06:25 -07:00
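The idea of keeping one set of worker-pool state per language can be sketched as below. This toy class is not the raylet's `WorkerPool`: its name and methods are hypothetical, it tracks only idle workers, and it ignores the actor id that the real `PopWorker` also takes into account.

```python
from collections import defaultdict, deque


class ToyWorkerPool:
    """Hypothetical pool that keeps separate idle workers per language."""

    def __init__(self):
        self._idle = defaultdict(deque)  # language -> idle worker ids

    def push_worker(self, language, worker_id):
        """Register an idle worker for the given language."""
        self._idle[language].append(worker_id)

    def pop_worker(self, task_language):
        """Return an idle worker matching the task's language, or None."""
        idle = self._idle[task_language]
        return idle.popleft() if idle else None

    def size(self, language):
        """Number of idle workers for one language."""
        return len(self._idle[language])
```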
Yuhong Guo
0b6e08ebee Separate python logger module-wise (#2703)
## What do these changes do?
1. Separate the log related code to logger.py from services.py.
2. Allow users to modify logging formatter in `ray start`.

## Related issue number
https://github.com/ray-project/ray/pull/2664
2018-08-26 13:46:14 -07:00
Wang Qing
26d3c0655c [java] Improve UniqueID code. (#2723) 2018-08-26 12:32:57 -07:00