This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.
$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
--devices-per-worker=M --strategy=<simple|ps> \
--warmup --object-store-memory=10000000000
Images per second total
gpus total | simple | ps
========================================
1 | 218
2 (1 worker) | 388
4 (1 worker) | 759
4 (2 workers) | 176 | 623
8 (1 worker) | 985
8 (2 workers) | 349 | 1031
16 (2 nodes, 2 workers) | 600 | 1661
16 (2 nodes, 4 workers) | 468 | 1712 <--- OSDI perf was 1817
We found that there are large amount of pub-sub keys with no content in it (This case is worse when wait-id is used in the key name.).
This logic of deleting empty pub-sub keys from GCS was in legacy ray but not in raylet.
<!--
Thank you for your contribution!
Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request.
-->
## What do these changes do?
support raylet, which is started by java runManager, to start python default_worker.py .
So when doing local test of java call python task, it helps auto start python worker.
## Related issue number
<!-- Are there any issues opened that will be resolved by merging this change? -->
This is fixing a problem that @devin-petersohn observed on the windows subsystem for linux.
In theory, redis should be up once the async connect is happening and there should be no retries needed for the async connect. However on the windows subsystem for linux, the async connect was failing even though the synchronous one was working. Maybe windows has a different semantics here than linux.
<!--
Thank you for your contribution!
Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request.
-->
## What do these changes do?
remove TaskExecutionException, use RayException instead
<!-- Please give a short brief about these changes. -->
## Related issue number
<!-- Are there any issues opened that will be resolved by merging this change? -->
## What do these changes do?
Before this PR, if we want to specify some resources, we must do as following codes:
```java
@RayRemote(Resources={ResourceItem("CPU", 10)})
public static void f1() {
// do sth
}
@RayRemote(Resources={ResourceItem("CPU", 10)})
class Demo {
// sth
}
```
Unfortunately, it's no way for us to create another actor or task with different resources required.
After this PR, the thing will be:
```java
ActorCreationOptions option = new ActorCreationOptions();
option.resources.put("CPU", 4.0);
RayActor<Echo> echo1 = Ray.createActor(Echo::new, option);
option.resources.put("Res-A", 4.0);
RayActor<Echo> echo2 = Ray.createActor(Echo::new, option);
//if we don't specify resource, the resources will be `{"cpu":0.0}` by default.
Ray.call(Echo::echo, echo2, 100);
```
## Related issue number
N/A
## What do these changes do?
Fix the misleading comments in code for:
- `EPISODES_THIS_ITER`
- `EPISODES_TOTAL`
Had noted it before and planned to fix it along with some other changes but seemed very relevant to stay next to #3058 so sending this now.
* bugfix: env exists check error
* support to avoid re-build pyarrow in project
* bugfix: adapt gtest for centos lib64
* bugfix: check gtest lib exists in the directory
* bugfix: find gtest with checking all libs exists
* prefix RAY_ to thirdparty env variables to avoid conflicts with other module
* arrow use glog from ray
* change the glog and gtest install dir
## What do these changes do?
1. Add a configuration item `driver.resource-path`.
2. Load driver resources from the local path which is specified in the `ray.conf`.
Before this change, we should add all driver resources(like user's jar package, dependencies package and config files) into `classpath`.
After this change, we should add the driver resources into the mount path which we can configure it in `ray.conf`, and we shouldn't configure `classpath` for driver resources any more.
## Related issue number
N/A