* Policy that flushes the lineage stash immediately
* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to; add regression test
* test
* Fix bug where waiting task was getting subscribed
* Cleanup
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* cleanup
* cleanup
* Add another test for task with many parents
* Fix: unsubscribe from new waiting tasks
* Unsubscribe as soon as the commit notification is handled (see the sketch after this list)
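The actual implementation lives in C++ (`src/ray/raylet/lineage_cache.cc`); the following is only a minimal Java sketch, with hypothetical names, of the subscribe/unsubscribe flow these commits describe.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch (hypothetical names) of the flow in these commits:
// subscribe to remote tasks in the uncommitted lineage, never to locally
// waiting tasks, and unsubscribe as soon as the commit notification arrives.
class LineageCacheSketch {
  private final Set<String> subscribedTasks = new HashSet<>();

  // Called when a task with uncommitted lineage enters the cache.
  void addUncommittedTask(String taskId, boolean executedRemotely) {
    if (executedRemotely) {
      // Remote tasks will be committed by another node, so we must
      // subscribe to hear about the commit.
      subscribedTasks.add(taskId);
    }
    // Locally waiting tasks are flushed by this node itself, so no
    // subscription is needed (the bug fixed above subscribed to them).
  }

  // Called when the GCS notifies us that a task was committed.
  void handleCommitNotification(String taskId) {
    // Unsubscribe immediately, then evict the committed entry.
    subscribedTasks.remove(taskId);
    evict(taskId);
  }

  private void evict(String taskId) {
    // Remove the committed task (and any committed ancestors) from the cache.
  }
}
```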
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: overall scaling looks OK, with the multi-node results within 5% of the final OSDI numbers. This seems reasonable given that hugepages are not enabled here and the parameter server shards are placed randomly.
```
$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
    --devices-per-worker=M --strategy=<simple|ps> \
    --warmup --object-store-memory=10000000000
```
Images per second (total):

| gpus total              | simple | ps   |
|-------------------------|--------|------|
| 1                       | 218    |      |
| 2 (1 worker)            | 388    |      |
| 4 (1 worker)            | 759    |      |
| 4 (2 workers)           | 176    | 623  |
| 8 (1 worker)            | 985    |      |
| 8 (2 workers)           | 349    | 1031 |
| 16 (2 nodes, 2 workers) | 600    | 1661 |
| 16 (2 nodes, 4 workers) | 468    | 1712 (OSDI perf was 1817) |
We found that a large number of pub-sub keys were left with no content in them (the problem is worse when a wait ID is used in the key name).
The logic for deleting empty pub-sub keys from the GCS existed in legacy Ray but not in raylet.
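Not part of the original PR text: a minimal Java sketch of the cleanup rule, assuming a Jedis client and a hypothetical key name; the raylet change itself is in C++.

```java
import redis.clients.jedis.Jedis;

// Sketch of the cleanup rule: once a pub-sub key's contents have been
// consumed, delete the key rather than leaving an empty entry behind.
// The Jedis client and key naming are assumptions for illustration.
class PubsubKeyCleanup {
  static void deleteIfEmpty(Jedis redis, String pubsubKey) {
    // An empty list means every notification was consumed; drop the key
    // so empty keys (e.g. ones named by a wait ID) don't accumulate.
    if (redis.llen(pubsubKey) == 0) {
      redis.del(pubsubKey);
    }
  }
}
```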
## What do these changes do?
Support starting the Python `default_worker.py` from the raylet that the Java `RunManager` launches.
This way, when locally testing Java-calls-Python tasks, Python workers are started automatically.
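A rough Java sketch (not from the PR) of what auto-starting a Python worker could look like from the run manager; the script path and flag names are assumptions for illustration.

```java
import java.io.IOException;

// Illustrative sketch: spawn default_worker.py so Java-to-Python calls
// have a worker to land on. Script path and flags are assumptions.
class PythonWorkerLauncher {
  static Process start(String redisAddress, String storeSocket, String rayletSocket)
      throws IOException {
    ProcessBuilder pb = new ProcessBuilder(
        "python", "ray/workers/default_worker.py",
        "--redis-address=" + redisAddress,
        "--object-store-name=" + storeSocket,
        "--raylet-name=" + rayletSocket);
    pb.inheritIO();  // forward the worker's output to our stdout/stderr
    return pb.start();
  }
}
```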
## Related issue number
This fixes a problem that @devin-petersohn observed on the Windows Subsystem for Linux.
In theory, Redis should already be up by the time the async connect happens, so no retries should be needed for it. However, on the Windows Subsystem for Linux the async connect was failing even though the synchronous one worked; Windows may have different semantics here than Linux.
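The actual fix is in Ray's C++ Redis client; the Java sketch below just illustrates the retry-the-connect pattern, with all names and constants chosen for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative retry loop: attempt the connect several times with a short
// sleep between attempts instead of assuming the first attempt succeeds.
class ConnectWithRetry {
  static Socket connect(String host, int port, int attempts) throws IOException {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(host, port), /*timeoutMillis=*/1000);
        return socket;
      } catch (IOException e) {
        last = e;  // remember the failure and retry after a short pause
        try {
          Thread.sleep(100);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("interrupted while retrying connect", ie);
        }
      }
    }
    throw last != null ? last : new IOException("no connect attempts made");
  }
}
```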
## What do these changes do?
Remove `TaskExecutionException` and use `RayException` instead.
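A hedged sketch of what a call site looks like after this change; `Echo` is a hypothetical actor class, and the import paths reflect a best guess at the Java API of the time rather than confirmed package names.

```java
import org.ray.api.Ray;
import org.ray.api.RayActor;
import org.ray.api.RayObject;
import org.ray.api.exception.RayException;

// Sketch only: callers now catch the single RayException type instead of
// TaskExecutionException. Echo and the import paths are assumptions.
class ExceptionHandlingSketch {
  static void callAndHandle(RayActor<Echo> echo) {
    try {
      RayObject<Integer> result = Ray.call(Echo::echo, echo, 100);
      result.get();
    } catch (RayException e) {
      // Any failure during remote execution now surfaces here.
    }
  }
}
```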
## Related issue number
## What do these changes do?
Before this PR, if we wanted to specify resources, we had to write code like the following:
```java
@RayRemote(resources = {@ResourceItem(name = "CPU", value = 10)})
public static void f1() {
  // do something
}

@RayRemote(resources = {@ResourceItem(name = "CPU", value = 10)})
class Demo {
  // ...
}
```
Unfortunately, there was no way to create another actor or task with different resource requirements.
After this PR, it works like this:
```java
ActorCreationOptions option = new ActorCreationOptions();
option.resources.put("CPU", 4.0);
RayActor<Echo> echo1 = Ray.createActor(Echo::new, option);
option.resources.put("Res-A", 4.0);
RayActor<Echo> echo2 = Ray.createActor(Echo::new, option);
// If we don't specify resources, they will be {"cpu": 0.0} by default.
Ray.call(Echo::echo, echo2, 100);
```
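For completeness (not in the original PR), a minimal definition that the snippet above assumes for `Echo`; the exact annotation and types are illustrative.

```java
@RayRemote
public class Echo {
  public Integer echo(Integer value) {
    return value;
  }
}
```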
## Related issue number
N/A
## What do these changes do?
Fix the misleading code comments for:
- `EPISODES_THIS_ITER`
- `EPISODES_TOTAL`
I had noted this before and planned to fix it along with some other changes, but it seemed very relevant to #3058, so I'm sending it now.