This reverts commit 02f220b755.
## Why are these changes needed?
Looks like this commit makes `test_ray_shutdown` way more flaky. cc @mattip for further investigation after revert
<img width="760" alt="Screen Shot 2022-05-31 at 11 14 48 PM" src="https://user-images.githubusercontent.com/18510752/171339737-f48e6e90-391a-4235-bfac-a0aa0e563eb7.png">
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
When using Ray inside a virtualenv on Windows, the `python.exe` reported by `sys.executable` is a PEP 397 launcher for the actual Python process reported by `os.getpid()`:
>>> import sys, os, psutil
>>> print(sys.executable)
C:\temp\issue24361\Scripts\python.exe
>>> os.getpid()
2208
>>> child = psutil.Process(2208)
>>> child.cmdline()
['C:\\oss\\CPython38\\python.exe']
>>> child.parent().cmdline()
['C:\\temp\\issue24361\\Scripts\\python.exe']
>>> child.parent().pid
6424
When the agent_manager launches the agent process via `Process::Process()`, it gets the PID of the launcher process (6424), which is the ID it expects when the agent registers in the gRPC callback. But inside `agent.py`, the child process reports its PID via `os.getpid()`, which is 2208, so the agent registers with the wrong PID.
The solution proposed here is another version of #24905: create an `int agent_id = rand();` before starting the Python process and pass the `agent_id` to the process.
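A minimal Python sketch of the idea (the `--agent-id` flag and file names are placeholders, not the actual Ray internals): the parent chooses a random id before spawning the agent and passes it explicitly, so registration no longer depends on a PID that differs between the launcher and the real interpreter.

```python
import random
import subprocess
import sys

# Sketch only: in Ray the parent side lives in the C++ agent_manager and the
# child side in dashboard/agent.py. Generate the id *before* spawning.
agent_id = random.randint(0, 2**31 - 1)

# Pass the id on the command line instead of relying on the child's
# os.getpid(), which on Windows + virtualenv is the real interpreter's PID,
# not the PEP 397 launcher PID the parent observed.
proc = subprocess.Popen([sys.executable, "agent.py", f"--agent-id={agent_id}"])

# Inside agent.py (sketch): parse --agent-id and register with that value
# rather than with os.getpid().
```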
This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories.
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This is a follow-up PR to https://github.com/ray-project/ray/pull/24813 and https://github.com/ray-project/ray/pull/24628.
Unlike the change in the C++ layer, where resubscription is done by GCS broadcasting a request to the raylet/core_worker and the client side doing the resubscription, in the Python layer we detect the failure on the client side.
In case of a failure, the protocol is:
1. call subscribe
2. if resubscribing times out, throw an exception; this will crash the system. This is OK because if GCS has been down longer than expected, we expect the Ray cluster to be down as well.
3. once subscribe succeeds, continue to poll (a rough sketch follows this list).
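Below is that rough sketch, with hypothetical names (`subscriber.poll`, `subscriber.subscribe`, `subscriber.handle`) standing in for the real Python-layer subscriber internals; the timeout value is illustrative:

```python
import time

RESUBSCRIBE_TIMEOUT_S = 60  # illustrative value, not the real config


def run_subscriber(subscriber):
    """Sketch of the client-side failure handling described above."""
    while True:
        try:
            # Step 3: keep long-polling while the subscription is healthy.
            for message in subscriber.poll():
                subscriber.handle(message)
        except ConnectionError:
            # Step 1: the poll failed, so GCS looks down; resubscribe.
            deadline = time.monotonic() + RESUBSCRIBE_TIMEOUT_S
            while True:
                try:
                    subscriber.subscribe()
                    break  # resubscribed; go back to polling
                except ConnectionError:
                    if time.monotonic() > deadline:
                        # Step 2: GCS has been down longer than expected;
                        # crash instead of hanging silently.
                        raise
                    time.sleep(1)
```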
However, there is an extreme case where things might break: the client might fail to detect a failure. This can happen if a long-poll has returned and the Python layer is doing its own work, and GCS restarts and recovers before the client sends the next long-poll.
Here we are not going to take care of this case because:
1. usually GCS takes several seconds to come back up, and the Python layer's work is simply pushing data into a queue (sync version). The async version is only used in the dashboard, which is not a critical component.
2. pubsub in the Python layer is not doing critical work: it handles logs/errors for Ray jobs;
3. for the dashboard, it can just restart to fix the issue.
A known issue here is that we might miss logs in case of GCS failure due to the following reasons:
- the Python pubsub only does best-effort publishing. If it fails too many times, it skips publishing the message (messages are lost on the producer side)
- if a message is pushed to GCS but the worker hasn't finished resubscribing yet, the pushed message is lost (messages are lost on the consumer side)
We think it's reasonable and valid behavior given that the logs are not defined to be a critical component and we'd like to simplify the design of pubsub in GCS.
Another thing is `run_functions_on_all_workers`. We plan to stop using it within Ray core and deprecate it in the longer term. But it won't cause a problem for the current cases because:
1. It's only set in the driver, and we don't support creating a new driver when GCS is down.
2. When GCS is down, we don't support starting new Ray workers.
And `run_functions_on_all_workers` is only used when we initialize drivers/workers.
Packages are uploaded to the GCS for `runtime_env`. These packages are garbage collected when their refcount becomes zero.
The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded. It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job. It's likely the cause of https://github.com/ray-project/ray/issues/23423.
We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted.
The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s). This should be enough time for the job to start. When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed.
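A minimal sketch of that scheme, assuming hypothetical names (`PackageRefCounter`, `on_upload`) rather than the actual runtime_env packaging code; the 30s value mirrors the default mentioned above:

```python
import threading

URI_REFERENCE_TIMEOUT_S = 30  # illustrative default, matching the text above


class PackageRefCounter:
    """Sketch of server-side reference bookkeeping: an upload pins the
    package briefly so GC can't delete it before the job starts and
    takes its own, normal reference."""

    def __init__(self):
        self._refs = {}
        self._lock = threading.Lock()

    def incr(self, uri):
        with self._lock:
            self._refs[uri] = self._refs.get(uri, 0) + 1

    def decr(self, uri):
        with self._lock:
            self._refs[uri] -= 1
            if self._refs[uri] == 0:
                del self._refs[uri]
                self._delete_package(uri)

    def on_upload(self, uri):
        # Temporary reference taken at upload time; released automatically
        # after the timeout even if the driver is killed before the job starts.
        self.incr(uri)
        timer = threading.Timer(URI_REFERENCE_TIMEOUT_S, self.decr, args=(uri,))
        timer.daemon = True
        timer.start()

    def _delete_package(self, uri):
        pass  # garbage-collect the stored package (omitted in this sketch)
```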
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This fixes the mysterious error where every cluster env build fails when `pip uninstall` / `pip install` are written on two lines. The root cause will be fixed later.
This PR adds 2 more states to `TaskStatus`:
enum TaskStatus {
  // The task is scheduled properly and waiting for execution.
  // It includes time to deliver the task to the remote worker + queueing time
  // from the execution side.
  WAITING_FOR_EXECUTION = 5;
  // The task is running.
  RUNNING = 6;
}
This PR supports setting the jars for an actor in Ray API. The API looks like:
```java
class A {
  public boolean findClass(String className) {
    try {
      Class.forName(className);
    } catch (ClassNotFoundException e) {
      return false;
    }
    return true;
  }
}

RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
    .addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar"))
    .build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get();
System.out.println(ret); // true
```
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.
This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
https://github.com/ray-project/ray/pull/14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine.
However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage.
This PR adds a flag to enable displaying disk usage in those situations.
At the ServeHead level, it currently talks to the Serve API and the controller to do deployment and cleanup. With this PR, the deployment cleanup logic is hidden inside `server.run()` for code cleanliness and to make future refactoring easier.
This PR allows Ray to disable metrics collection. It was already possible with `RAY_enable_metrics_collection`, but that didn't fully disable collection because metrics collection from the agent wasn't properly disabled. This PR also adds tests.
Closes https://github.com/ray-project/ray/issues/24300
Adds a field to the job submission snapshot that matches the job name in the existing snapshot. Before this PR, the job submission name was camelcased because all snapshot keys are automatically camelcased. This PR allows jobs from the old job field to be linked to ones in the new job submission snapshot.
Show usage stats prompt when it's enabled.
The current UX is:
* The usage stats enabled or disabled message is shown every time in both terminal and dashboard.
* If users don't explicitly enable or disable usage stats, the first time they start a Ray cluster interactively they will be asked to confirm, and collection is enabled if there is no user action within 10s. If the session is non-interactive, collection is enabled by default without confirmation.
* ray.init() doesn't collect usage stats
* Usage stats can be disabled via three approaches: 1. RAY_USAGE_STATS_ENABLED env var, 2. ray xxx --disable-usage-stats, 3. ray disable-usage-stats
This PR implements the `ray list tasks` and `ray list objects` APIs.
NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files:
* `context.py`
* Intended to expose process-wide state to internal Serve code as well as `api.py`
* Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables
* `client.py`
* Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`
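A minimal sketch of what `context.py` might look like (only the two global names come from the description above; the helper functions and `ReplicaContext` fields are illustrative):

```python
# context.py (sketch): process-wide Serve state that internal code and
# api.py can both import without creating circular dependencies.
from dataclasses import dataclass

# Set inside each replica process so code can discover which deployment
# and replica it is running in.
_INTERNAL_REPLICA_CONTEXT = None

# Cached handle to the Serve controller (the ServeControllerClient
# defined in client.py).
_global_client = None


@dataclass
class ReplicaContext:
    """Illustrative shape only."""
    deployment: str
    replica_tag: str


def get_global_client():
    return _global_client


def set_global_client(client) -> None:
    global _global_client
    _global_client = client
```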
The test verifies that bytes 43~51 of the first line are "dashboard".
But due to a recent code addition to head.py, the line number in the log prefix grew from 2 digits to 3 digits.
Previously,
2022-04-18 23:23:56,946 INFO head.py:[less than 100] -- Dashboard head grpc address: 127.0.0.1:57208
Now
2022-04-18 23:23:56,946 INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208
So we should increase the byte range.
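A hedged sketch of the kind of check involved (the log path, window size, and names are illustrative, not the actual test code): widen the inspected window so the extra digit in `head.py:NNN` cannot push "dashboard" out of range.

```python
# Illustrative only: read the first line of the log and search a wider
# window for "dashboard" instead of pinning it to exact bytes 43~51.
with open("dashboard_head.log", "rb") as f:  # hypothetical log path
    first_line = f.readline()

# A 2-digit vs. 3-digit line number in "head.py:<line>" shifts the rest of
# the line by one byte, so the check must tolerate either width.
assert b"dashboard" in first_line[:120].lower()
```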
* Provide a utility to ping a Ray cluster and verify that it has the same Ray version. This is useful for checking whether a Ray cluster is available at a given address without connecting to it via the more heavyweight ray.init(). This utility is integrated with `ray memory` to provide a better error message when the Ray cluster is unavailable. There also seems to be user demand for exposing this as an API.
* Improve the error message when the address provided to Ray does not contain a port.
To avoid this error:
(raylet) Traceback (most recent call last):
(raylet) File "/home/iamhatesz/.pyenv/versions/alan-brain-py3.9/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet) gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
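The `TypeError` suggests the constructor only accepts keyword arguments; the snippet below is a hedged illustration of the mismatch and the keyword-argument call that avoids it (an assumption about the shape of the fix, not the actual change):

```python
# Illustrative stand-in, not Ray's real class: a keyword-only constructor
# raises exactly this TypeError when called positionally.
class GcsPublisher:
    def __init__(self, *, address=None):
        self.address = address


gcs_address = "127.0.0.1:6379"  # hypothetical address

# GcsPublisher(gcs_address)  # TypeError: __init__() takes 1 positional argument but 2 were given
publisher = GcsPublisher(address=gcs_address)  # keyword form avoids the error
```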
This is a rebased version of #11592. Since task spec info is only needed when GCS creates or starts an actor, we can remove it from the actor table and save serialization time and memory/network cost when GCS clients get actor info from GCS.
Since the internal repository differs a lot from the community one, this PR just adds some manual checks with a simple cherry-pick. Comments are welcome; in the meantime I'll see whether any test cases fail or any points were missed.
The `Application` class is stored in `api.py`. The object is relatively standalone and is used as a dependency in other classes, so this change moves `Application` (and `ImmutableDeploymentDict`) to a new file, `application.py`.