Commit graph

379 commits

Author SHA1 Message Date
Edward Oakes
cb7bcbd651
[job submission] Fix address defaulting behavior (#24970)
Per the discussion in https://github.com/ray-project/ray/issues/24858:

- If an address without a port is provided, don't append a port.
- Default to `http://localhost:8265` if nothing is provided.
2022-05-20 14:10:36 -05:00
SangBin Cho
b9c30529d8
[Core/Observability 1/N] Add a "running" state to task status (#24651)
This PR adds 2 more states into TaskStatus

enum TaskStatus {
  // The task is scheduled properly and waiting for execution.
  // It includes time to deliver the task to the remote worker + queueing time
  // from the execution side.
  WAITING_FOR_EXECUTION = 5;
  // The task that is running.
  RUNNING = 6;
}
2022-05-16 05:39:05 -07:00
Jiajun Yao
628f886af4
Don't show usage stats prompt in dashboard if prompt is disabled (#24700) 2022-05-12 07:55:28 -07:00
Qing Wang
259661042c
[runtime env] [java] Support jars in runtime env for Java (#24170)
This PR supports setting the jars for an actor in Ray API. The API looks like:
```java
class A {
    public boolean findClass(String className) {
      try {
        Class.forName(className);
      } catch (ClassNotFoundException e) {
        return false;
      }
      return true;
    }
}

RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
    .addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar"))
    .build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get();
System.out.println(ret); // true
```
2022-05-12 09:34:40 +08:00
Jiajun Yao
1daad65568
[Doc] Add doc for usage stats collection (#24522) 2022-05-10 17:18:49 -07:00
Edward Oakes
4c1f27118a
[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546)
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.

This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
2022-05-10 11:43:04 -05:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
Dmitri Gekhtman
6d09244a7e
[Dashboard][K8s] Add toggle to enable showing node disk usage on K8s (#24416)
https://github.com/ray-project/ray/pull/14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine.

However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage.

This PR adds a flag to enable displaying disk usage in those situations.
2022-05-03 10:58:05 -05:00
SangBin Cho
2bce07d4ce
[State API] List runtime env API (#24126)
This PR supports list runtime env API
2022-05-02 14:01:00 -07:00
Sihan Wang
59debac670
[Serve] Move deployment clean up under serve.run() api (#24306)
On the ServeHead level, it is talking to serve api and controller to do deployment and clean up now. With this pr, it hides the  deployment clean up logic into server.run() for code cleanness and easy to refactor in the future.
2022-05-02 12:10:11 -05:00
SangBin Cho
6f192b6e17
[Metrics] Allow to completely disable metrics collection (#24333)
This PR allows for Ray to disable metrics collection. It was possible with RAY_enable_metrics_collection, but it didn't fully disable collection because there was a metrics collection happening from agent that wasn't properly disabled. This PR also adds tests.
2022-05-02 05:33:03 -07:00
Philipp Moritz
27917f570d
[runtime_env] Extend runtime_env hook to also cover jobs (#24328)
This extends https://github.com/ray-project/ray/pull/24036 to also cover job submission.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-04-30 09:15:51 -07:00
Archit Kulkarni
1b67e6a8ae
[Jobs] [Dashboard] Add job submission id as field to job snapshot (#24303)
Closes https://github.com/ray-project/ray/issues/24300

Adds a field to the job submission snapshot that matches the job name in the existing snapshot.  Before this PR, the job submission name was camelcased because all snapshot keys are automatically camelcased.  This PR allows jobs from the old job field to be linked to ones in the new job submission snapshot.
2022-04-29 10:10:24 -05:00
Jiajun Yao
8fdde12e9e
Delay 1 minutes for the first usage stats report (#24291)
Delay the first report for 1 minutes so the system is probably set up and we can get the information to report.
2022-04-28 22:53:33 -07:00
Archit Kulkarni
cc864401fb
[Dashboard] Add environment variable flag to skip dashboard log processing (#24263) 2022-04-27 15:33:08 -07:00
Archit Kulkarni
12b9383d52
[Jobs] Reenable test_backwards_compatibility using Ray 1.12 (#24124)
Closes https://github.com/ray-project/ray/issues/23258
2022-04-26 13:53:51 -05:00
Archit Kulkarni
27e7c284ee
[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123)
In the snapshot, all timestamps are given in ms except for Jobs:

```
wget -q -O - http://127.0.0.1:8265/api/snapshot

{
   "result":true,
   "msg":"hello",
   "data":{
      "snapshot":{
         "jobs":{
            "01000000":{
               "status":null,
               "statusMessage":null,
               "isDead":false,
               "startTime":1650315791249,
               "endTime":0,
               "config":{
                  "namespace":"_ray_internal_dashboard",
                  "metadata":{
                     
                  },
                  "runtimeEnv":{
                     
                  }
               }
            }
         },
         "jobSubmission":{
            "raysubmit9Bsej1Rtxqqetxup":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650315925,
               "endTime":1650315926,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"ls"
            },
            "raysubmitEibragqkyg16Hpcj":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316039,
               "endTime":1650316041,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            },
            "raysubmitSh1U7Grdsbqrf6Je":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316354,
               "endTime":1650316355,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            }
         },
         "actors":{
            "8c8e28e642ba2cfd0457d45e01000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650315926620,
               "endTime":1650315927499,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10003,
               "metadata":{
                  
               }
            },
            "a7fd8354567129910c44298401000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316355718,
               "endTime":1650316356620,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10005,
               "metadata":{
                  
               }
            },
            "19ca9ad190f47bae963592d601000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316041089,
               "endTime":1650316041978,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10004,
               "metadata":{
                  
               }
            }
         },
         "deployments":{
            
         },
         "sessionName":"session_2022-04-18_13-49-44_814862_139",
         "rayVersion":"1.12.0",
         "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d"
      }
   }
}
```

 This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.
2022-04-26 08:37:41 -07:00
Jiajun Yao
3fb63847e2
Show usage stats prompt (#23822)
Show usage stats prompt when it's enabled.

Current UX are:

* The usage stats enabled or disabled message is shown every time in both terminal and dashboard.
* If users don't explicitly enable or disable usage stats, the first time they start a ray cluster interactively, they will be asked to confirm and will enable if no user action within 10s. If it's non-interactive, collection is enabled by default without confirmation.
* ray.init() doesn't collect usage stats
* Usage stats can be disabled via three approaches: 1. RAY_USAGE_STATS_ENABLED env var, 2. ray xxx --disable-usage-stats, 3. ray disable-usage-stats
2022-04-25 16:01:24 -07:00
SangBin Cho
73ed67e9e6
[State API] State api limit + Removing unnecessary modules (#24098)
This PR does

Move all routes into the same module, state_head.py
Support a limit feature.
2022-04-22 15:59:46 -07:00
SangBin Cho
30ab5458a7
[State Observability] Tasks and Objects API (#23912)
This PR implements ray list tasks and ray list objects APIs.

NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
2022-04-21 18:45:03 -07:00
shrekris-anyscale
b51d0aa8b1
[serve] Introduce context.py and client.py (#24067)
Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files:

* `context.py`
    * Intended to expose process-wide state to internal Serve code as well as `api.py`
    * Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables
* `client.py`
    * Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`
2022-04-21 18:35:09 -05:00
jon-chuang
ddcc252b51
[Core] Ray logs API (1/n) (#23435)
Expose HTTP endpoint to retrieve logs from ray cluster
2022-04-20 23:11:02 -07:00
Chu Xiangyang
6f74040b15
[Job] Fix typo in job sdk docstring (#23940) 2022-04-20 12:30:32 -05:00
SangBin Cho
082baa2342
[Test] Fix test_log (#24004)
The test verifies the first line 43~51 bytes are "dashboard"

But due to recent code addition to head.py, the line where logs are written became 2 digits -> 3 digits

Previously,
2022-04-18 23:23:56,946	INFO head.py:[less than 100] -- Dashboard head grpc address: 127.0.0.1:57208
 
Now
2022-04-18 23:23:56,946	INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208
So we should increase the bytes range.
2022-04-19 04:59:30 -07:00
SangBin Cho
1c3329fa38
Revert "Revert "[State Observability] Basic functionality for central… (#23933)
…ized data (#23744)" (#23918)"

This reverts commit fb14e82.
2022-04-18 21:15:43 -07:00
shrekris-anyscale
6151b75d9d
[serve] Move schema helpers out of api.py (#23934) 2022-04-18 12:25:21 -05:00
mwtian
d5d2ef4249
[Core] Add a utility to check GCS / Ray cluster health (#23382)
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain port.
2022-04-18 09:58:45 -07:00
Amog Kamsetty
fb14e82242
Revert "[State Observability] Basic functionality for centralized data (#23744)" (#23918)
This reverts commit 51a4a1a802.

breaking tune multinode tests and kuberay:test_autoscaling_e2e
2022-04-14 14:28:42 -07:00
Tomasz Wrona
46e0162441
GcsPublisher is being constructed with unsupported position argument
To avoid this error:

(raylet) Traceback (most recent call last):
(raylet)   File "/home/iamhatesz/.pyenv/versions/alan-brain-py3.9/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
2022-04-14 10:47:57 -07:00
SangBin Cho
51a4a1a802
[State Observability] Basic functionality for centralized data (#23744)
Support listing actor/pg/job/node/workers

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.
2022-04-14 07:33:18 -07:00
Tao Wang
6aefe9b36e
[Core]Save task spec in separate table (#22650)
This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs.

As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.
2022-04-12 12:24:26 -07:00
Amog Kamsetty
1d11963618
[Dashboard] Specify @types/react resolution (#23794)
A new @types/react release has broken the dashboard build. Make sure to specify the older version under package resolutions.
2022-04-07 17:24:19 -07:00
shrekris-anyscale
a6bcb6cd1e
[serve] Create application.py (#23759)
The `Application` class is stored in `api.py`. The object is relatively standalone and is used as a dependency in other classes, so this change moves `Application` (and `ImmutableDeploymentDict`) to a new file, `application.py`.
2022-04-07 10:34:24 -05:00
Stephanie Wang
b43426bc33
[core] Add metrics for disk and network I/O (#23546)
Adds some metrics useful for object-intensive workloads:

    Per raylet/object manager:
        Add num bytes pending restore to spill manager
        Add num requests cumulative to PullManager
        Num bytes pushed/pulled from other nodes cumulative
        Histogram for request latencies in PullManager:
            total life time of request, from start to cancel
            request satisfaction time, from start to object local
            pull time, from object activation to object local
    Per-node disk read/write speed, IOPS
2022-04-01 11:15:34 -07:00
mwtian
1a4c3c07f7
[Dashboard] fix iterating over GPU processes (#23562)
Current logic looks broken, as reported in #22954 (comment)

I fixed the logic as best as I can, and tested it on Anyscale platform with GPU. No process info was reported from gpustat. But the logic works under this case.
2022-03-31 17:16:53 -07:00
Matti Picus
77c4c1e48e
WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449) 2022-03-29 00:56:42 -07:00
mwtian
c2404cce62
avoid adding gpu utilization when unavailable (#23468)
From #22954, GPU utilization can be unavailable for consumer hardware. So dashboard should not assume the value cannot be None.

There might be a better way to represent "not reported". But currently utilizations are summed up which makes using non-zero to represent "not reported" hard to do.
2022-03-24 22:48:17 -07:00
Max Pumperla
60054995e6
[docs] fix doctests and activate CI (#23418) 2022-03-24 17:04:02 -07:00
mwtian
51feac9868
Clean up dev docs (#23407) 2022-03-22 23:22:56 -07:00
shrekris-anyscale
b00977b1b1
[serve] Remove dashboard's dependency on Serve (#23389) 2022-03-21 22:14:41 -07:00
Guyang Song
69af9764b2
[runtime env] URI reference refactor (#22828)
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528

Future works
- We don't remove the `RuntimeEnvUris` from `RuntimeEnv` protobuf in current PR because gcs also uses those URIs to do GC by runtime_env_manager. We should also clear this.
- Ray client server shouldn't interact with agent directly. Or Ray client server should also decrease the reference count.
- Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job. So we should make sure this function is idempotent. Can we change this logic and make this function be called only once?
2022-03-21 11:21:15 -05:00
shrekris-anyscale
75b7465ba4
[serve] Reject Ray client addresses when submitting via Dashboard (#23339)
Some commands in the Serve CLI use Ray client and some commands ping the Ray dashboard; however, all commands read `RAY_ADDRESS` to get the address. This change raises a nice exception if the user accidentally passes a Ray client address as the Ray Dashboard address.
2022-03-21 11:17:51 -05:00
shrekris-anyscale
c668039020
[serve] Restore "Get new handle to controller if killed" (#23283) (#23338)
#23336 reverted #23283. #23283 did pass CI before merging. However, when it merged, it began to fail because it used commands that were outdated on the Master branch in `test_cli.py` (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its tests commands.
2022-03-18 18:40:08 -05:00
shrekris-anyscale
87e77bebb4
Revert "[serve] Get new handle to controller if killed (#23283)" (#23336)
This reverts commit 9f6d96a2fd.
2022-03-18 13:47:57 -05:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Support set timeout second for timeout of runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
shrekris-anyscale
9f6d96a2fd
[serve] Get new handle to controller if killed (#23283)
`serve shutdown` is not idempotent with the new Serve CLI. When serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, which fail.

This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.
2022-03-18 11:47:18 -05:00
shrekris-anyscale
aaf47b2493
[serve] Implement serve.run() and Application (#23157)
These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 11:12:09 -05:00
Jiajun Yao
62a5404369
Collect more usage stats data (#23167) 2022-03-17 19:33:27 -07:00
Archit Kulkarni
77090144a2
[jobs] Add entrypoint field to JobInfo (#23253) 2022-03-16 22:02:22 -05:00