24 commits

Author SHA1 Message Date
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695)
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
Edward Oakes
65d21b7ae6
[job submission] Handle env_vars: None case properly in supervisor runtime_env logic (#25087)
2022-05-24 11:01:19 -05:00
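The `env_vars: None` case named in the title above can be illustrated with a minimal sketch. It assumes the runtime environment is a plain dict; the helper name `get_env_vars` is hypothetical and is not taken from the Ray source.

```python
from typing import Any, Dict, Optional


def get_env_vars(runtime_env: Optional[Dict[str, Any]]) -> Dict[str, str]:
    """Return a copy of env_vars, treating a missing or None value as empty.

    Hypothetical helper for illustration only.
    """
    if not runtime_env:
        return {}
    # dict.get("env_vars", {}) alone is not enough: when the key is present
    # with an explicit None value, .get() returns None, not the default.
    env_vars = runtime_env.get("env_vars") or {}
    return dict(env_vars)
```

The `or {}` fallback is what distinguishes "key present but None" from "key absent", which a default argument to `.get()` does not cover.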
Edward Oakes
4c1f27118a
[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546)
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.

This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
2022-05-10 11:43:04 -05:00
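One way to get the effect described in the commit above is to build the driver's environment without `CUDA_VISIBLE_DEVICES`. This is only a sketch of the idea, not necessarily the mechanism the commit uses; `driver_env` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional


def driver_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build an environment for a driver subprocess without CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Remove the variable instead of setting it to "": an empty value
    # hides every GPU from CUDA, which is the problem being worked around.
    env.pop("CUDA_VISIBLE_DEVICES", None)
    return env
```

The returned dict could then be passed as the `env` argument of `subprocess.Popen` when launching the entrypoint command.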
Archit Kulkarni
27e7c284ee
[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123)
In the snapshot, all timestamps are given in ms except for Jobs:

```
wget -q -O - http://127.0.0.1:8265/api/snapshot

{
   "result":true,
   "msg":"hello",
   "data":{
      "snapshot":{
         "jobs":{
            "01000000":{
               "status":null,
               "statusMessage":null,
               "isDead":false,
               "startTime":1650315791249,
               "endTime":0,
               "config":{
                  "namespace":"_ray_internal_dashboard",
                  "metadata":{
                     
                  },
                  "runtimeEnv":{
                     
                  }
               }
            }
         },
         "jobSubmission":{
            "raysubmit9Bsej1Rtxqqetxup":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650315925,
               "endTime":1650315926,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"ls"
            },
            "raysubmitEibragqkyg16Hpcj":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316039,
               "endTime":1650316041,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            },
            "raysubmitSh1U7Grdsbqrf6Je":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316354,
               "endTime":1650316355,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            }
         },
         "actors":{
            "8c8e28e642ba2cfd0457d45e01000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650315926620,
               "endTime":1650315927499,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10003,
               "metadata":{
                  
               }
            },
            "a7fd8354567129910c44298401000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316355718,
               "endTime":1650316356620,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10005,
               "metadata":{
                  
               }
            },
            "19ca9ad190f47bae963592d601000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316041089,
               "endTime":1650316041978,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10004,
               "metadata":{
                  
               }
            }
         },
         "deployments":{
            
         },
         "sessionName":"session_2022-04-18_13-49-44_814862_139",
         "rayVersion":"1.12.0",
         "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d"
      }
   }
}
```

This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.
2022-04-26 08:37:41 -07:00
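In the snapshot above, the `jobSubmission` entries carry 10-digit second timestamps (for example `1650315925`) while the other entries carry 13-digit millisecond timestamps. A minimal sketch of the conversion follows; the function names are hypothetical and not taken from the Ray source.

```python
import time


def seconds_to_ms(ts_seconds: float) -> int:
    """Convert a Unix timestamp in seconds to whole milliseconds."""
    return int(ts_seconds * 1000)


def now_ms() -> int:
    """Current Unix time in milliseconds, for recording start/end times."""
    return seconds_to_ms(time.time())
```

Applied to the first submitted job in the snapshot, `seconds_to_ms(1650315925)` gives `1650315925000`, which has the same magnitude as the actor `startTime` values.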
Archit Kulkarni
77090144a2
[jobs] Add entrypoint field to JobInfo (#23253)
2022-03-16 22:02:22 -05:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Edward Oakes
58e5f0140d
[jobs] Rename JobData -> JobInfo (#22499)
`JobData` could be confused with the actual output data of a job; `JobInfo` makes it clearer that this is status information plus metadata.
2022-02-22 16:18:16 -06:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint, which can be any shell command. As such, a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
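The prefix-stripping step mentioned for `get_all_jobs` can be sketched as follows. The prefix value and the function name are hypothetical; the actual key layout in the KV store is not shown in this log.

```python
from typing import Iterable, List

# Hypothetical prefix, chosen only for this illustration.
JOB_KEY_PREFIX = "example_job_prefix_"


def job_ids_from_keys(keys: Iterable[str], prefix: str = JOB_KEY_PREFIX) -> List[str]:
    """Keep only keys that start with the prefix and return them without it."""
    return [key[len(prefix):] for key in keys if key.startswith(prefix)]
```

Filtering on the prefix before slicing keeps unrelated keys in the same namespace from being returned as job IDs.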
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Edward Oakes
8806b2d5c4
[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180)
2022-02-07 15:25:25 -06:00
Jiao
a692e7d05e
[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938)
As titled. There is a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an http address but restarted the local Ray cluster. In this case we would try to do job submission with an http-prefixed address.

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2022-02-04 17:51:43 -06:00
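The corner case above hinges on telling an http-prefixed address apart from a plain `host:port` one. A minimal sketch of that check follows; the helper name is hypothetical, and the commit's actual handling is not shown in this log.

```python
def has_http_prefix(address: str) -> bool:
    """Return True if the address carries an http:// or https:// scheme.

    Hypothetical helper for illustration only.
    """
    # str.startswith accepts a tuple of candidate prefixes.
    return address.lower().startswith(("http://", "https://"))
```

For example, `has_http_prefix("http://127.0.0.1:8265")` is true, while `has_http_prefix("127.0.0.1:6379")` is false.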
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Archit Kulkarni
f058a1d342
[Jobs] Stream logs during job instead of only at the end (#21659)
Closes https://github.com/ray-project/ray/issues/21517
2022-01-20 15:21:07 -06:00
mwtian
8cc268096c
[GCS][Bootstrap 3/n] Refactor to support GCS bootstrap (#21295)
This PR refactors several components to support switching to GCS address bootstrapping later:
- Treat the address from `ray.init()` and the `ray` CLI as the bootstrap address instead of assuming it is the Redis address.
- Ray client servers support the `--address` flag instead of `--redis-address`.
- A few other miscellaneous cleanups.

Also, add a test for starting non-head node with `ray start`.
2022-01-03 23:52:12 -08:00
mwtian
20ca1d85c2
[GCS][Bootstrap 2/n] Fix tests to enable using GCS address for bootstrapping (#21288)
This PR contains most of the fixes @iycheng made in #21232, to make tests pass with GCS bootstrapping by supporting both Redis and GCS addresses as the bootstrap address. The main change is to use `address_info["address"]` to obtain the bootstrap address to pass to `ray.init()`, instead of using `address_info["redis_address"]`. In a subsequent PR, `address_info["address"]` will return the Redis or GCS address depending on whether GCS is used to bootstrap.
2021-12-29 19:25:51 -07:00
mwtian
06ec07057c
Revert "[Core] Unrevert #21115, fix auto address env (#21158)" (#21189)
This reverts commit 968f08607b.

It is breaking e2e tests where worker nodes cannot start, e.g.:

```
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start
    address_ip, password=redis_password)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client
    _, redis_ip_address, redis_port = validate_bootstrap_address(redis_address)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address
    raise ValueError("Malformed address. Expected '<host>:<port>'.")
ValueError: Malformed address. Expected '<host>:<port>'.
```
2021-12-20 00:22:12 -08:00
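The traceback above ends in a `'<host>:<port>'` format check. The sketch below shows what such a check might look like; it is not Ray's `validate_bootstrap_address`, and the name `parse_host_port` is hypothetical.

```python
from typing import Tuple


def parse_host_port(address: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port), rejecting anything else.

    Hypothetical helper for illustration only.
    """
    # rpartition splits on the last colon; sep is "" when no colon exists.
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("Malformed address. Expected '<host>:<port>'.")
    return host, int(port)
```

A value with no port, such as the literal string `auto`, fails this kind of check, which matches the shape of the error in the traceback.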
Clark Zinzow
968f08607b
[Core] Unrevert #21115, fix auto address env (#21158)
This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2021-12-18 07:45:00 -08:00
Chen Shen
d99f699e3d
Revert "[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)" (#21157)
This reverts commit 0e7c0b491b.
2021-12-17 11:48:40 -08:00
mwtian
0e7c0b491b
[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)
This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping.

Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change.

Also made some cleanups.
2021-12-16 15:11:05 -08:00
Jiao
ed34434131
[Jobs] Add log streaming for jobs (#20976)
The current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration.

Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-12-14 17:01:53 -08:00
Edward Oakes
d26c9e67e8
[job submission] Add a message to the JobStatus to return more detailed errors (#20491)
2021-11-18 10:15:23 -06:00
Edward Oakes
eae523159f
[job submission] Prefix job ID with raysubmit_ and pass job_name metadata (#20490)
2021-11-17 21:48:22 -06:00
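A minimal sketch of generating an ID with the `raysubmit_` prefix follows. The 16-character alphanumeric suffix is an assumption inferred from sample IDs such as `raysubmit_9BSeJ1rTXQqEtXuP` elsewhere in this log; the function name is hypothetical and this is not the Ray implementation.

```python
import random
import string


def generate_submission_id(length: int = 16) -> str:
    """Return 'raysubmit_' followed by random letters and digits.

    Hypothetical helper; suffix length and alphabet are assumptions.
    """
    rng = random.SystemRandom()  # OS-backed randomness, no seeding needed
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(rng.choices(alphabet, k=length))
    return f"raysubmit_{suffix}"
```

A fixed, recognizable prefix makes submitted-job IDs easy to tell apart from the hex driver job IDs such as `01000000`.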
Edward Oakes
48bc1af2da
[job submission] Remove DOES_NOT_EXIST status (#20354)
2021-11-15 16:57:32 -08:00
Edward Oakes
81f036d078
[job submission] Move job_manager to dashboard module, common parts to common.py (#20209)
2021-11-10 14:14:55 -08:00
Renamed from python/ray/_private/job_manager/job_manager.py