11349 commits

Author SHA1 Message Date
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Max Pumperla
9482f03134
[docs] RLlib concepts consolidation, user guide, RL conf prep (#22496) 2022-02-18 09:35:20 -08:00
Jun Gong
04effca29c
[RLlib; docs] Update README.rst to fix the broken RLlib logo (#22489) 2022-02-18 18:33:07 +01:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Archit Kulkarni
1f160114a0
[serve] [CI] change serve:test_runtime_env from medium to large (#22474)
This test was timing out occasionally.
2022-02-18 08:50:47 -06:00
ZhuSenlin
3341fae573
[Core] remove unused method GcsResourceManager::UpdateResourceCapacity (#22462)
`GcsResourceManager::UpdateResourceCapacity` modifies `cluster_scheduling_resources_`, but the method is only used in C++ unit tests, which is confusing when reading the code. Since it can be completely replaced by `GcsResourceManager::OnNodeAdd`, this PR removes it.

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-02-18 13:35:47 +08:00
Archit Kulkarni
df85d31095
[Serve] Make handle serializable (#22473) 2022-02-17 17:29:44 -08:00
ZhuSenlin
15cccd0286
[Core] Fix null pointer crash when GcsResourceManager::SetAvailableResources (#22459)
* fix null pointer crash when GcsResourceManager::SetAvailableResources

* add warning log when node does not exist

* add unit test

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-02-17 17:18:30 -08:00
Simon Mo
3e7511e84f
[CI] Disable privileged test (#22484) 2022-02-17 15:34:02 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479 2022-02-17 15:20:39 -08:00
Ian Rodney
c9a4b17f99
[YAMLs] Fix comments about autoscaler round-robining (#22002) 2022-02-17 13:59:05 -08:00
Sven Mika
c58cd90619
[RLlib] Enable Bandits to work in batches mode(s) (vector envs + multiple workers + train_batch_sizes > 1). (#22465) 2022-02-17 22:32:26 +01:00
SangBin Cho
4ecb2afc2c
[State] Add pid to the actor table data. (#22434)
Users have requested a way to get the pid of actors through ray.state.actors. This PR adds it.
2022-02-17 06:22:29 -08:00
Avnish Narayan
740def0a13
[RLlib] Put env-checker on critical path. (#22191) 2022-02-17 14:06:14 +01:00
Sven Mika
e03606f0b3
[RLlib] Bandit documentation enhancements. (#22427) 2022-02-17 13:25:50 +01:00
Chen Shen
ab53848dfc
[refactor cluster-task-manage 4/n] refactor cluster_task_manager into distributed and local part (#21660)
This is a work-in-progress PR that splits cluster_task_manager into local and distributed parts.

For the distributed scheduler (cluster_task_manager_):
/// Schedules a task onto one node of the cluster. The logic is as follows:
/// 1. Queue tasks for scheduling.
/// 2. Pick a node on the cluster which has the available resources to run a
/// task.
/// * Step 2 should occur any time the state of the cluster is
/// changed, or a new task is queued.
/// 3. For tasks that are infeasible, put them into the infeasible queue and
/// report them to the GCS, where the autoscaler will be notified and start
/// a new node to accommodate the requirement.

For the local task manager:

/// It manages the lifetime of a task on the local node. It receives requests from
/// cluster_task_manager (the distributed scheduler) and does the following
/// steps:
/// 1. Pull task dependencies and add the task to the to_dispatch queue.
/// 2. Once task's dependencies are all pulled locally, the task becomes ready
/// to dispatch.
/// 3. For all tasks that are dispatch-ready, we schedule them by acquiring
/// local resources (including pinning the objects in memory and deducting
/// cpu/gpu and other resources from the local resource manager), and start
/// a worker.
/// 4. If a task fails to acquire resources in step 3, we try to
/// spill it to a different remote node.
/// 5. When a worker finishes executing its task(s), the requester will return
/// it and we should release the resources in our view of the node's state.
/// 6. If a task has been waiting for arguments for too long, it will also be
/// spilled back to a different node.
///
2022-02-17 01:14:33 -08:00
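The split described in ab53848dfc can be illustrated with a toy model. This is a minimal Python sketch, not the C++ implementation: the class names `ToyClusterScheduler` and `ToyLocalTaskManager`, the `Task` fields, and the CPU-only resource model are all assumptions made for illustration. Dependency pulling is reduced to a boolean flag, and argument-wait timeouts are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: int
    deps_ready: bool = True  # hypothetical: are all arguments local already?


class ToyLocalTaskManager:
    """Local part: owns one node's resources and the lifetime of its tasks."""

    def __init__(self, node_name, total_cpus):
        self.node_name = node_name
        self.total_cpus = total_cpus
        self.free_cpus = total_cpus
        self.waiting = []  # tasks whose dependencies are still being pulled
        self.running = []

    def submit(self, task):
        self.waiting.append(task)

    def dispatch(self):
        """Start every dispatch-ready task; return the ones to spill."""
        spilled, still_waiting = [], []
        for task in self.waiting:
            if not task.deps_ready:
                still_waiting.append(task)
            elif task.cpus <= self.free_cpus:
                self.free_cpus -= task.cpus  # acquire local resources
                self.running.append(task)
            else:
                spilled.append(task)  # could not acquire resources here
        self.waiting = still_waiting
        return spilled

    def finish(self, task):
        self.running.remove(task)
        self.free_cpus += task.cpus  # release resources when the worker returns


class ToyClusterScheduler:
    """Distributed part: picks a node or parks the task as infeasible."""

    def __init__(self, managers):
        self.managers = managers
        self.to_schedule = deque()
        self.infeasible = []

    def queue(self, task):
        self.to_schedule.append(task)

    def schedule(self):
        """Call whenever cluster state changes or a new task is queued."""
        placements, pending = [], deque()
        while self.to_schedule:
            task = self.to_schedule.popleft()
            if all(task.cpus > m.total_cpus for m in self.managers):
                self.infeasible.append(task)  # the real code reports this to the GCS
                continue
            target = next((m for m in self.managers if task.cpus <= m.free_cpus), None)
            if target is None:
                pending.append(task)  # feasible, but no node has capacity right now
                continue
            target.submit(task)
            spilled = target.dispatch()
            pending.extend(spilled)  # spilled tasks are rescheduled on the next pass
            if task not in spilled:
                placements.append((task.name, target.node_name))
        self.to_schedule = pending
        return placements


node_a = ToyLocalTaskManager("a", total_cpus=2)
node_b = ToyLocalTaskManager("b", total_cpus=4)
scheduler = ToyClusterScheduler([node_a, node_b])
for t in (Task("t1", 2), Task("t2", 4), Task("t3", 8), Task("t4", 2)):
    scheduler.queue(t)
first_round = scheduler.schedule()
```

In this toy run, `t3` needs more CPUs than any node has and lands in the infeasible list, while `t4` stays queued until a node frees capacity.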
Eric Liang
786c5759de
[data] Stage fusion optimizations, off by default (#22373)
This PR adds the following stage fusion optimizations (off by default). In a later PR, I plan to enable this by default for DatasetPipelines.
- Stage fusion: Whether to fuse compatible OneToOne stages.
- Read stage fusion: Whether to fuse read stages into downstream OneToOne stages. This is accomplished by rewriting the read stage (LazyBlockList) into a transformation over a collection of read tasks (BlockList -> MapBatches(do_read)).
- Shuffle stage fusion: Whether to fuse compatible OneToOne stages into shuffle stages that support specifying a map-side block UDF.

Stages are considered compatible if their compute strategy is the same ("tasks" vs "actors"), and they have the same Ray remote args. Currently, the PR is ignoring the remote args of read tasks, but this will be fixed as a followup (I didn't want to change the read tasks default here).
2022-02-16 21:08:27 -08:00
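The compatibility rule in 786c5759de (same compute strategy and same remote args) can be sketched as follows. This is an illustrative Python model only: `Stage`, `compatible`, and `fuse_one_to_one` are hypothetical names, blocks are plain lists, and read-stage and shuffle-stage fusion are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    fn: Callable[[list], list]  # block -> block
    compute: str = "tasks"  # "tasks" or "actors"
    remote_args: Dict[str, object] = field(default_factory=dict)


def compatible(a: Stage, b: Stage) -> bool:
    """Stages may fuse only if compute strategy and remote args match."""
    return a.compute == b.compute and a.remote_args == b.remote_args


def fuse_one_to_one(stages: List[Stage]) -> List[Stage]:
    """Greedily merge each stage into its predecessor when compatible."""
    fused: List[Stage] = []
    for stage in stages:
        if fused and compatible(fused[-1], stage):
            prev = fused.pop()

            # Bind both functions as defaults so each fused stage keeps its own pair.
            def composed(block, first=prev.fn, second=stage.fn):
                return second(first(block))

            fused.append(
                Stage(f"{prev.name}->{stage.name}", composed, prev.compute, prev.remote_args)
            )
        else:
            fused.append(stage)
    return fused


pipeline = [
    Stage("add_one", lambda block: [x + 1 for x in block]),
    Stage("keep_even", lambda block: [x for x in block if x % 2 == 0]),
    Stage("times_ten", lambda block: [x * 10 for x in block], compute="actors"),
]
fused_pipeline = fuse_one_to_one(pipeline)
```

Here the first two stages collapse into one because both use "tasks" with identical remote args, while the "actors" stage stays separate.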
Yi Cheng
e10a2fbcf9
[workflow] Move test_basic_workflows_2.py to large test (#22416)
test_basic_workflows_2.py times out. Move it to the large test suite.
2022-02-16 17:05:02 -08:00
mwtian
05dd72101b
[Release 1.11.0] Release logs for 1.11.0rc1 (#22443)
This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.
2022-02-16 17:03:49 -08:00
Yi Cheng
83257a4193
Revert "[Client] chunked get requests" (#22455)
Reverts ray-project/ray#22100

linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.
2022-02-16 16:43:43 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression #22441
We added a warmup time in timeit, which affects the pg benchmark time accounting. This PR adds an option to disable the warmup.
2022-02-16 16:24:51 -08:00
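The effect described in 30ec0df9cc can be pictured with a small timing helper whose warmup phase is optional. This is a hedged sketch: the function below and its `warmup_time_sec` parameter are assumptions for illustration and may not match the signature of Ray's own `timeit` helper.

```python
import time


def timeit(fn, repeat=5, warmup_time_sec=1.0):
    """Return calls per second for fn, optionally after a warmup phase.

    All names here are illustrative, not the signature of Ray's helper.
    """
    if warmup_time_sec > 0:
        deadline = time.perf_counter() + warmup_time_sec
        while time.perf_counter() < deadline:
            fn()  # warmup calls are executed but never timed
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    elapsed = time.perf_counter() - start
    return repeat / elapsed if elapsed > 0 else float("inf")


calls = []
# warmup_time_sec=0 skips the warmup, so fn runs exactly `repeat` times.
rate = timeit(lambda: calls.append(1), repeat=5, warmup_time_sec=0)
```

Skipping the warmup matters for benchmarks where every call has a side effect that should be counted, such as creating placement groups.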
Jun Gong
a9147bb62c
[Release Test] Fix AnyscaleSDK construction so we can run CI on staging instance. (#22325) 2022-02-16 09:56:02 -08:00
SangBin Cho
42361a1801
[Test] Fix Dask on Ray 1 TB bug #22431
Fixes a bug. It seems that `not df` does not work with a DataFrame.
2022-02-17 02:44:36 +09:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can specify a) a different commit hash, b) a wheels URL (which we will also wait for to be available), or c) a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches the old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
SangBin Cho
2ed5bb7a5f
[Nightly Test] Addressed client failure properly (#22438)
When the client returns a non-zero exit code, we should raise a RuntimeError so that errors propagate properly.
2022-02-16 09:03:17 -08:00
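The pattern behind 2ed5bb7a5f, turning a non-zero exit code into an exception, might look like the sketch below. The helper name `run_client_command` is hypothetical and the nightly test code itself may be structured differently; only the standard-library `subprocess.run` call is assumed.

```python
import subprocess
import sys


def run_client_command(argv):
    """Run a command and raise RuntimeError if it exits with a non-zero code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# A succeeding command returns its stdout; a failing one raises.
output = run_client_command([sys.executable, "-c", "print('ok')"])
```

Raising instead of returning the code means a caller cannot silently ignore a failed client run.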
Archit Kulkarni
606e2b2cde
Update license for MLflow's conda utils and virtualenv-clone (#22402)
When we vendor third-party code, we should update the LICENSE file.  Previously we vendored two pieces of code:
- conda utilities from MLflow
- virtualenv-clone
But we only included the attribution in the relevant source files, not in our LICENSE file.  This PR adds the necessary info to our LICENSE file.
2022-02-16 10:00:23 -06:00
Jun Gong
04dd536987
[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426) 2022-02-16 13:06:09 +01:00
Hao Chen
f2bbcf5adc
Fix test_traceback incompatibility with pytest 6.x (#22375)
Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-02-16 18:04:19 +08:00
Qing Wang
7c45d1a366
[doc][Java] Add doc page for java concurrency group. (#21600)
Add document page for Java concurrency group.

Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-02-16 17:57:03 +08:00
Eric Liang
92550500bc
Split workflow and dataset tests (#22415) 2022-02-16 01:47:55 -08:00
Archit Kulkarni
63a5eb492d
Revert "[serve] Add basic REST API to dashboard (#22257)" (#22414)
This reverts commit f37f35c5da.
2022-02-15 21:47:50 -06:00
Eric Liang
10172d8663
Add more codeowners to datasets (#22409) 2022-02-15 15:44:20 -08:00
mwtian
839bc5019f
Fix building Windows wheels (#22388) (#22391)
This fixes the Windows wheel build issue on the master and releases/1.11.0 branches. If the issue happens more often, we can try running iwyu.
2022-02-15 15:24:10 -08:00
Eric Liang
2158df3a73
[data] Pre-reqs for implementing stage fusion (#22374) 2022-02-15 14:59:07 -08:00
mwtian
32035eb125
[Pubsub] increase subscriber timeout (#22394)
As mentioned in https://github.com/ray-project/ray/issues/22161#issuecomment-1039661368, increase subscriber timeout to avoid subscriber state being deleted too soon.
2022-02-15 14:48:19 -08:00
Chris K. W
9a7979d9a2
[Client] chunked get requests (#22100)
Why are these changes needed?
Switches GetObject from unary-unary to unary-streaming so that large objects can be streamed across multiple messages (currently hardcoded to 64MiB chunks). This will allow users to retrieve objects larger than 2GiB from a remote cluster. If the transfer is interrupted by a recoverable gRPC error (i.e. temporary disconnect), then the request will be retried starting from the first chunk that hasn't been received yet.

Proto changes
GetRequests now have the field start_chunk_id, to indicate which chunk to start from (useful if we have to retry a request after already receiving some chunks). GetResponses now have a chunk_id (0-indexed chunk of the serialized object), total_chunks (total number of chunks, used in async transfers to determine when all chunks have been received), and total_size (the total size of the object in bytes, used to raise user warnings if the object being retrieved is very large).

Server changes
Mainly just updating the GetObject logic to yield chunks instead of returning a single response.

Client changes
At the moment, objects can be retrieved directly from the raylet servicer (ray.get) or asynchronously over the datapath (await some_remote_func.remote()). In both cases, the request will error if the chunk isn't valid (server side error) or if a chunk is received out of order (shouldn't happen in practice, since gRPC guarantees that messages in a stream either arrive in order or not at all).

ray.get is fairly straightforward, and changes are mainly to accommodate yielding from the stub instead of taking the value directly.

await some_remote_func.remote() is similar, but to keep things consistent with other async handling, collecting the chunks is handled by a ChunkCollector, which wraps around the original callback.
2022-02-16 00:07:16 +02:00
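The chunked transfer with resumable retries described in 9a7979d9a2 can be modeled without gRPC. In this hedged Python sketch the field names `start_chunk_id`, `chunk_id`, `total_chunks`, and `total_size` come from the commit message, while `stream_chunks`, `fetch_object`, `FlakyServer`, and the use of `ConnectionError` as the recoverable error are assumptions made for illustration.

```python
def stream_chunks(payload, chunk_size, start_chunk_id=0):
    """Server side: yield the payload as numbered chunks from start_chunk_id."""
    total_chunks = max(1, -(-len(payload) // chunk_size))  # ceiling division
    for chunk_id in range(start_chunk_id, total_chunks):
        start = chunk_id * chunk_size
        yield {
            "chunk_id": chunk_id,
            "total_chunks": total_chunks,
            "total_size": len(payload),
            "data": payload[start:start + chunk_size],
        }


def fetch_object(open_stream, max_retries=3):
    """Client side: collect chunks in order, resuming after a disconnect."""
    received = []
    retries = 0
    while True:
        try:
            # Resume from the first chunk that has not been received yet.
            for chunk in open_stream(start_chunk_id=len(received)):
                if chunk["chunk_id"] != len(received):
                    raise ValueError("chunk received out of order")
                received.append(chunk["data"])
                if len(received) == chunk["total_chunks"]:
                    return b"".join(received)
            raise ValueError("stream ended before all chunks arrived")
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise


class FlakyServer:
    """Test double that drops the connection once, at a chosen chunk."""

    def __init__(self, payload, chunk_size, fail_at_chunk):
        self.payload = payload
        self.chunk_size = chunk_size
        self.fail_at_chunk = fail_at_chunk
        self.requests = []  # start_chunk_id of every request seen

    def open_stream(self, start_chunk_id=0):
        self.requests.append(start_chunk_id)
        for chunk in stream_chunks(self.payload, self.chunk_size, start_chunk_id):
            if chunk["chunk_id"] == self.fail_at_chunk:
                self.fail_at_chunk = None  # fail only once
                raise ConnectionError("simulated temporary disconnect")
            yield chunk


server = FlakyServer(bytes(range(10)), chunk_size=4, fail_at_chunk=1)
result = fetch_object(server.open_stream)
```

The second request starts at chunk 1 rather than chunk 0, which is the point of carrying `start_chunk_id` in the request.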
Edward Oakes
f37f35c5da
[serve] Add basic REST API to dashboard (#22257) 2022-02-15 15:36:58 -06:00
Edward Oakes
9c07eabab9
[serve] Remove unused filter_tag and errant str redefinition (#22400) 2022-02-15 15:33:10 -06:00
Eric Liang
df4b56d32e
[minor] Fix dataset shuffle bug on empty blocks. (#22367)
There's an edge case where we can crash if empty blocks end up in shuffle (type gets inferred as Arrow, then fails when we add list-type blocks).
2022-02-15 13:18:54 -08:00
SangBin Cho
6eace8a305
[Test] Change the default encoding to utf-8 (#22286)
Follow up - https://github.com/ray-project/ray/pull/22248#pullrequestreview-878073629
2022-02-15 11:35:48 -08:00
Jialing He
4c73560b31
[runtime env] Support clone virtualenv from an existing virtualenv (#22309)
Before this PR, we couldn't run Ray in a virtualenv, because `runtime_env` did not support creating a new virtualenv from an existing virtualenv.

More details: https://github.com/ray-project/ray/pull/21801#discussion_r796848499

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-02-15 12:51:01 -06:00
Chen Shen
4ad1fba100
[refactor cluster-task-manage 3/n] Separate stats reporting into its own file (#22359)
* wip

* refactor
2022-02-15 10:48:00 -08:00
Simon Mo
495221e7d2
[Doc] Update Serve logo for tune user guide (#22369)
We have deprecated the old logo.
2022-02-15 12:10:08 -06:00
Hao Chen
78597d3089
[train] Minor fixes on Ray Train user guide doc (#22379)
Fixes some typos and format issues.
2022-02-15 10:09:27 -08:00
Matti Picus
199bf558e2
move slow test from small (timeout 60s) to medium (timeout 300s) (#22167) 2022-02-15 09:55:30 -08:00
Gagandeep Singh
7dc097a947
Unskipped tests for Windows (#21809)
These tests are passing without issues on my Windows machine, so unskipping them to check on CI.
I will push the linting changes separately so that the test suite runs twice, to confirm that the flakiness is gone.

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-02-15 09:04:47 -08:00
Gagandeep Singh
a8341dfc29
Replace queue.Queue with multiprocessing.JoinableQueue (#21860)
The reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
The reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is at https://stackoverflow.com/a/30725121

AFAIK, this is because on Windows each process gets its own `Queue`, so nothing is shared among those processes. When `multiprocessing.Queue` is used, changes to it are shared internally via pipes, with proper locks.
2022-02-15 09:01:17 -08:00
ZhuSenlin
37ef372a10
Use shared_ptr instead of object in cluster_scheduling_resources_ to reduce rehash cost. (#22376)
1. In scheduling optimization, we should encapsulate `SchedulingResources`, `GcsNodeInfo` and other node-related information into a `NodeContext` for use, which requires that `SchedulingResources` is shareable. This PR does not involve the transformation logic of `NodeContext`, but only makes `SchedulingResources` shareable.
2. `cluster_scheduling_resources_` holds raw `SchedulingResources` objects, which adds some overhead on rehash (even though std::move is used during rehash).

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-02-15 23:43:59 +08:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
Jun Gong
b729a9390f
[RLlib] Add example commands for using setup-dev.py with RLlib for improved dev setup stability and developer experience. (#22380) 2022-02-15 12:00:36 +01:00