Commit graph

6177 commits

Author SHA1 Message Date
Antoni Baum
d5284a740c
[tune] Remove Trainable.update_resources (#22471) 2022-02-25 08:38:34 -08:00
xwjiang2010
d4a1bc7bc7
Revert "[runtime env] runtime env inheritance refactor (#22244)" (#22626)
Breaks train_torch_linear_test.py.
2022-02-25 08:42:30 -06:00
shrekris-anyscale
e85540a1a2
[serve] Expose deployment statuses in REST API (#22611) 2022-02-25 08:41:07 -06:00
Dmitri Gekhtman
b2b442297e
[autoscaler] Fix initialization artifacts (#22570)
This PR fixes initialization artifacts related to the load metrics summary and autoscaler summary.

Load metrics summaries are defined to be Falsey if the autoscaler has never received a resource message from the GCS.
We skip most autoscaler actions if load metrics is Falsey, because it doesn't make sense to autoscale without load metrics. This also allows us to execute the TODO here: #22348 (comment) and remove the time.wait().

As for the autoscaler summary, it is possible for autoscaler.summary() to error outside of an autoscaler update in this scenario:
The very first call to NodeProvider.non_terminated_nodes fails, self.non_terminated_nodes remains a None object, and autoscaler.summary() fails trying to get an attribute of this None object.
The result is a confusing error message, as in #22515. This PR fixes that.

Closes #22515
2022-02-24 20:05:44 -08:00
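For context, the two guards described in that commit can be sketched roughly as follows (schematic Python; the helper names and attributes are illustrative, not the actual autoscaler code):

```
from typing import Optional


def maybe_autoscale(load_metrics, autoscaler) -> None:
    # Skip the whole update if load metrics are falsey, i.e. the autoscaler
    # has never received a resource message from the GCS. Autoscaling without
    # load information would be meaningless.
    if not load_metrics:
        return
    autoscaler.update()


def safe_summary(autoscaler) -> Optional[str]:
    # Guard against the initialization artifact described above: if the very
    # first NodeProvider.non_terminated_nodes() call failed, there is nothing
    # to summarize yet, so return None instead of raising on a None attribute.
    if getattr(autoscaler, "non_terminated_nodes", None) is None:
        return None
    return str(autoscaler.summary())
```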
Simon Mo
bfb619a127
[xlang] Allow Python to call overloaded methods with differing number of parameters (#21410) 2022-02-24 16:51:38 -08:00
Jiao
3c707f70cc
[2/X][Pipeline] Add python generation for ClassNode (#22617)
- Added backbone of ray dag -> serve dag transformation and deployment extraction.
- Added util functions for deployment unique name generation, ray_actor_options, replacement of DeploymentNode with a deployment handle, etc.
2022-02-24 16:01:35 -06:00
Eric Liang
533a0440a6
Improve actor pool support in Datasets (#22574) 2022-02-24 12:01:36 -08:00
Amog Kamsetty
02cb974c6c
[Train] Fix fault tolerance for Tensorflow (#22508)
Soft restarts don't work for TensorFlow since there is still some leftover communication state in the actors, which may lead to undefined behavior such as training hanging.

Instead, this PR changes the failure handling for TensorFlow to match Torch and Horovod, and recreates all the workers in case of failure. It also adds a test to check that fault tolerance works correctly for an actual TensorFlow example. When testing locally, the test failed before the change and passes after it.
2022-02-24 11:50:20 -08:00
Simon Mo
b8c28d1f2b
[Tune] Make sure tune.run can run inside worker thread (#22566) 2022-02-24 09:50:42 -08:00
shrekris-anyscale
a9ede4e499
[serve] Add REST API (#22578)
This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.
2022-02-24 10:00:26 -06:00
Liu Bao
6a9a28612c
[runtime env] Async pip runtime env (#22381)
In order to initialize runtime envs concurrently, this PR makes the pip runtime env asynchronous. It includes:

- [x] New `check_output_cmd` in runtime env utils.
- [x] Async PipProcessor.
- [x] The `asynccontextmanager` from `https://github.com/python-trio/async_generator` for Python 3.6
- [x] Remove pip runtime env lock.
- [x] Disable pip cache.

Co-authored-by: 刘宝 <po.lb@antfin.com>
2022-02-24 11:03:40 +08:00
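As an illustration of the async approach described in #22381, a minimal `check_output_cmd`-style helper built on asyncio subprocesses might look like the sketch below (the function name and signature are assumptions for illustration, not the exact utility added in that PR):

```
import asyncio
import sys


async def check_output_cmd(cmd, cwd=None, env=None):
    # Run a command without blocking the event loop and return its stdout,
    # raising on a non-zero exit code. Hypothetical helper mirroring the idea
    # described in the commit; the real signature may differ.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        cwd=cwd,
        env=env,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    stdout, _ = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(
            f"Command {cmd} failed with exit code {proc.returncode}:\n"
            f"{stdout.decode()}"
        )
    return stdout.decode()


async def install_requirements(requirements_file, target_dir):
    # Install pip packages asynchronously; --no-cache-dir matches the
    # "disable pip cache" item above.
    return await check_output_cmd(
        [sys.executable, "-m", "pip", "install", "--no-cache-dir",
         "-r", requirements_file, "--target", target_dir]
    )
```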
Eric Liang
e15a419028
Enable stage fusion by default for dataset pipelines (#22476)
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
2022-02-23 17:34:05 -08:00
Eric Liang
a62a9c38fb
Fix [Bug] Splitting Dataset when shards < n hangs or errors (#22559) 2022-02-23 15:54:25 -08:00
Edward Oakes
5a21289a34
[runtime_env] Remove get_current_runtime_env from docs (#22594)
We should just encourage people to use the existing `get_runtime_context` API instead of introducing a new one here. Just removing the docs for now while we discuss this.
2022-02-23 16:53:52 -06:00
Eric Liang
fc75d17701
Fix [Bug] DatasetPipeline .iter_epochs() can lead to infinite loops (#22572) 2022-02-23 13:35:31 -08:00
Siyuan (Ryans) Zhuang
f6f0fea102
Symlink workflow for development (#22554) 2022-02-23 13:14:05 -08:00
Siyuan (Ryans) Zhuang
2e0186a5b6
[workflow] Checkpoint API (#19406)
checkpoint API

* Ensure commit_step only does checkpointing
2022-02-23 13:09:08 -08:00
Chris K. W
3371e78d2e
[client] Chunk PutRequests (#22327)
Why are these changes needed?
Data from PutRequests is chunked into 64MiB messages over the datastream, to avoid the 2GiB message size limit from gRPC. This will allow users to transfer objects larger than 2GiB over the network.

Proto changes
Put requests now have fields for chunk_id to identify which chunk data belongs to, total_chunks to identify the total number of chunks in the object, and total_size for total size in bytes of the object (useful for raising warnings).

PutObject is still unary-unary. The dataservicer handles reassembling the chunks before passing the result to the underlying RayletServicer.

Dataclient changes
If a put request is inserted into the request queue, self._requests will chunk it lazily. Doing this lazily is important since inserting all of the chunks onto the request queue immediately would double the amount of memory needed to handle a large request. This also guarantees that the chunks of a given PutRequest will be contiguous.

Dataservicer changes
The dataservicer now maintains some state to track received chunks. Once all chunks for a PutRequest are received, the combined chunks are passed to the RayletServicer.
2022-02-23 18:21:25 +02:00
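Schematically, the chunking and reassembly described in #22327 reduce to something like the following sketch (field names mirror the proto description above; the helpers are illustrative, not the actual client code):

```
OBJECT_TRANSFER_CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, per the description above


def chunk_put_request(data: bytes, chunk_size: int = OBJECT_TRANSFER_CHUNK_SIZE):
    # Lazily yield (chunk_id, total_chunks, total_size, payload) tuples.
    # Yielding lazily avoids materializing every chunk at once, which would
    # roughly double the memory needed for a large put.
    total_size = len(data)
    total_chunks = max(1, (total_size + chunk_size - 1) // chunk_size)
    for chunk_id in range(total_chunks):
        start = chunk_id * chunk_size
        yield chunk_id, total_chunks, total_size, data[start:start + chunk_size]


def reassemble(chunks):
    # The receiving side buffers chunks until all have arrived, then joins
    # them in chunk_id order before handing the object to the servicer.
    chunks = sorted(chunks, key=lambda c: c[0])
    assert chunks, "no chunks received"
    assert len(chunks) == chunks[0][1], "not all chunks received yet"
    return b"".join(payload for _, _, _, payload in chunks)
```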
Jiao
a20748f83a
[1/X][Pipeline] Add deployment nodes (#22549)
Ray DAG Changes
- Restructured and resolved circular imports in the current dag_node.py.
- Moved `__str__` to each DAGNode subclass level with centralized utils imports
- Removed restrictions on binding `InputNode` to `FunctionNode` and `ClassMethodNode`
- Moved `_contain_input_node` to only `ClassNode` and `DeploymentNode`

Serve DAG Changes
- Added DeploymentNode
  - Cannot be directly constructed
  - Holds the deployment func or class body as well as a handle that trivially maps to the `__call__` method (matching current behavior)
  - Upon accessing an attribute, it will spawn a DeploymentMethodNode with `other_args_to_resolve` passed in to differentiate the sync handle type from others
- Added DeploymentMethodNode
  - Holds the args and the deployment handle
  - Executing it translates to a deployment handle call on the method.
2022-02-23 09:56:24 -06:00
Jiajun Yao
82443aec63
Remove DEFAULT_SCHEDULING_STRATEGY and SPREAD_SCHEDULING_STRATEGY (#22558) 2022-02-22 21:34:21 -08:00
Stephanie Wang
abf2a70a29
[core] Add task and object reconstruction status to ray memory (#22317)
Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:
```
  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:
```
IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
192.168.4.22  | 279475   | Driver  | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000


```
2022-02-22 21:26:21 -08:00
shrekris-anyscale
40fa56f40c
[serve] Add JSON schemas for REST API (#22547) 2022-02-22 21:36:42 -06:00
mwtian
9a157dfe82
[GCS-Ray] update doc and error message for GCS-Ray (#22528)
Update documentation to reflect that Ray no longer starts Redis by default.
2022-02-22 17:56:30 -08:00
Eric Liang
12dcec8b38
Fix [Datasets] iter_epochs not iterating using native format 2022-02-22 15:47:16 -08:00
SangBin Cho
36a31cb6fd
[Usage Stats] Implement usage stats report "Turned off by default". (#22249)
This is the second PR to implement usage stats on Ray. Please refer to the file usage_lib.py for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

This adds a dashboard module to enable usage stats. **Usage stats reporting is turned off by default** after this PR. We can control the report (enablement, report period, and URL; note that the URL is strictly for testing) using environment variables.

## NOTE
This requires us to add `requests` as a default dependency. `requests` should be okay to include because
1. It is extremely lightweight; it is implemented only with built-in libs.
2. It is really stable. The project basically claims it is "deprecated", meaning no new features will be added there.

cc @edoakes @richardliaw for the approval

For the HTTP request, I alternatively considered httpx, but it was not as lightweight as `requests`, so I decided to implement async requests using a thread pool.
2022-02-22 15:32:02 -08:00
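The thread-pool approach mentioned above amounts to offloading the blocking `requests` call to an executor; a minimal sketch, with a placeholder URL and payload rather than the real report schema:

```
import concurrent.futures

import requests

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)


def report_usage_async(url: str, payload: dict, timeout: float = 10.0):
    # Fire off a usage-stats report without blocking the caller.
    # Returns a Future; errors can be inspected via future.exception().
    return _executor.submit(requests.post, url, json=payload, timeout=timeout)


# Example (hypothetical endpoint and payload, for illustration only):
# future = report_usage_async("https://usage-stats.example.com/report",
#                             {"ray_version": "2.0.0.dev0"})
```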
Antoni Baum
a1230b9291
[tune] Note TPESampler performance issues in docs (#22545) 2022-02-22 15:29:12 -08:00
Edward Oakes
58e5f0140d
[jobs] Rename JobData -> JobInfo (#22499)
`JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.
2022-02-22 16:18:16 -06:00
Yi Cheng
e3051ebf67
[ci] Fix grpcio 1.44 break test_output (#22494)
This PR limits grpcio to <= 1.42. This will fix test_output.
2022-02-22 13:59:25 -08:00
Dmitri Gekhtman
a402e956a4
[KubeRay] Format autoscaling config based on RayCluster CR (#22348)
Closes #21655. At the start of each autoscaler iteration, we read the Ray Cluster CR from K8s and use it to extract the autoscaling config.
2022-02-22 11:06:37 -08:00
Antoni Baum
4a15c6f8f3
[tune] Preparation for deadline schedulers (#22006) 2022-02-22 11:05:28 -08:00
Matti Picus
dfe4706d73
re-remove unused opencv-python-headless (#22470)
PR #16929 removed opencv-python-headless.
PR #17158 added it back but did not use it. This was noted by [a reviewer](https://github.com/ray-project/ray/pull/17158#issuecomment-982976429) since it breaks python3.9 (no wheel is available for installation).
2022-02-22 09:45:30 -08:00
Gagandeep Singh
4de1886ad5
Unskipped tests in test_object_spilling, test_object_spilling_2, test_get_locations (#22208)
Mostly cluster tests are enabled in this PR in the above-mentioned files. Some non-cluster tests are also enabled. All of these pass on my machine without issues.
2022-02-22 09:41:26 -08:00
Guyang Song
5783cdb254
[runtime env] runtime env inheritance refactor (#22244)
Runtime Environments are already GA in Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). And now, we already support an [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc):
- The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment.
- Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent.

We think this runtime env merging logic is complex and confusing to users because they can't know the final runtime env before the jobs are run.

This PR refactors and changes the behavior of Runtime Environments inheritance. Here is the new behavior:
- **If there is no runtime env option when we create an actor, inherit the parent runtime env.**
- **Otherwise, use the specified runtime env directly and don't do any merging.**

Adds a new API named `ray.runtime_env.get_current_runtime_env()` to get the parent runtime env and modify the dict yourself. Like:
```Actor.options(runtime_env=ray.runtime_env.get_current_runtime_env().update({"X": "Y"}))```
This new API can also be used in Ray Client.
2022-02-21 18:13:22 +08:00
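Note that this is the commit reverted near the top of this log. As a usage sketch of the API it describes: since `dict.update()` returns `None`, the parent env would typically be copied and modified before being passed to `options()` (the return type and field names below are assumptions based on the commit text, not a verified API):

```
import ray


@ray.remote
class Worker:
    def ping(self):
        return "pong"


def spawn_with_extra_env_var():
    # get_current_runtime_env() is described above as returning the parent
    # runtime env dict; copy it rather than mutating it in place, since
    # dict.update() returns None and cannot be chained inside options().
    runtime_env = dict(ray.runtime_env.get_current_runtime_env())
    env_vars = dict(runtime_env.get("env_vars", {}))
    env_vars["X"] = "Y"
    runtime_env["env_vars"] = env_vars
    return Worker.options(runtime_env=runtime_env).remote()
```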
Gagandeep Singh
3cb85859cd
Unskipped tests for Windows (#21702)
This set of tests passes without issues on Windows for me, so they are unskipped here.
2022-02-20 11:48:59 -08:00
Clark Zinzow
76e8247d4d
[Datasets] Force local metadata resolution when unserializable Partitioning object provided. (#22477) 2022-02-18 21:21:34 -08:00
Amog Kamsetty
04feea4afe
[rllib] Upper bound gym version (#22510)
gym had a 0.22 release today, which is breaking a lot of the RLlib tests and examples. This temporarily pins the gym version for now.
2022-02-18 17:39:22 -08:00
Jiajun Yao
6a17653ba7
API stability annotations for ray commands (#22420)
Annotate ray commands that are intended to be public.
2022-02-18 17:13:36 -08:00
Guyang Song
57a94aae12
[runtime env][bugfix] Fix runtime env retry (#22495)
- Bug: `error_message` is not cleared when the retry succeeds. This bug led to runtime env creation failing.
- Add a test case for this.
2022-02-18 17:09:06 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling (#21303)
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
2022-02-18 15:05:35 -08:00
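Conceptually, the "simple round robin" selection can be pictured as cycling an index over the candidate nodes, as in this toy sketch (the real implementation lives in the C++ scheduler; the names here are illustrative):

```
import itertools
from typing import Dict, List, Optional


class RoundRobinSpreadScheduler:
    # Toy illustration of the round-robin spread policy described above;
    # it only shows the selection idea, not resource accounting or spillback.

    def __init__(self, node_ids: List[str]):
        self._node_ids = node_ids
        self._cycle = itertools.cycle(node_ids)

    def pick_node(self, feasible: Dict[str, bool]) -> Optional[str]:
        # Walk the ring at most once and return the first feasible node.
        for _ in range(len(self._node_ids)):
            node_id = next(self._cycle)
            if feasible.get(node_id, False):
                return node_id
        return None
```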
mwtian
5a4c6d2e88
[Core] release GIL when running parallel_memcopy() / memcpy() during serializations (#22492)
While investigating #22161, it was observed that the GIL is held for an extended amount of time (up to 1000s) with stack trace [1]. It is possible that either there are many iterations within `Pickle5Writer.write_to()` calling `ray::parallel_memcopy()`, or a few `ray::parallel_memcopy()` calls take a long time (less likely). Either way, `ray::parallel_memcopy()` or `std::memcpy()` should not hold the GIL.
2022-02-18 14:11:12 -08:00
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint, which can be any shell command. As such, a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Archit Kulkarni
1f160114a0
[serve] [CI] change serve:test_runtime_env from medium to large (#22474)
This test was timing out occasionally.
2022-02-18 08:50:47 -06:00
Archit Kulkarni
df85d31095
[Serve] Make handle serializable (#22473) 2022-02-17 17:29:44 -08:00
Ian Rodney
c9a4b17f99
[YAMLs] Fix comments about autoscaler round-robining (#22002) 2022-02-17 13:59:05 -08:00
SangBin Cho
4ecb2afc2c
[State] Add pid to the actor table data. (#22434)
Users have requested the ability to get the pid of actors using ray.state.actors. This PR addresses that.
2022-02-17 06:22:29 -08:00
Eric Liang
786c5759de
[data] Stage fusion optimizations, off by default (#22373)
This PR adds the following stage fusion optimizations (off by default). In a later PR, I plan to enable this by default for DatasetPipelines.
- Stage fusion: Whether to fuse compatible OneToOne stages.
- Read stage fusion: Whether to fuse read stages into downstream OneToOne stages. This is accomplished by rewriting the read stage (LazyBlockList) into a transformation over a collection of read tasks (BlockList -> MapBatches(do_read)).
- Shuffle stage fusion: Whether to fuse compatible OneToOne stages into shuffle stages that support specifying a map-side block UDF.

Stages are considered compatible if their compute strategy is the same ("tasks" vs "actors"), and they have the same Ray remote args. Currently, the PR is ignoring the remote args of read tasks, but this will be fixed as a followup (I didn't want to change the read tasks default here).
2022-02-16 21:08:27 -08:00
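The compatibility rule above boils down to a simple predicate; a schematic sketch (the stage representation here is a plain dict for illustration, not the Datasets internals):

```
def can_fuse(up_stage, down_stage) -> bool:
    # Toy predicate mirroring the fusion rule described above: stages are
    # fusable when they use the same compute strategy ("tasks" vs "actors")
    # and identical Ray remote args.
    if up_stage["compute"] != down_stage["compute"]:
        return False
    return up_stage["ray_remote_args"] == down_stage["ray_remote_args"]


# Example: a read stage rewritten as MapBatches(do_read) can fuse with a
# downstream map stage that also runs as tasks with the same remote args.
read_stage = {"compute": "tasks", "ray_remote_args": {"num_cpus": 1}}
map_stage = {"compute": "tasks", "ray_remote_args": {"num_cpus": 1}}
assert can_fuse(read_stage, map_stage)
```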
Yi Cheng
e10a2fbcf9
[workflow] Move test_basic_workflows_2.py to large test (#22416)
test_basic_workflows_2.py times out. Move it to the large test suite.
2022-02-16 17:05:02 -08:00
Yi Cheng
83257a4193
Revert "[Client] chunked get requests" (#22455)
Reverts ray-project/ray#22100

linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.
2022-02-16 16:43:43 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression #22441
We added a warmup time in timeit, which affects the pg benchmark time accounting. Add an option to cancel warmup.
2022-02-16 16:24:51 -08:00