Commit graph

6315 commits

Author SHA1 Message Date
shrekris-anyscale
bc82e2d5c4
[serve] Restore "[serve] Support working_dir in serve run (#22760)" (#22971) 2022-03-09 21:31:23 -08:00
Dmitri Gekhtman
19b4281991
[KubeRay] Pin autoscaler image (#22987)
Sets the autoscaler image to the one from this PR's commit.
#22847
2022-03-09 20:38:37 -08:00
Dmitri Gekhtman
413fe08f87
Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847)
This PR consists of the following clean-up items for KubeRay autoscaler integration:

Remove the docker/kuberay directory

Move the Python files formerly in docker/kuberay to the autoscaler directory.

Use a rayproject/ray image for the autoscaler.

Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.

Slightly simplify the code that starts the autoscaler.

Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.

By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.

Add the autoscaler configuration test to the CI.

Update development documentation to reflect the changes in this PR.
2022-03-09 18:26:57 -08:00
Jiao
3546aabefd
[7/X][Pipeline] pipeline user facing build function (#22934) 2022-03-09 16:11:11 -08:00
Simon Mo
34ffc7e5cf
[Serve] [3/3 Wrappers] Add Model Wrapper with ray.ml (#22915) 2022-03-09 16:06:59 -08:00
Simon Mo
c844c706bf
[Serve] Use starlette public accessor for Request (#22957) 2022-03-09 13:25:03 -08:00
Jiao
ea9069fef4
[6/X][Pipeline] Add HTTP ingress to serve pipeline (#22878) 2022-03-09 11:39:15 -08:00
Simon Mo
3c4827e0b2
[Serve] [2/3 Wrappers] Add Basic HTTP Adapters (#22914) 2022-03-09 11:36:46 -08:00
Antoni Baum
2ead945438
[datasets] Make label_column optional in to_tf (#22916)
Makes the `label_column` argument in `Dataset.to_tf` optional so that it can be used for prediction.
2022-03-09 11:34:18 -08:00
shrekris-anyscale
61e132b478
[serve] Split test_deploy (#22908)
`test_deploy` has become [flaky](https://flakey-tests.ray.io/#) due to timeouts. Since `test_deploy` is already a "large" test, this change splits it into two test files instead of simply increasing the timeout.
2022-03-09 12:22:51 -06:00
Kai Fricke
b267be4758
[ml] Add Ray ML / AIR checkpoint implementation (#22691)
This PR splits up the changes in #22393 and introduces an implementation of the ML Checkpoint interface used by Ray Tune.

This means that the TuneCheckpoint class implements the to/from_[bytes|dict|directory|object_ref|uri] conversion functions, as well as higher-level functions to transition between the different TuneCheckpoint classes. It also includes test cases for Tune's main conversion modes, i.e. dict - intermediate - dict and fs - intermediate - fs.

These changes will be the basis for refactoring the tune interface to use TuneCheckpoint objects instead of TrialCheckpoints (externally) and instead of paths/objects (internally).
2022-03-09 10:02:59 -08:00
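The dict - intermediate - dict conversion mode described in #22691 can be illustrated with a minimal sketch. `SketchCheckpoint` and its methods are hypothetical names chosen for illustration, and pickling is an assumed serialization choice; this is not the Ray implementation.

```python
import pickle
from typing import Any, Dict


class SketchCheckpoint:
    """Hypothetical checkpoint that keeps its state as an in-memory dict."""

    def __init__(self, data: Dict[str, Any]):
        self._data = dict(data)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchCheckpoint":
        return cls(data)

    def to_dict(self) -> Dict[str, Any]:
        return dict(self._data)

    @classmethod
    def from_bytes(cls, blob: bytes) -> "SketchCheckpoint":
        # Assumption: the byte form is simply the pickled dict.
        return cls(pickle.loads(blob))

    def to_bytes(self) -> bytes:
        return pickle.dumps(self._data)


original = {"step": 3, "weights": [0.1, 0.2]}
# dict -> bytes (intermediate form) -> dict
restored = SketchCheckpoint.from_bytes(
    SketchCheckpoint.from_dict(original).to_bytes()
).to_dict()
```

A round trip through the intermediate form should give back an equal, but not identical, dict.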
Eric Liang
79a3b56015
[ml] Improve the documentation of ml common classes; add kwargs to predictor (#22936) 2022-03-09 10:01:20 -08:00
Simon Mo
77ead01b65
[Serve] [1/3 Wrappers] Allow @serve.batch to accept args and kwargs (#22913) 2022-03-09 09:15:57 -08:00
Kai Fricke
15601ed79b
Revert "[serve] Support working_dir in serve run (#22760)" (#22956)
This reverts commit ab2741d64b.

The PR breaks ray job submission for anyscale:// URLs
2022-03-09 17:04:46 +00:00
Jiajun Yao
069f5f467c
[Test] Fix and enable test_logging.py (#22904)
Fix and enable test_logging.py
2022-03-09 09:01:38 -08:00
ZhuSenlin
a15890be58
[GCS] refactor the resource related data structures on the GCS (#22924)
* refactor resource data structure in gcs

* fix comment

* fix lint error

* fix

* DISABLED_TestRejectedRequestWorkerLeaseReply as it depends on the update of normal task

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-03-09 08:22:02 -08:00
matthewdeng
6b0169b23d
[ml] enable CI tests (#22926)
Follow-up to #22748, enabling tests in CI.

Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.

Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
2022-03-09 14:31:53 +00:00
Jialing He
795b5787dc
[runtime env][bug] Fix RuntimeEnv ignoring eager_install when _validate is True (#22935)
When _validate is True, RuntimeEnv will ignore field eager_install.
2022-03-09 20:16:55 +08:00
Siyuan (Ryans) Zhuang
b621dc099b
[DAG] Update the example in the doc (#22930)
* update doc
2022-03-08 20:09:45 -08:00
Guyang Song
56287d63e5
[runtime env] remove _rewrite_pip_list_ray_libraries (#22890)
We don't need this logic after using virtualenv.
2022-03-09 11:41:33 +08:00
Stephanie Wang
bf09f5071a
[core] Deflake test_plasma_unlimited (#22911)
test_plasma_unlimited::test_task_unlimited is flaky because one of the assertions is race-y and can trigger after the condition is no longer true (see #22883). This fixes the flake by:
- adding an assertion in between two object allocations to force the object store queue to flush
- keeping one of the ObjectRefs in scope to make sure that the object is still fallback-allocated by the time we reach the failing assertion
2022-03-08 22:00:04 -05:00
Junwen Yao
0395d0987e
[Train] Add support for automatic pipelining of host to device transfer (#22716)
This PR adds the support for concurrently transferring the input from host to device.
2022-03-08 18:37:23 -08:00
Balaji Veeramani
48af260aaf
[Train] Clarify shuffle documentation in prepare_data_loader (#22876)
We essentially use a hack to determine whether shuffling should be enabled in prepare_data_loader. I've clarified the documentation so the hack is easier to understand.
2022-03-08 18:13:29 -08:00
Eric Liang
52491c87e2
Make a pass fixing Dataset API issues (#22886) 2022-03-08 13:07:55 -08:00
shrekris-anyscale
ab2741d64b
[serve] Support working_dir in serve run (#22760)
#22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.
2022-03-08 13:18:41 -06:00
Junwen Yao
d1009c8489
[Train] Add support for metrics aggregation (#22099)
This PR allows users to aggregate metrics returned from all workers.
2022-03-08 11:03:04 -08:00
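As a rough illustration of aggregating metrics returned from all workers (#22099), the sketch below averages each metric across per-worker result dicts. The function name `aggregate_metrics` and the choice of the mean as the aggregation are assumptions; the commit message does not say which aggregations Ray Train provides.

```python
from statistics import mean
from typing import Dict, List


def aggregate_metrics(per_worker: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over the workers that reported it."""
    keys = sorted({key for metrics in per_worker for key in metrics})
    return {
        key: mean(metrics[key] for metrics in per_worker if key in metrics)
        for key in keys
    }


# Hypothetical results reported by two workers.
worker_results = [{"loss": 0.5, "acc": 0.75}, {"loss": 0.25, "acc": 1.0}]
aggregated = aggregate_metrics(worker_results)
```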
Balaji Veeramani
37c6169027
[Train] Refactor and add Accelerator classes (#22009)
To support mixed precision (see #20643), we need to store a GradScaler instance that is accessible by both the prepare_optimizer and backward functions (these functions will be added later).

This PR introduces the Accelerator, an object that implements methods to perform backend-specific training optimizations.
2022-03-08 10:26:00 -08:00
Balaji Veeramani
04b10ff9e9
[Train] Tell user to specify cluster address if placement group times out (#22845)
If you don't add `ray.init("auto")` to your training script, then your training script might complain that there aren't enough resources, even if `ray status` shows that there are.

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-08 10:24:12 -08:00
matthewdeng
7b5813e94f
[ml] add initial Dataset Preprocessors (#22748) 2022-03-08 09:59:03 -08:00
Gagandeep Singh
2899dc1bb5
Fixed MRO for DerivedActorClass (#22113)
Comments to be noted from the discussion below,

https://github.com/ray-project/ray/pull/22113#discussion_r802512907

> Problem - We cannot always delegate the call to cls.__init__ or to modified_cls.__init__. If we always delegate to cls.__init__ from here, the user-defined class's __init__ method will be ignored, leading to issues like https://github.com/ray-project/ray/issues/21868. If we always delegate to modified_cls.__init__, inheriting from actor classes becomes possible, which makes test_actor_inheritance fail. So I have added this if-else check to decide which __init__ method should be called. If "__module__", "__qualname__" and "__init__" are present in args[-1], an actor class is being inherited, so cls.__init__ should be called. If no such signal is present in args, the user-defined class's __init__, i.e. modified_class.__init__, should be called.

https://github.com/ray-project/ray/pull/22113#discussion_r808696261

> So I noted that ActorClass.__init__ will anyway raise a TypeError whenever it will be inherited. To exactly figure out whether the exception is due to inheritance of ActorClass, I created a new class ActorClassInheritanceException(TypeError). Now, whenever this will be raised, then DerivedActorClass will get a clear signal about inheritance of ActorClass. In other cases, it will be safe to conclude (AFAICT) that user called __init__ method of their class and we will proceed normally. IMHO, this is a better and more robust solution which just depends on a simple signal i.e., raising a particular exception in a specific event. It doesn't matter how inheritance is prevented as in the end we just need to raise ActorClassInheritanceException and all other code will be able to detect that easily.

https://github.com/ray-project/ray/pull/22113#issuecomment-1048527387
2022-03-08 09:37:19 -08:00
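The idea from #22113 of signaling inheritance with a dedicated `TypeError` subclass can be sketched in plain Python. Only the name `ActorClassInheritanceException` comes from the commit message. `SketchActorClass` is hypothetical, and the sketch uses `__init_subclass__` to detect subclassing, which is a stand-in technique and not how Ray's `ActorClass` does it.

```python
class ActorClassInheritanceException(TypeError):
    """Raised only when someone tries to subclass an actor class."""


class SketchActorClass:
    """Hypothetical stand-in for an actor class that must not be inherited."""

    def __init_subclass__(cls, **kwargs):
        # Assumption: any subclassing attempt raises the dedicated exception.
        raise ActorClassInheritanceException(
            f"Cannot inherit from an actor class (attempted by {cls.__name__})."
        )


def try_subclass() -> str:
    """Report which kind of failure, if any, subclassing produces."""
    try:
        type("Child", (SketchActorClass,), {})
    except ActorClassInheritanceException:
        return "inheritance detected"
    except TypeError:
        return "other TypeError"
    return "ok"


outcome = try_subclass()
```

Because the dedicated exception is caught before the generic `TypeError`, callers can tell an inheritance attempt apart from any other type error.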
xwjiang2010
f5995dccdf
[tune] Trainables will now know TUNE_ORIG_WORKING_DIR (#22803)
Also updated the docs.
2022-03-08 15:56:30 +00:00
Jiajun Yao
7f57268bd0
Fix duplicate test bazel target (#22892) 2022-03-08 14:29:13 +09:00
Jiajun Yao
4801e57c77
[Test] Add missing tests to bazel BUILD (#22827) 2022-03-07 19:54:49 -08:00
Jian Xiao
c2908de401
For a dataset comprised of both empty and non-empty blocks, let the non-empty blocks determine the schema (#22834)
There is a bug in combining the results from map_batches: if we create two datasets from the same data but with different numbers of partitions, we may get different results when running the same map_batches() on them. That is, the number of partitions affects the map_batches() results, which it should not.
2022-03-07 18:17:49 -08:00
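The rule in #22834, that non-empty blocks determine the schema, can be sketched as follows. Blocks are modeled here as plain lists of row dicts, and `infer_schema` is a hypothetical helper; Ray Datasets uses its own block and schema types.

```python
from typing import Dict, List, Optional

Block = List[Dict[str, object]]


def infer_schema(blocks: List[Block]) -> Optional[Dict[str, type]]:
    """Return the schema of the first non-empty block, or None if all are empty."""
    for block in blocks:
        if block:  # empty blocks carry no column information, so skip them
            return {name: type(value) for name, value in block[0].items()}
    return None


# The same data partitioned with empty blocks around it yields the same schema.
schema = infer_schema([[], [{"id": 1, "name": "a"}], []])
```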
Jiajun Yao
2302b4eea8
Stop and join actor asyncio threads during exit (#22810) 2022-03-07 14:45:08 -08:00
Stephanie Wang
fa14120f93
Move tests out of test_object_spilling to de-flake (#22831)
This test is timing out often in debug_mode, so moved some tests to test_object_spilling_3.
2022-03-07 17:39:55 -05:00
SangBin Cho
79e8405fda
Revert "[GCS] refactor the resource related data structures on the GCS (#22817)" (#22863)
This reverts commit 549466a42f.
2022-03-07 08:48:17 -08:00
shrekris-anyscale
15d97a1021
[serve] Support init_args and init_kwargs in serve run (#22805)
#22714 added `serve run` to the Serve CLI. This change allows the user to specify `init_args` and `init_kwargs` in `serve run` if they are deploying via import path.
2022-03-07 09:45:17 -06:00
ZhuSenlin
549466a42f
[GCS] refactor the resource related data structures on the GCS (#22817) 2022-03-07 18:43:33 +08:00
shrekris-anyscale
2490b3e383
[serve] Enable serve-decorated deployment via import path (#22839)
Currently, classes and functions can be deployed by setting `Deployment`'s`func_or_class` to their import path. However, if these classes or functions are already decorated with `@serve.deployment`, the import path deployment will error.

This change instead ignores the settings in a class or function's `@serve.deployment` decorator when deploying via import path. It takes the code definition and deploys it without erroring. It also logs a warning about the ignored settings.
2022-03-06 20:03:57 -06:00
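A minimal sketch of the behavior described in #22839: resolve an import path, and if the target is already wrapped by a decorator, deploy the underlying definition and warn about the ignored settings. `SketchDeployment`, `import_attr`, and `resolve_target` are hypothetical names, not Serve APIs.

```python
import importlib
import logging
from typing import Any

logger = logging.getLogger(__name__)


class SketchDeployment:
    """Hypothetical wrapper that a deployment decorator might produce."""

    def __init__(self, func_or_class: Any, **options: Any):
        self.func_or_class = func_or_class
        self.options = options


def import_attr(import_path: str) -> Any:
    """Resolve 'package.module.attr' to the attribute it names."""
    module_name, _, attr_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.attr', got {import_path!r}")
    return getattr(importlib.import_module(module_name), attr_name)


def resolve_target(target: Any) -> Any:
    """Return the plain definition, ignoring any decorator settings."""
    if isinstance(target, SketchDeployment):
        logger.warning("Ignoring decorator settings: %s", target.options)
        return target.func_or_class
    return target


plain = resolve_target(import_attr("json.dumps"))
unwrapped = resolve_target(SketchDeployment(len, num_replicas=2))
```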
shrekris-anyscale
521298e093
[serve] Make route prefix the deployment name by default (#22840)
The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which makes `route_prefix` the deployment's name when omitted.

This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access.

This change also fixes a bug in Serve where when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.
2022-03-06 20:03:31 -06:00
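The distinction in #22840 between an omitted `route_prefix` and an explicit `null` can be sketched with a sentinel value. `resolve_route_prefix` is a hypothetical helper, and the leading slash in the default is an assumption; the commit message only says the prefix defaults to the deployment's name.

```python
from typing import Any, Dict, Optional

_OMITTED = object()  # sentinel: distinguishes "not given" from an explicit None


def resolve_route_prefix(name: str, config: Dict[str, Any]) -> Optional[str]:
    """Return the effective route prefix, or None for no HTTP access."""
    value = config.get("route_prefix", _OMITTED)
    if value is _OMITTED:
        # Omitted: default to the deployment's name (leading slash assumed).
        return f"/{name}"
    # Explicit value, including None (YAML/JSON null), is taken as-is.
    return value


omitted = resolve_route_prefix("model", {})
explicit_null = resolve_route_prefix("model", {"route_prefix": None})
explicit = resolve_route_prefix("model", {"route_prefix": "/api"})
```

A plain `config.get("route_prefix")` could not tell the first two cases apart, which is why a sentinel is used here.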
Jiao
2d2b5745ae
[5/X][Pipeline][Ray DAG] Make Ray InputNode more powerful with attr accessor (#22793)
- Enhanced ray dag InputNode to take arbitrary user input via `.execute()`.
  - If only one value is provided, like `dag.execute(1)`, return raw value;
  - Otherwise wrap user input into a `DAGInputData` object that can be accessed via index or key.
  - User can also pass list / dict object and just access them via index [0] or key ["key"]
- Introduced `InputAttrNode` that helps to connect partial attribute of user input to the DAG. 
- Added context manager syntax for `InputNode`.
- Add InputNode enforcements with tests, such as DAG level singleton, exception with messages, etc. 
- Enforce only simple int or str key
- Take care of JSON serialization for InputNode that carried original context manager info, ensure it's preserved.
- DAGNode UUID is also preserved in JSON serde.
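
The input handling listed above (a single value is returned raw, anything else is wrapped and read by index or key) can be sketched as follows. `SketchInputData` and `wrap_input` are hypothetical names; the real `DAGInputData` class may differ.

```python
from typing import Any, Union


class SketchInputData:
    """Hypothetical container for multi-value input, read by index or key."""

    def __init__(self, *args: Any, **kwargs: Any):
        self._args = args
        self._kwargs = kwargs

    def __getitem__(self, key: Union[int, str]) -> Any:
        if isinstance(key, int):
            return self._args[key]
        if isinstance(key, str):
            return self._kwargs[key]
        # Mirrors the "only simple int or str key" enforcement.
        raise TypeError("Only int or str keys are supported.")


def wrap_input(*args: Any, **kwargs: Any) -> Any:
    """Return a single positional value as-is; wrap anything else."""
    if len(args) == 1 and not kwargs:
        return args[0]
    return SketchInputData(*args, **kwargs)


raw = wrap_input(1)
data = wrap_input(1, 2, x=3)
```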

## Next steps

On ray dag level we're proceeding with
```
with InputNode() as input:      # Probably better to rename it to DAGInput()
   a = Model.bind(input[0])
   b = Model.bind(input.x)
   dag = combine.bind(a, b)
```
But also enforces
   1) InputNode is always used in context manager as opposed to directly created
   2) There should be one and only one InputNode instance for each dag.
   3) No args passed by user to InputNode at ray dag level.

Then in serve we subclass a ServeInputNode() to enhance it like the following to support HTTP input validation and conversion:
```
with ServeInputNode(schema=MySchemaCls) as input:
   a = Model.bind(input[0])
   b = Model.bind(input.x)
   dag = combine.bind(a, b)
```

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2022-03-06 20:02:42 -06:00
Clark Zinzow
3d63313265
[Datasets] Batch across windows in DatasetPipelines. (#22830)
This PR allows `DatasetPipeline.iter_batches()` to batch data across windows in the pipeline. This prevents partial batches from popping up in the middle of consuming a dataset pipeline due to window boundaries, and now allows us to provide the following guarantee to the user: `pipe.iter_batches()` will yield `len(pipe) // batch_size` full batches, with a partial batch occurring only (1) as the final batch and (2) only if `len(pipe) % batch_size > 0`, and if it exists, will have size `len(pipe) % batch_size`.

The crux of this PR takes the block batching implementation from `Dataset.iter_batches()`, refactors it to operate on an iterator of blocks instead of a `Dataset` and pulls it out into a shared `batch_blocks()` utility, and have `DatasetPipeline.iter_batches()` use it to batch over windows by providing an iterator over all blocks in all windows.
2022-03-04 16:26:44 -08:00
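The batching guarantee in #22830 can be illustrated with a simplified sketch that batches rows across window boundaries. The function `batch_across_windows` is hypothetical and works on lists of rows; the actual `batch_blocks()` utility operates on an iterator of blocks.

```python
from itertools import chain
from typing import Iterable, Iterator, List


def batch_across_windows(
    windows: Iterable[List[int]], batch_size: int
) -> Iterator[List[int]]:
    """Yield fixed-size batches from rows that span window boundaries."""
    batch: List[int] = []
    for row in chain.from_iterable(windows):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # only the final batch may be partial
        yield batch


windows = [[0, 1, 2], [3, 4, 5], [6, 7]]  # 8 rows spread over 3 windows
batches = list(batch_across_windows(windows, batch_size=5))
```

With 8 rows and `batch_size=5`, this gives `8 // 5 = 1` full batch followed by one partial batch of size `8 % 5 = 3`, even though no window holds 5 rows by itself.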
Yi Cheng
5bbbfac5e8
[gcs] Fix resource updating incorrectly (#22644)
When the local raylet has no task of a given scheduling class being scheduled, the backlog resource is not reported. This usually happens when a core worker tries to schedule the task on another node and reports the backlog to the local node.

This will lead to the wrong demands.
2022-03-04 14:32:54 -08:00
Yi Cheng
11bbf00338
[dashboard] Remove redis in dashboard (#22788)
As we are making redisless Ray the default, the dashboard no longer needs to talk to Redis. Instead, it should talk to GCS, and GCS can talk to Redis.
2022-03-04 12:32:17 -08:00
Eric Liang
80aac655ca
Fix flaky metric test (#22809) 2022-03-03 20:44:50 -08:00
Siyuan (Ryans) Zhuang
d72350bfe6
[workflow] Fix different step directories being used for "workflow.wait" during recovery (#22782)
* add test
2022-03-03 16:37:50 -08:00
Jian Xiao
b933587597
Support map_groups in dataset (#22709)
Make Dataset capable of running map_groups(), i.e. apply a UDF on each group after a groupby() operation.
2022-03-03 15:14:00 -08:00
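The semantics of applying a UDF to each group after a groupby (#22709) can be sketched in plain Python. `map_groups_sketch` and `total_per_group` are hypothetical helpers working on lists of row dicts; they only illustrate the idea and do not use the Ray Datasets API.

```python
from itertools import groupby
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_groups_sketch(
    rows: List[Row], key: str, fn: Callable[[List[Row]], List[Row]]
) -> List[Row]:
    """Group rows by `key`, apply `fn` to each group, and concatenate the results."""
    ordered = sorted(rows, key=lambda row: row[key])  # groupby needs sorted input
    out: List[Row] = []
    for _, group in groupby(ordered, key=lambda row: row[key]):
        out.extend(fn(list(group)))
    return out


def total_per_group(group: List[Row]) -> List[Row]:
    """Example UDF: collapse a group into one row holding the sum of `v`."""
    return [{"k": group[0]["k"], "total": sum(row["v"] for row in group)}]


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
result = map_groups_sketch(rows, "k", total_per_group)
```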
mwtian
55166f0780
Revert "Revert "Disable scheduler_report_pinned_bytes_only (#22132)" (#22786)" (#22808)
This reverts commit b98c9c77f1.
2022-03-03 12:32:28 -08:00
shrekris-anyscale
71a493cf1f
[serve] Add run, delete, and status to Serve CLI (#22714)
This change adds `run`, `delete`, and `status` commands to the CLI introduced in #22648.
* `serve run`: Blocking command that allows users to deploy a YAML configuration or a class/function via import path. When terminated, the deployment(s) is torn down. Prints status info while running. Supports interactive development.
* `serve delete`: Shuts down a Serve application and deletes all its running deployments.
* `serve status`: Displays the status of a Serve application's deployments.
2022-03-03 09:50:36 -06:00