6993 commits

Author SHA1 Message Date
clarng
1a5f42742d
import sort rest of autoscaler (#25796)
Continue to import sort the rest of autoscaler.
2022-06-15 15:00:21 -07:00
Archit Kulkarni
23030dbcaa
[runtime env] Hide URI cache behind class (#24622)
Followup PR to https://github.com/ray-project/ray/pull/20273.

- Hides cache logic behind a class.
- Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class. 

Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.
2022-06-15 16:14:06 -05:00
Antoni Baum
090024c297
[AIR] Fix FailureConfig not being a dataclass (#25807) 2022-06-15 13:46:51 -07:00
Robert
b4d85a2c8a
[RuntimeEnv] Fixes spaces in paths causing failures on Windows (#25659)
This is a follow-up to the previous PR (GitHub did some funky things when I did a rebase, so I had to create a new one)

On Windows systems, the `exec_worker` method may fail due to spaces being present in arguments that are file paths. This addresses said issue.
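
A minimal sketch of the underlying problem, not the actual `exec_worker` change: when an argument list is flattened into one Windows command line, a path containing a space is split into two arguments unless it is quoted. The command and path below are hypothetical; `subprocess.list2cmdline` is the standard-library helper that applies the Windows quoting rules.

```python
import subprocess

# Hypothetical worker command; the interpreter path contains a space.
args = [r"C:\Program Files\Python39\python.exe", "-m", "worker", "--flag=1"]

# Naive joining loses the argument boundary at the space.
naive = " ".join(args)

# list2cmdline wraps any argument containing whitespace in double quotes.
quoted = subprocess.list2cmdline(args)

print(naive)
print(quoted)
```

Splitting `naive` on whitespace yields five tokens instead of four, which is the failure mode described above; `quoted` keeps the path as one argument.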
2022-06-15 15:22:17 -05:00
Clark Zinzow
526e12074a
[Datasets] Make it clear that read_parquet() does not support multiple directories. (#25747)
Unfortunately, ray.data.read_parquet() doesn't work with multiple directories since it uses Arrow's Dataset abstraction under-the-hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html

This PR makes this clear in the docs, and as a driveby, adds ray.data.read_parquet_bulk() to the API docs.
2022-06-15 13:19:39 -07:00
Ian Rodney
7800172041
[AWS] Cleanup Naming/Typing of Boto3 resources/clients (#25731)
It's a bit hard to follow if these are clients or resources so add typing + rename a mis-named function.
2022-06-15 11:57:20 -07:00
Chen Shen
8982e4d78c
Revert "[Ray Dataset] fix the type infer of pd.dataframe (when dtype is object)" (#25809)
This reverts commit f61f60f708.
2022-06-15 11:20:14 -07:00
Stephanie Wang
68be44ade1
[datasets] Avoid unnecessary metadata serialization in Datasets shuffle (#25734)
Push-based shuffle has some extra metadata involving merge and reduce tasks. Previously we were serializing O(n) metadata (n = number of reduce tasks) and sending it to tasks, which caused a lot of unnecessary plasma usage on the head node. This PR splits the metadata into parts that can be kept on the driver and a relatively cheap part that is sent to all tasks.
Related issue number

One of the issues needed for #24480.
2022-06-15 10:33:52 -07:00
Jimmy Yao
f61f60f708
[Ray Dataset] fix the type infer of pd.dataframe (when dtype is object) 2022-06-15 08:11:49 -07:00
xwjiang2010
88d824d067
[air] remove fully_executed from Tune. (#25750) 2022-06-14 22:32:48 -07:00
Chen Shen
4ecfa9374d
Revert "[Ray Dataset] fix the type infer of pd.dataframe (when dtype is object.) (#25563)" (#25790)
This reverts commit 57d02eec2e.
2022-06-14 20:46:40 -07:00
Antoni Baum
11c556f887
[Train] Remove bad arg from SklearnTrainer doc (#25773)
Removes docstring for an argument that is not present. Looks like it was introduced by mistake.
2022-06-14 19:29:49 -07:00
shrekris-anyscale
a371756b3c
[Serve] Update Serve CLI and REST API behavior to use new config (#25691) 2022-06-14 19:01:51 -07:00
clarng
badf444eda
Respect import order for psutil and setproctitle (#25780)
Sort imports in a way that preserves the ordering requirements. This PR is needed for any file changes that import psutil or setproctitle.
2022-06-14 17:44:41 -07:00
Antoni Baum
067a244c84
[AIR] Arrow support for preprocessors (#25623)
Adds a _transform_arrow method to Preprocessors that allows them to implement logic for arrow-based Datasets.

- If only _transform_arrow is implemented, will convert the data to arrow.
- If only _transform_pandas is implemented, will convert the data to pandas.
- If both are implemented, will pick the method corresponding to the format for best performance.
Implementation is defined as overriding the method in a sub-class.

This is only a change to the base Preprocessor class. Implementations for sub-classes will come in the future.
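
The dispatch rule above can be sketched as follows. This is a toy stand-in, not Ray's implementation: `choose_path`, `_implemented`, and the subclasses are hypothetical names, and only the "which method is overridden" logic is modeled, not the data conversion itself.

```python
class Preprocessor:
    """Toy base class; a method counts as implemented if a subclass overrides it."""

    def _transform_pandas(self, data):
        raise NotImplementedError

    def _transform_arrow(self, data):
        raise NotImplementedError

    def _implemented(self, name):
        # Compare the subclass attribute with the base-class function object.
        return getattr(type(self), name) is not getattr(Preprocessor, name)

    def choose_path(self, data_format):
        """Return the format ("pandas" or "arrow") the transform will run in."""
        has_pandas = self._implemented("_transform_pandas")
        has_arrow = self._implemented("_transform_arrow")
        if has_pandas and has_arrow:
            return data_format  # both available: keep the native format
        if has_arrow:
            return "arrow"  # data is converted to arrow first
        if has_pandas:
            return "pandas"  # data is converted to pandas first
        raise NotImplementedError("no transform method overridden")


class PandasOnly(Preprocessor):
    def _transform_pandas(self, data):
        return data


class Both(Preprocessor):
    def _transform_pandas(self, data):
        return data

    def _transform_arrow(self, data):
        return data
```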
2022-06-14 16:48:31 -07:00
Jimmy Yao
5f6f2d9f29
[AIR] Tf end2end CV example (#25070) 2022-06-14 16:24:38 -07:00
Sihan Wang
d4aa7691e9
[Serve] Add compact for InMemoryMetricsStore max function (#25770) 2022-06-14 13:09:30 -07:00
Jimmy Yao
57d02eec2e
[Ray Dataset] fix the type infer of pd.dataframe (when dtype is object.) (#25563)
This is a temporary fix for #25556. When the dtype of the pandas DataFrame is object, we set the dtype to None and rely on automatic type inference during the conversion.
2022-06-14 12:49:04 -07:00
Archit Kulkarni
0d8cbb1cae
[runtime env] Skip content hash for unopenable files (#25413) 2022-06-14 12:07:51 -07:00
Matti Picus
e5c5275bed
[Runtime Env] enable conda runtime creation in workers on windows (#23613) 2022-06-14 10:24:02 -07:00
Sihan Wang
c92628138b
[Serve] Disable background thread of handle without autoscaling (#25733) 2022-06-14 10:04:27 -07:00
Mark
1feb702327
Create RAY_TMPDIR if it doesn't exist (#25577)
This will prevent FileNotFoundErrors on fresh `ray up` local node provider installs.
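
A minimal sketch of the idea, assuming the directory is taken from the `RAY_TMPDIR` environment variable; `ensure_tmpdir` is a hypothetical helper and not the function changed in this PR.

```python
import os
import tempfile


def ensure_tmpdir(env_var="RAY_TMPDIR"):
    """Create the directory named by env_var if it is set; return the path or None."""
    path = os.environ.get(env_var)
    if path:
        # exist_ok=True makes the call a no-op when the directory already exists.
        os.makedirs(path, exist_ok=True)
    return path


# Usage with a throwaway location that does not exist yet.
base = tempfile.mkdtemp()
os.environ["RAY_TMPDIR"] = os.path.join(base, "ray", "tmp")
created = ensure_tmpdir()
```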

Co-authored-by: Mark Flanagan <>
2022-06-14 09:11:31 -07:00
Kai Fricke
d5541cccb1
[air] Use predict_pandas in xgboost, lightgbm, rl, huggingface, sklearn (#25759)
Switching to the _predict_pandas API implementation for xgboost, lightgbm, rl, huggingface, and sklearn predictors.
2022-06-14 14:47:37 +02:00
Kai Fricke
6313ddc47c
[tune] Refactor Syncer / deprecate Sync client (#25655)
This PR includes / depends on #25709

The two concepts of Syncer and SyncClient are confusing, as is the current API for passing custom sync functions.

This PR refactors Tune's syncing behavior. The Sync client concept is hard deprecated. Instead, we offer a well-defined Syncer API that can be extended to provide custom syncing functionality. However, the default will be to use Ray AIR's file transfer utilities.

New API:
- Users can pass `syncer=CustomSyncer` which implements the `Syncer` API
- Otherwise our off-the-shelf syncing is used
- As before, syncing to cloud disables syncing to driver

Changes:
- Sync client is removed
- Syncer interface introduced
- _DefaultSyncer is a wrapper around the URI upload/download API from Ray AIR
- SyncerCallback only uses remote tasks to synchronize data
- Rsync syncing is fully deprecated and removed
- Docker and kubernetes-specific syncing is fully deprecated and removed
- Testing is improved to use `file://` URIs instead of mock sync clients
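
To illustrate what "implementing the `Syncer` API" might look like, here is a hedged sketch. The abstract class below is a hypothetical stand-in defined locally; its method names (`sync_up`, `sync_down`) and signatures are assumptions and may not match the interface introduced by this PR.

```python
import abc
import os
import shutil
import tempfile


class Syncer(abc.ABC):
    """Hypothetical stand-in for the Syncer interface (names are assumptions)."""

    @abc.abstractmethod
    def sync_up(self, local_dir: str, remote_dir: str) -> bool:
        """Copy local_dir to remote_dir."""

    @abc.abstractmethod
    def sync_down(self, remote_dir: str, local_dir: str) -> bool:
        """Copy remote_dir to local_dir."""


class LocalCopySyncer(Syncer):
    """Custom syncer that treats the remote location as another local directory."""

    def sync_up(self, local_dir, remote_dir):
        shutil.copytree(local_dir, remote_dir, dirs_exist_ok=True)
        return True

    def sync_down(self, remote_dir, local_dir):
        shutil.copytree(remote_dir, local_dir, dirs_exist_ok=True)
        return True


# Usage: upload a directory containing one file.
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "upload")
with open(os.path.join(src, "checkpoint.txt"), "w") as f:
    f.write("step=1")
LocalCopySyncer().sync_up(src, dst)
```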
2022-06-14 14:46:30 +02:00
clarng
d971d3bde4
Fix import order that is causing CI to fail (#25728)
Fix import ordering on master.
2022-06-13 17:36:00 -07:00
Ricky Xu
b1d0b12b4e
[Core \ State Observability] Use Submission client (#25557)
## Why are these changes needed?
This refactors the interaction between the state CLI and the API server from a hard-coded request workflow to one based on `SubmissionClient`.

See #24956 for more details. 

## Summary
- Created a `StateApiClient` that inherits from the `SubmissionClient` and refactored various listing commands into class methods.

## Related issue number
Closes #24956
Closes #25578
2022-06-13 17:11:19 -07:00
Larry
679f66eeee
[Core/PG/Schedule 1/2]Optimize the scheduling performance of actors/tasks with PG specified only for gcs schedule (#24677)
## Why are these changes needed?

When scheduling actors on a PG, instead of iterating over all nodes in the cluster resources, this optimization directly queries the corresponding nodes by looking them up in the PG location index.
This reduces the complexity of the algorithm from O(N) to O(1), where N is the number of nodes. The larger the cluster, the greater the benefit.
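
The difference between scanning all nodes and consulting a location index can be sketched as below. The data structures and names (`nodes`, `pg_location_index`, the bundle tuples) are toy assumptions for illustration, not the GCS scheduler's actual types.

```python
# Toy cluster: node_id -> set of (placement_group_id, bundle_index) hosted there.
nodes = {f"node-{i}": set() for i in range(1000)}
nodes["node-42"].add(("pg-1", 0))


def find_node_by_scan(bundle):
    """O(N): iterate over every node in the cluster."""
    for node_id, bundles in nodes.items():
        if bundle in bundles:
            return node_id
    return None


# Location index kept up to date when bundles are placed: bundle -> node_id.
pg_location_index = {("pg-1", 0): "node-42"}


def find_node_by_index(bundle):
    """O(1) on average: a single dictionary lookup."""
    return pg_location_index.get(bundle)
```

Both functions return the same node; the index trades a small amount of bookkeeping at placement time for a lookup cost that no longer grows with cluster size.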

**This PR only optimizes scheduling by GCS; I will submit a PR for raylet scheduling later.**

At Ant Group, we have implemented this optimization in GCS scheduling mode and obtained the following performance test results:
1. The average time to select nodes drops from 330 us to 30 us, about 11 times faster.
2. The total time to create and execute 12,000 actors drops from 271 s to 225 s on average, a 17% reduction.

More detailed solution information is in the issue.

## Related issue number

[Core/PG/Schedule]Optimize the scheduling performance of actors/tasks with PG specified #23881
2022-06-13 15:31:00 -07:00
Clark Zinzow
ae9285eced
[Datasets] Add outputs to data generation examples in API docstrings. (#25674)
This PR adds outputs to data generation examples in the API docstrings, namely for `from_items()`, `range()`, `range_table()`, and `range_tensor()`.
2022-06-13 15:28:37 -07:00
Eric Liang
ff2cfbe351
[air] Add streaming BatchPredictor support (#25693) 2022-06-13 15:22:36 -07:00
Eric Liang
fde61a77be
[rfc] [data] SPREAD actor pool actors evenly across the cluster by default (#25705) 2022-06-13 15:16:14 -07:00
Eric Liang
1f90858c9e
[data] Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706) 2022-06-13 15:15:59 -07:00
xwjiang2010
cc53a1e28b
[air] update checkpoint.py to deal with metadata in conversion. (#25727)
This is carved out from https://github.com/ray-project/ray/pull/25558. 
tl;dr: checkpoint.py currently doesn't support the following:
```
a. from fs to dict checkpoint;
b. drop some marker to dict checkpoint;
c. convert back to fs checkpoint;
d. convert back to dict checkpoint.
Assert that the marker should still be there
```
2022-06-13 15:15:27 -07:00
clarng
73e113152b
Add import sorting to format.sh (#25678)
It will be easier to develop if we could use a tool to organize / sort imports and not have to move them around by hand.

This PR shows how we could do this with isort (black doesn't quite do this per https://github.com/psf/black/issues/333)

After this PR lands everyone will need to update their formatter to include isort if they don't have it already, i.e.

   pip install -r ./python/requirements_linters.txt 

All future file changes will go through isort and may introduce a slightly larger PR the first time as it will clean up the imports. 

The plan is to land this PR and also clean up the rest of the code in parallel by using this PR to format the codebase (so people won't get surprised by the formatter if the file hasn't been touched yet)

Co-authored-by: Clarence Ng <clarence@anyscale.com>
2022-06-13 14:08:51 -07:00
Antoni Baum
5e9a8eb5f6
[AIR/data] Move preprocessors to ray.data (#25599)
Moves ray.air.Preprocessor and ray.air.preprocessors to ray.data to converge on the agreed upon package structure discussed internally.
2022-06-13 12:57:59 -07:00
Simon Mo
7727dcdac7
[AIR][Serve] Accept predictor.predict kwargs in init (#25537) 2022-06-13 11:46:43 -07:00
Dmitri Gekhtman
5b341ee666
[KubeRay][Minor][CI] Deflake autoscaling test
Minor adjustment to e2e test logic of KubeRay test.
2022-06-13 11:00:47 -07:00
shrekris-anyscale
3278763dd7
[Serve] Start all Serve actors in the "serve" namespace only (#25575) 2022-06-13 10:31:28 -07:00
shrekris-anyscale
2950a4c37a
[Serve] Persist Serve config for REST API (#25651) 2022-06-13 09:53:21 -07:00
Jimmy Yao
7bb142e3e4
[AIR] Refactor ScalingConfig key validation (#25549)
Follow another approach mentioned in #25350.

The scaling config is now converted to the dataclass letting us use a single function for validation of both user supplied dicts and dataclasses. This PR also fixes the fact the scaling config wasn't validated in the GBDT Trainer and validates that allowed keys set in Trainers are present in the dataclass.
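
A hedged sketch of the "convert to a dataclass, then validate once" approach. The fields of `ScalingConfig` and the function `validate_scaling_config` are hypothetical and chosen only for illustration; they are not the definitions used in Ray AIR.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ScalingConfig:
    """Hypothetical fields, for illustration only."""

    num_workers: Optional[int] = None
    use_gpu: bool = False


def validate_scaling_config(config, allowed_keys):
    """Accept a dict or a ScalingConfig and return a validated ScalingConfig."""
    # Dicts are normalised first; unknown dict keys raise TypeError here.
    if isinstance(config, dict):
        config = ScalingConfig(**config)

    field_names = {f.name for f in fields(ScalingConfig)}

    # Keys a trainer declares as allowed must exist in the dataclass.
    unknown = set(allowed_keys) - field_names
    if unknown:
        raise ValueError(f"allowed keys not in ScalingConfig: {sorted(unknown)}")

    # Any field changed from its default must be in the allowed set.
    defaults = ScalingConfig()
    bad = sorted(
        name
        for name in field_names
        if getattr(config, name) != getattr(defaults, name)
        and name not in allowed_keys
    )
    if bad:
        raise ValueError(f"keys not allowed for this trainer: {bad}")
    return config
```

Because dicts are converted up front, user-supplied dicts and dataclass instances go through the same checks.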

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-13 18:43:24 +02:00
Kai Fricke
b574f75a8f
[tune/ci] Multinode support killing nodes in Ray client mode (#25709)
The multi node testing utility currently does not support controlling cluster state from within Ray tasks or actors; it requires Ray client. This makes it impossible to properly test e.g. fault tolerance, as the driver has to be executed on the client machine in order to control cluster state. However, this client machine is not part of the Ray cluster and can't schedule tasks on the local node - which is required by some utilities, e.g. checkpoint to driver syncing.

This PR introduces a remote control API for the multi node cluster utility that utilizes a Ray queue to communicate with an execution thread. That way we can instruct cluster commands from within the Ray cluster.
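
The queue-plus-execution-thread pattern can be sketched with the standard library. This is an illustration under stated assumptions: it uses `queue.Queue` in place of a Ray queue, and `ClusterController`, `send`, and the command names are hypothetical.

```python
import queue
import threading


class ClusterController:
    """Toy stand-in for the cluster utility; executes commands on its own thread."""

    def __init__(self):
        self.nodes = {"node-1", "node-2"}
        self._commands = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Execution thread: block on the queue and apply commands in order.
        while True:
            command, arg, done = self._commands.get()
            if command == "kill_node":
                self.nodes.discard(arg)
            done.set()  # signal the sender that the command was handled
            if command == "stop":
                return

    def send(self, command, arg=None, timeout=5.0):
        """Enqueue a command and wait until the execution thread has handled it."""
        done = threading.Event()
        self._commands.put((command, arg, done))
        return done.wait(timeout)


controller = ClusterController()
controller.send("kill_node", "node-2")
remaining = set(controller.nodes)
controller.send("stop")
```

Callers only ever touch the queue, so any code that can reach it (for example, a task running inside the cluster) can issue cluster commands.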
2022-06-13 18:17:12 +02:00
Amog Kamsetty
7a81d488e5
[Autoscaler] Update default AMIs to latest versions (#25684)
Closes #25588

NVIDIA recently pushed updates to the CUDA image removing support for end of life drivers. Therefore, the default AMIs that we previously had for OSS cluster launcher are not able to run the Ray GPU Docker images.

This PR updates the default AMIs to the latest Deep Learning versions. In general, we should periodically update these AMIs, especially when we add support for new CUDA versions.

I manually confirmed that the nightly Ray docker images work with the new AMI in us-west-2.
2022-06-13 17:00:43 +02:00
SangBin Cho
856bea31fb
[State Observability] Ray log CLI / API (#25481)
This PR implements the basic log APIs. Higher-level APIs (such as ray logs actors) will be implemented after the internal API review is done.

# If there's only 1 match, print a file content. Otherwise, print all files that match glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
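
The "one match prints the file, several matches print the list" rule can be sketched as below. `resolve_logs` and the file names are hypothetical; this models only the glob-matching decision, not the actual CLI or log API.

```python
import fnmatch


def resolve_logs(available, glob_filter="*"):
    """Return ("content", name) for a single match, else ("list", matches)."""
    matches = sorted(fnmatch.filter(available, glob_filter))
    if len(matches) == 1:
        return ("content", matches[0])
    return ("list", matches)


# Hypothetical log files on a node.
files = ["raylet.out", "raylet.err", "gcs_server.out"]

single = resolve_logs(files, "gcs*")
several = resolve_logs(files, "raylet*")
```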
2022-06-13 05:52:57 -07:00
Jiao
f8b0ab7e78
[Ray DAG] Add documentation in more options section (#25528) 2022-06-12 09:47:20 -07:00
Philipp Moritz
d8ec5929b6
Exclude Bazel build files from Ray wheels (#25679)
Including the Bazel build files in the wheel leads to problems if the Ray wheels are brought in as a dependency from another bazel workspace, since that workspace will not recurse into the directories of the wheel that contain BUILD files -- this can lead to dropped files.

This only happens for macOS wheels; on Linux wheels the BUILD files were already excluded.
2022-06-11 16:05:59 -07:00
Sven Mika
130b7eeaba
[RLlib] Trainer to Algorithm renaming. (#25539) 2022-06-11 15:10:39 +02:00
Yi Cheng
0c527b4502
[1/2][serve] Use GcsClient to replace the kv client to use timeout. (#25633)
Timeout is only introduced in GcsClient because Ray client does not define timeouts well for its API, and making it work end to end would take a lot of effort. Built-in components should use GcsClient directly.

This PR uses GcsClient in place of the old KV client to integrate GCS HA with Ray Serve.
2022-06-10 23:41:49 -07:00
Eric Liang
d36fd77548
[air] Allow fusing task and actor stages if they have compatible resource types (#25683) 2022-06-10 19:04:27 -07:00
Clark Zinzow
4fb92dd2f1
[Datasets] Fix __array__ protocol on TensorArrayElement and TensorArray. (#25647)
This PR fixes two issues with the __array__ protocol on the tensor extension:

1. The __array__ protocol on TensorArrayElement was missing the dtype parameter, causing np.asarray(tae, dtype=some_dtype) calls to fail. This PR adds support for the dtype argument.
2. TensorArray and TensorArrayElement didn't support NumPy's scalar casting semantics for single-element tensors. This PR adds support for these scalar casting semantics.
2022-06-10 16:42:16 -07:00
Richard Liaw
1dd714e0fa
[rfc][doc] Add clarity to stability guidelines (#25611) 2022-06-10 15:19:21 -07:00
Jiao
6b9b1f135b
[Deployment Graph] Move files out of pipeline folder (#25630) 2022-06-10 10:39:03 -07:00