Commit graph

6892 commits

Balaji Veeramani
c4898ed7df
[AIR] [Datasets] Add convert_pandas_to_tf_tensor (#25133)
Dataset.to_tf and TensorflowPredictor attempt to convert Pandas dataframes to NumPy arrays by calling DataFrame.values. However, DataFrame.values fails if the dataframe contains multidimensional arrays.

This PR solves this problem by introducing a function convert_pandas_to_tf_tensor. The implementation of the function is based on the implementation of convert_pandas_to_torch_tensor.
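For illustration (not code from this PR), a minimal sketch of the underlying issue and the stack-based workaround, with a made-up column name and shapes:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# A dataframe column holding multidimensional arrays.
df = pd.DataFrame({"image": [np.zeros((2, 3)), np.ones((2, 3))]})

# df.values yields an object-dtype array of arrays here, which TensorFlow
# cannot convert directly; stacking the column into one ndarray first works.
stacked = np.stack(df["image"].to_numpy())  # shape (2, 2, 3)
tensor = tf.convert_to_tensor(stacked, dtype=tf.float32)
print(tensor.shape)  # (2, 2, 3)
```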
2022-06-06 08:29:51 -07:00
Sebastián Ramírez
298742d724
♻️ Refactor type annotations for .remote() to avoid incorrect autocompletion and checks (#25480)
With the current type annotations for the `.remote()` method generated in decorated functions, editors understand that there are some keyword arguments `arg0`, `arg1`, etc., which are incorrect, as the actual function will probably have different names for its arguments.

For example, this shouldn't autocomplete `arg0`, `arg1`, etc:

<img width="407" alt="Screenshot 2022-06-04 at 06 13 46" src="https://user-images.githubusercontent.com/1326112/171996654-12248369-cf10-4fce-9ea2-5deb4ca8e2bd.png">

If anything, it should autocomplete `x` and `y` (although that's currently [not perfectly doable](https://github.com/python/typing/discussions/1163)).

Updating the type annotations to use [arguments prefixed with double underscores](https://mypy.readthedocs.io/en/stable/protocols.html?highlight=double%20underscore#callback-protocols) at least tells tooling not to provide autocompletion for those args (which would be incorrect), while still providing inline errors for invalid types.
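As a rough illustration of the double-underscore convention (a hand-written sketch, not the actual generated annotations): `__`-prefixed parameters in a callback protocol are treated as positional-only, so tooling won't offer them as keyword completions while argument types are still checked.

```python
from typing import Protocol


class RemoteCallable(Protocol):
    # `__`-prefixed parameters are positional-only for type checkers, so editors
    # won't autocomplete `__arg0=` / `__arg1=`, but wrong argument types are
    # still reported inline.
    def __call__(self, __arg0: int, __arg1: str) -> object: ...


def submit(fn: RemoteCallable) -> None:
    fn(1, "x")  # OK
    fn("x", 1)  # flagged by the type checker: invalid argument types
```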

<img width="880" alt="Screenshot 2022-06-04 at 06 20 26" src="https://user-images.githubusercontent.com/1326112/171996806-560c0fa8-0ee3-477c-9906-71e880c84e56.png">
2022-06-05 16:21:53 -07:00
Eric Liang
48acbf0d69
[hotfix] Revert "[runtime env] runtime env inheritance refactor (#24538)" (#25487)
This reverts commit eb2692c.

This is a temporary mitigation for #25484
2022-06-05 14:55:38 -07:00
Sebastián Ramírez
6e1248fb37
🚚 Move worker types to the module to improve static analysis (#25439)
Currently, there are separate type annotations in `worker.pyi` that include the types for `func.remote()`, but they don't include types for the other things declared in `worker.py`. Because of that, editors can end up showing support only for the things in the `worker.pyi` file.

For example:

<img width="349" alt="Screenshot 2022-06-03 at 06 01 36" src="https://user-images.githubusercontent.com/1326112/171841977-ec7a0b9a-b4a5-4422-acd9-b73c1e263261.png">

After this change, the editor and other tools will be able to provide support for other things defined in the same file:

<img width="760" alt="Screenshot 2022-06-03 at 06 04 24" src="https://user-images.githubusercontent.com/1326112/171842204-1915dd2a-6cc6-41b7-8785-5124beec37e8.png">

And the typing support for `func.remote()` keeps working as before:

<img width="760" alt="Screenshot 2022-06-03 at 06 07 15" src="https://user-images.githubusercontent.com/1326112/171842528-f318753e-9f47-4236-b0a4-d86d00c0bb11.png">

This is the approach recommended by Pyright/Pylance/VS Code. I also recommend it, as it's a lot easier to maintain types in the same file while editing than to remember to go to a separate, independent file to add them. Also, to get proper support when using an external `.pyi` file, *all* the things declared in `worker.py` would have to be declared in the `worker.pyi` file.

Ref: https://github.com/microsoft/pyright/blob/main/docs/typed-libraries.md#inlined-type-annotations-and-type-stubs
2022-06-05 14:01:24 -07:00
matthewdeng
7dafb2e278
[air] remove invalid wandb symlink (#25488) 2022-06-04 22:17:08 -07:00
SangBin Cho
00e3fd75f3
[State Observability] Ray log alpha API (#24964)
This is the PR to implement the server side of `ray log`. The PR is continued from #24068.

The PR supports two endpoints:

/api/v0/logs  # List the logs of a node, filtered by the given glob.
/api/v0/logs/{file | stream}?filename&pid&actor_id&task_id&interval&lines  # Stream the requested log file. The filename can be inferred from pid/actor_id/task_id.
Some tests need to be rewritten; I will do that soon.
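As a hedged sketch only: hitting these endpoints with `requests`, where the dashboard address and the exact query-parameter names are assumptions based on the description above.

```python
import requests

BASE = "http://127.0.0.1:8265"  # assumed Ray API server / dashboard address

# List log files of a node, filtered by a glob (parameter names are illustrative).
resp = requests.get(f"{BASE}/api/v0/logs", params={"node_id": "<node-id>", "glob": "*.log"})
print(resp.text)

# Stream the tail of a specific file; per the signature above, the filename could
# also be inferred from pid / actor_id / task_id.
resp = requests.get(
    f"{BASE}/api/v0/logs/file",
    params={"node_id": "<node-id>", "filename": "raylet.out", "lines": 100},
)
print(resp.text)
```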

As a follow-up to this PR, there will be two PRs:

A PR to add the actual CLI.
A PR to remove in-memory cached logs and query actor/worker logs on demand.
2022-06-04 05:10:23 -07:00
Yi Cheng
47c4f6f094
[flakey] Fix test_modin.py (#25469)
test_modin.py is flaky right now. It complains that some modules can't be imported. This seems like an init issue where client mode and non-client mode are mixed. This test now closes the cluster for each run. It slows the test down a little, but it's more stable.
2022-06-04 08:34:37 +00:00
Sven Mika
b5bc2b93c3
[RLlib] Move all remaining algos into algorithms directory. (#25366) 2022-06-04 07:35:24 +02:00
SangBin Cho
54496d7705
[State Observability API] Support Filtering (#25281)
This PR adds filtering support. The filtering is done on the API server side (not on the source side). Source-side filtering is complicated to implement elegantly, and we will handle it in the future (no optimization for alpha APIs).

We will also support a limited set of columns for each API.

The API is as follows:

ray list [resources] --filter [key] [value] => filter data where key == value.
In the future, we can also support more complicated filtering like !=, AND, OR, etc.
2022-06-03 17:17:30 -07:00
Eric Liang
1f509ab331
[air] Add DatasetParallelTrainer.dataset_config for configuring dataset ingest (#25337)
This adds a per-dataset config object to DataParallelTrainer. These configs define how the Dataset should be read into the DataParallelTrainer. It configures the preprocessing, splitting, and ingest strategy per dataset. DataParallelTrainers declare default DatasetConfigs for each dataset passed in the ``datasets`` argument. Users have the opportunity to selectively override these configs by passing the ``dataset_config`` argument. Trainers can also define which values are user-customizable (e.g., XGBoostTrainer doesn't support streaming ingest).

This PR adds the minimal support for dataset configs. Future PRs will:
- Add support for streaming ingest
- Move this config from DataParallelTrainer to ml.Trainer
2022-06-03 16:32:53 -07:00
Eric Liang
22aaf47fda
[tune] Better error message for Tune nested tasks / actors (#25241)
This PR uses a task/actor launch hook to generate better error messages for nested Tune tasks/actors in the case where there are no extra resources reserved for them. The idea is that the Tune trial runner actor can set a hook prior to executing the user code. If the user code launches a task and the placement group for the trial cannot possibly fit the task, then we raise TuneError right away to warn the user.
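The usual way to reserve extra resources for nested work is via extra placement group bundles; a sketch (the trainable and bundle sizes here are placeholders):

```python
from ray import tune


def trainable(config):
    # User code here may launch nested Ray tasks/actors; the second bundle below
    # reserves capacity for them inside the trial's placement group.
    ...


tune.run(
    trainable,
    resources_per_trial=tune.PlacementGroupFactory(
        [{"CPU": 1}, {"CPU": 1}]  # first bundle: the trial itself; second: nested work
    ),
)
```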
2022-06-03 14:53:40 -07:00
Sihan Wang
03ed27b9c1
[Serve] Fix the test_serve_start_different_http_checkpoint_options_warning flaky (#25452) 2022-06-03 14:45:00 -07:00
Kai Fricke
4b9a89ad90
[air] Move python/ray/ml to python/ray/air (#25449)
The package "ml" should be renamed to "air".

Main question: should we keep an `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go with no, to force people to use the new structure.
2022-06-03 21:53:44 +01:00
Yi Cheng
6b38b071e9
Revert "Revert "[core] Remove gcs addr updater in core worker. (#24747)" (#25375)" (#25391)
This reverts commit 49efcab4fe.
2022-06-03 12:26:27 -07:00
Kai Fricke
7186cd8b79
[tune] Remove various deprecated code paths (deprecation cycle) (#25407)
This PR removes various deprecated code paths in Ray Tune that raised errors on usage before.
2022-06-03 15:01:40 +01:00
Kai Fricke
2e058380d7
[tune] Remove TrialExecutor base class (#25404)
The TrialExecutor base class was a stub and has been deprecated long ago; direct inheritance was disabled. This PR removes the base class and moves the remaining functionality into the RayTrialExecutor.
2022-06-03 10:16:47 +01:00
Kai Fricke
f0fa8e54f8
[tune] Remove DurableTrainable class (#25405)
The DurableTrainable is deprecated (every trainable is a durable trainable). This PR removes it from the Tune library and a related example.
2022-06-03 10:16:02 +01:00
Antoni Baum
84a9df9448
[AIR/Tune] Add TempFileLock (#25408)
Adds a `TempFileLock` class that stores lockfiles inside a temporary directory.
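A minimal sketch of what such a class could look like on top of the `filelock` package; the real implementation may differ (e.g. in how the lockfile is named):

```python
import os
import tempfile

from filelock import FileLock


class TempFileLock(FileLock):
    """File lock whose lockfile lives in the system temp directory (sketch)."""

    def __init__(self, path: str, **kwargs):
        # Place the lockfile under the temp dir instead of next to `path`, so
        # read-only or remote-mounted directories can still be locked.
        lock_name = os.path.basename(path) + ".lock"
        super().__init__(os.path.join(tempfile.gettempdir(), lock_name), **kwargs)


# Usage: with TempFileLock("/path/to/checkpoint_dir"): ...
```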
2022-06-03 10:12:53 +01:00
Yi Cheng
60587cf1dc
[flakey] Deflakey test_ray_shutdown.py (#25422)
The main issue with this test is that the worker tries to connect to the raylet, but the raylet has exited, and in this case it hangs there. This happens before the periodic check runs, so the worker won't exit either.

This fix moves the hanging part to after the periodic check starts.

Another issue is the pubsub timeout. The default is 60s, and we need to lower it so the test can pass within 60s.
2022-06-02 23:00:33 -07:00
Yi Cheng
fd0f967d2e
Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir and rename policy and trainer classes. (#25346)" (#25420)
This reverts commit e4ceae19ef.

Reverts #25346

linux://python/ray/tests:test_client_library_integration never failed before this PR.

In the CI of the reverted PR, it also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128), so it's highly likely because of that PR.

The test output failure seems related as well (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b).
2022-06-02 20:38:44 -07:00
SangBin Cho
ba90838b66
[Log monitor] Add unit tests + fix flaky test_logging (#25294)
Looks like test_logging fails when the syncer is enabled. However, I found the test was badly written, and the failure might be a side effect of the syncer (I am not sure why; maybe the syncer slows down ray.init()?):

ray/python/ray/tests/test_logging.py, line 228 at f75ede1: `def test_log_monitor_backpressure(ray_start_cluster, monkeypatch):`

Anyway, it seems like the test will fail if there's a delay after the log monitor is started.
Testing this is not trivial. Instead, I made log_monitor unit-testable and added full unit tests.

This also adds a better exception message to another flaky test, test_log_rotation. I need more data before actually fixing that issue.
2022-06-02 19:15:57 -07:00
Siyuan (Ryans) Zhuang
b5e71fde23
[workflow] Remove workflow virtual actor (#25394)
* remove workflow virtual actor
2022-06-02 18:17:25 -07:00
Amog Kamsetty
c8b112ec46
[Train] Support amp for models with a custom __getstate__ method (#25335)
The current implementation of amp does not work if the model being wrapped defines a custom __getstate__ method. It would fail at an assertion, as reported here: https://discuss.ray.io/t/ray-train-hangs-for-long-time/6333/7.

This PR fixes amp for this case, and adds tests for it.
2022-06-02 18:13:13 -07:00
Antoni Baum
f8551942bf
[AIR] Fix trainer allowed scaling config keys (#25350)
Adds `resources_per_worker` to allowed scaling config keys in `DataParallelTrainer` and `GBDTTrainer`.
2022-06-02 11:20:37 -07:00
shrekris-anyscale
16bdfe6a39
Restore "[Serve] Deploy Serve deployment graphs via REST API" (#25073) (#25333) 2022-06-02 11:06:53 -07:00
Stephanie Wang
ab8785ca5c
Revert "Revert "[core] Support generators for tasks with multiple return values (#25247)" (#25380)" (#25383)
Duplicate of #25247.

Adds a fix for Dask-on-Ray. Previously, for tasks with multiple return values, we implicitly allowed returning a dict with the return index as the key. This was used by Dask-on-Ray, but this is not documented behavior, and we now require task returns to be iterable instead.
2022-06-02 10:50:11 -07:00
Sihan Wang
b024a9543e
[Serve] Support scale replica down to 0 (#24892) 2022-06-02 08:06:46 -07:00
Sven Mika
e4ceae19ef
[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir and rename policy and trainer classes. (#25346) 2022-06-02 16:47:05 +02:00
Antoni Baum
045c47f172
[CI] Check test files for if __name__... snippet (#25322)
Bazel operates by simply running the Python scripts given to it in `py_test`. If the script doesn't invoke pytest on itself in the `if __name__ == "__main__"` snippet, no tests will be run and the script will pass. This has led to several tests (indeed, some are fixed in this PR) that, despite having been written, have never run in CI. This PR adds a lint check that checks all `py_test` sources for the presence of the `if __name__ == "__main__"` snippet, and CI will fail if any are detected without it. This system is only enabled for libraries right now (tune, train, air, rllib), but it could be trivially extended to other modules if approved.
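The snippet in question is the standard pattern at the bottom of these test files, along these lines:

```python
if __name__ == "__main__":
    import sys

    import pytest

    # Without this, Bazel would just import the module and exit 0 without
    # ever running any tests.
    sys.exit(pytest.main(["-v", __file__]))
```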
2022-06-02 10:30:00 +01:00
Qing Wang
64f9a9066f
[doc] Update document on ray start command. (#25306) 2022-06-02 16:42:24 +08:00
Yi Cheng
cb1f08a3c1
[core] Basic end-2-end multi-node tests for GCS HA in CI. (#25114)
In this PR we simulate the case where Serve can continue to function even when the GCS is down, and reconfiguration continues to work once the GCS is back.

To make it close to the real-world case, Docker is used for isolation:

It starts a head node (0 CPUs) and a worker node.
It tries the basic functionality and makes sure it's working.
It kills the GCS and makes sure everything keeps working.
It restarts the GCS and makes sure reconfiguration continues to work.
These are the basic cases for Serve HA. We'll add more once we get better integrations.
2022-06-02 02:41:38 +00:00
Dmitri Gekhtman
e45054c130
[autoscaler][kuberay] Fix autoscaler event driver logs. Clean up entrypoint. (#25240)
This PR:
- enables piping of autoscaler events to the driver's stdout with KubeRay
- cleans up the autoscaler's startup sequence
- removes some Redis references
2022-06-01 20:36:47 -04:00
Antoni Baum
70007c004e
[AIR] MultiHotEncoder and list support for encoders (#25319) 2022-06-01 17:34:41 -07:00
Yi Cheng
80168a09a6
Revert "[core] Support generators for tasks with multiple return values (#25247)" (#25380)
This reverts commit 1f9488724a.
2022-06-01 15:31:59 -07:00
SangBin Cho
49efcab4fe
Revert "[core] Remove gcs addr updater in core worker. (#24747)" (#25375)
Turns out https://github.com/ray-project/ray/pull/25342 wasn't the root cause of the ray shutdown flakiness. I realized there's another PR that could affect this test suite. Let's try reverting it and see if things get better.
2022-06-01 15:12:33 -07:00
Stephanie Wang
961b875ab8
[core] Allow user to override global default for max_retries (#25189)
This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS environment variable, RAY_task_max_retries, which can be passed to the driver or set via runtime envs. Any future tasks submitted by a worker with this variable set will default to this value instead of the hard-coded default of 3.

It would be nicer if we had a standard way of setting these defaults, but I think this is fine as a one-off for now (there is no clear need to override the defaults of other @ray.remote options yet).
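A hedged sketch of using the override; setting it in the driver's environment before ray.init() (or for workers via a runtime env) should make subsequently submitted tasks pick it up:

```python
import os

# Must be set before the driver/worker process initializes Ray. It can also be
# set for workers via a runtime env, e.g.
# ray.init(runtime_env={"env_vars": {"RAY_task_max_retries": "5"}}).
os.environ["RAY_task_max_retries"] = "5"

import ray

ray.init()


@ray.remote  # no explicit max_retries: defaults to 5 instead of the hard-coded 3
def flaky_task():
    return "ok"


print(ray.get(flaky_task.remote()))
```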
Closes #24854.
2022-06-01 14:42:18 -07:00
Stephanie Wang
1f9488724a
[core] Support generators for tasks with multiple return values (#25247)
Adds support for Python generators instead of just normal return functions when a task has multiple return values. This will allow developers to cut down on total memory usage for tasks, as they can free previous return values before allocating the next one on the heap.

The semantics for num_returns are about the same as for usual tasks: the function will throw an error if the number of values returned by the generator does not match the number of return values specified by the user. The one difference is that if num_returns=1, the task will throw the usual Python exception that the generator cannot be pickled.

As an example, this feature will allow us to reduce memory usage in Datasets shuffle operations (see #25200 for a prototype).
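A small example of the described semantics (three return values produced lazily by a generator):

```python
import ray


@ray.remote(num_returns=3)
def chunked():
    # Each yielded value becomes one of the task's three return objects, so
    # earlier values can be freed before the next one is allocated.
    for i in range(3):
        yield i


a, b, c = chunked.remote()
print(ray.get([a, b, c]))  # [0, 1, 2]
```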
2022-06-01 13:30:52 -07:00
Antoni Baum
9085ea23ab
[AIR] Improve BatchPredictor performance & disk usage (#25101)
This PR attempts to improve `BatchPredictor` performance with directory checkpoints by avoiding unnecessary filesystem operations.

In order to achieve that, the `Checkpoint` class is changed to always use a canonical path for the temporary directory if the Checkpoint has been created from an object ref. The directory is file-locked to prevent concurrent writes.

Tests have been added.

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-06-01 21:45:39 +02:00
Eric Liang
905258dbc1
Clean up docstyle in python modules and add LINT rule (#25272) 2022-06-01 11:27:54 -07:00
Jiao
97190e4574
[Deployment Graph] Remove _execute_impl and json serde code for DeploymentNode IR (#25331) 2022-06-01 11:26:56 -07:00
Eric Liang
517f78e2b8
[minor] Add a job submission hook by env var (#25343) 2022-06-01 11:15:43 -07:00
SangBin Cho
ca75570f51
Revert "Revert "Revert "[dataset] Use polars for sorting (#24523)" (#24781)" (#25173)" (#25341)
This reverts commit 61676f26d3.
2022-06-01 10:49:12 -07:00
Chen Shen
49b8bbfd5e
[Core] Fix node affinity strategy when resource is empty (#25344)
Why are these changes needed?
Today, the Ray scheduler always picks a random node if the resource requirement is empty, regardless of the scheduling policy/strategy.

However, for the node affinity scheduling policy, we should not pick a random node but should stick to the node affinity constraints.
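A sketch of the case being fixed: a zero-resource task with a hard node-affinity constraint should land on the specified node rather than a random one.

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()
target_node = ray.nodes()[0]["NodeID"]  # hex id of a node in the cluster


@ray.remote(num_cpus=0)  # empty resource requirement
def pinned():
    return "ran on the pinned node"


# With soft=False the task must be scheduled on `target_node`, even though it
# requests no resources (previously it could land on a random node).
ref = pinned.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=target_node, soft=False)
).remote()
print(ray.get(ref))
```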
2022-06-01 10:38:48 -07:00
siavash119
21f1e8a5c6
[Core] Use newly pushed actor for existing pending tasks (#24980)
Newly pushed actors are never used for existing pending submits, so the new worker will not be used to speed up existing tasks. If _return_actor is called at the end of push instead, the actor is pushed to _idle_actors and immediately used if there are pending submits.
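This concerns `ray.util.ActorPool`; a sketch of the scenario, with pending submits already queued when a new actor is pushed:

```python
import ray
from ray.util import ActorPool


@ray.remote
class Doubler:
    def double(self, x):
        return 2 * x


pool = ActorPool([Doubler.remote()])

# Queue more work than the single actor can handle at once.
for i in range(10):
    pool.submit(lambda actor, value: actor.double.remote(value), i)

# With this change, the newly pushed actor immediately starts draining the
# pending submits instead of sitting idle until the next submit.
pool.push(Doubler.remote())

print([pool.get_next() for _ in range(10)])
```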
2022-06-01 07:51:02 -07:00
SangBin Cho
44483a6c99
[Test][Windows] Skip test metrics.py in Windows (#25287)
Skip the flaky test_metrics on Windows
2022-06-01 05:37:29 -07:00
valtab
288a81b42e
[Train]fix train callback nested recusive calling issue (#25015)
Move the initialization of the `callback.results_preprocessor` property to the `callback.start_training()` method, which is only called once when training starts; currently, initialization is triggered per message.
2022-05-31 20:09:01 -07:00
Eric Liang
acf0da63b6
[data] [API] Remove unnecessary public argument in fully_executed() (#25267) 2022-05-31 16:48:35 -07:00
Eric Liang
5545bc5f45
[data] Fix pipeline pre-repeat caching, and improve the documentation (#25265)
Currently, the canonical way to cache a pipeline and repeat it, ds.fully_executed().repeat(), crashes. This adds a test and fixes the docs and stats printing.
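The pattern in question, roughly (the dataset here is a placeholder):

```python
import ray

ds = ray.data.range(1000)

# Cache the fully executed dataset once, then repeat the cached data for two
# epochs instead of re-executing the read on every repeat.
pipe = ds.fully_executed().repeat(2)

for batch in pipe.iter_batches():
    pass  # consume the pipeline
```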
2022-05-31 16:01:00 -07:00
shrekris-anyscale
7754645c83
Revert "[Serve] Deploy Serve deployment graphs via REST API (#25073)" (#25330)
This reverts commit 47709b3300.
2022-05-31 15:37:55 -07:00
shrekris-anyscale
47709b3300
[Serve] Deploy Serve deployment graphs via REST API (#25073) 2022-05-31 10:57:08 -07:00