Commit graph

6159 commits

Author SHA1 Message Date
Eric Liang
35a157948e
Lay the groundwork for lazy dataset optimization (no behavior changes) (#22233)
This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.
2022-02-14 15:03:58 -08:00
Jialing He
192f9de421
[runtime env] Introduce async Manager.create (#22311) 2022-02-14 16:26:47 -06:00
Matti Picus
845861fdc1
[runtime env] use pytest tmp_path, os.path.sep, and unskip most tests for windows (#22342) 2022-02-14 16:04:10 -06:00
Archit Kulkarni
0e350c0074
[runtime env] [Doc] Add two ways of installing dependencies: cluster launcher, and runtime env (#20780)
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments. 

This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-02-14 16:03:48 -06:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
2022-02-14 12:56:17 -08:00
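The `collections.abc.Mapping` point in #22305 can be illustrated with a minimal sketch. `DictBackedRow` below is a hypothetical stand-in, not Ray's actual `TableRow`; it only shows that implementing `__getitem__`, `__iter__`, and `__len__` is enough to inherit the rest of the read-only mapping interface.

```python
from collections.abc import Mapping


class DictBackedRow(Mapping):
    """Hypothetical read-only row view; not Ray's actual TableRow."""

    def __init__(self, columns, values):
        self._data = dict(zip(columns, values))

    # The three abstract methods that Mapping requires.
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


row = DictBackedRow(["a", "b"], [1, 2])
# keys(), values(), items(), get(), and `in` are mixins inherited from Mapping.
print(list(row.items()), row.get("c", 0), "a" in row)
```

Because the class is a `Mapping`, code that expects one (for example `dict(row)`) accepts it without conversion helpers.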
dependabot[bot]
767b349b99
[data](deps): Bump dask[complete] (#22334)
Bumps [dask[complete]](https://github.com/dask/dask) from 2022.1.0 to 2022.2.0.
- [Release notes](https://github.com/dask/dask/releases)
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](https://github.com/dask/dask/compare/2022.01.0...2022.02.0)

---
updated-dependencies:
- dependency-name: dask[complete]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-14 12:44:20 -08:00
Edward Oakes
610930ae6a
[serve] Improve health check failure semantics (#22297) 2022-02-14 14:04:03 -06:00
Clark Zinzow
443416907e
[Datasets] Fix boolean tensor column representation and slicing. (#22323)
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
2022-02-14 10:36:35 -08:00
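The representation mismatch described in #22323 can be sketched in plain Python. This is an illustration only, not the code from the PR: `pack_bits` and `unpack_bits` are hypothetical helpers that convert between a byte-per-boolean layout and the bit-packed, least-significant-bit-first layout that Arrow uses.

```python
def pack_bits(flags):
    """Pack booleans into bytes, 8 per byte, least-significant bit first."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, length):
    """Recover `length` booleans from an LSB-first bit-packed buffer."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(length)]


flags = [True, False, True, True, False, False, False, False, True]
byte_per_bool = bytes(flags)  # 9 bytes, one per boolean
bit_packed = pack_bits(flags)  # 2 bytes, hex 0d01

# Reading the byte-per-boolean buffer as if it were bit-packed gives wrong values.
misread = unpack_bits(byte_per_bool, len(flags))
print(bit_packed.hex(), unpack_bits(bit_packed, len(flags)) == flags, misread == flags)
```

The last comparison is the failure mode the commit message describes: the buffer is valid, but it is interpreted under the wrong layout.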
Max Pumperla
d594b668bb
[docs] [tune] hyperopt notebook (#22315) 2022-02-12 02:46:03 -08:00
Eric Liang
85d6946c95
Split test_dataset.py into two (#22303) 2022-02-12 00:21:25 -08:00
Amog Kamsetty
4cbbc81f4c
[Train] Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path (#22306)
Closes #22226
2022-02-11 22:29:37 -08:00
Kaushik B
8515fdd6db
[tune] Update Lightning examples to support PTL 1.5 (#20562)
To help resolve the issues users are facing when running the Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-02-11 17:45:06 -08:00
Amog Kamsetty
e8e35169c6
[Train] Allow train methods to be called outside of the session (#21969)
Updates to address @worldveil's feedback:

- Include `import train.torch` in the docs.
- Allow methods in session.py to be called outside of the session with sensible defaults. These will no longer raise an error.

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
2022-02-11 17:42:55 -08:00
jialin
851b853352
add optional empty lines filter in read_text (#22298)
`ray.data.read_text()` currently doesn't handle empty lines. This PR adds a flag that enables an empty-line filter.
With this change, `read_text` returns only non-empty lines by default, unless `drop_empty_line` is set to False.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>
2022-02-11 14:49:45 -08:00
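The filtering behavior described in #22298 can be sketched without Ray. `read_lines` is a hypothetical helper, not the `ray.data.read_text()` implementation, and it assumes that "empty" means a zero-length line; whether whitespace-only lines count is not stated in the commit message.

```python
def read_lines(text, drop_empty_line=True):
    """Split text into lines, optionally dropping zero-length lines."""
    lines = text.splitlines()
    if drop_empty_line:
        lines = [line for line in lines if line != ""]
    return lines


sample = "first\n\nsecond\n"
print(read_lines(sample))
print(read_lines(sample, drop_empty_line=False))
```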
Edward Oakes
49b3e6c53c
[serve] Support user-provided health check via def check_health(self) method (#22178) 2022-02-11 12:53:37 -06:00
matthewdeng
2c204a755b
[train] add minimal installation test suite (#22300)
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.

**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
2022-02-11 10:09:00 -08:00
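One way to check for the kind of regression #22300 guards against is to import a module in a fresh interpreter and look for unwanted entries in `sys.modules`. The sketch below is an assumption about how such a check could look, not the test suite added by the PR; `leaked_imports` is a hypothetical helper.

```python
import subprocess
import sys


def leaked_imports(module, forbidden):
    """Import `module` in a fresh interpreter; return the forbidden names it loaded."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"print(','.join(n for n in {list(forbidden)!r} if n in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    return [name for name in result.stdout.strip().split(",") if name]


# Demonstrated on a stdlib module; a real suite would target the package under test.
print(leaked_imports("json", ["torch", "tensorflow", "horovod"]))
```

Running the import in a subprocess matters: in the test process itself, another test may already have imported a backend, which would make the check unreliable.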
Archit Kulkarni
1c0c2aaba2
[runtime env] Add test for scheduling task after failed job env (#22224)
Adds a test to make sure a failed job runtime env creation doesn't hang the cluster (i.e. tasks can still be scheduled on the job, as long as the tasks' runtime env can be created). Test requested by @rkooo567, good idea!
2022-02-11 11:01:16 -06:00
Chen Shen
bb6cb0898b
[Dataset] avoid pyarrow 7.0.0 for dataset (#22253) 2022-02-11 00:32:47 -08:00
Eric Liang
02add259ca
Add more details to the internal error for "worker cannot find registered function" (#22302)
This adds some more debug information for this internal error that shouldn't happen.
2022-02-10 23:20:17 -08:00
Edward Oakes
dd097b7a9b
[serve] Fix HTTP proxy controller namespace bug (#22287)
Closes https://github.com/ray-project/ray/issues/22265

This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
2022-02-10 21:05:35 -06:00
Clark Zinzow
13c8e10b3b
[Datasets] Unrevert NaN handling. (#22291)
Reverts #22258, unreverting #20787. 

The fix is in the ["Fix tests" commit](b559da2407), where we switch to using the test utility DataFrame equality comparison, which properly handles NaN comparisons. The underlying cause of this test break is explained [here](https://github.com/ray-project/ray/pull/22258#issuecomment-1035404700).
2022-02-10 16:19:53 -08:00
Amog Kamsetty
09e46066eb
[Train] Fix accuracy calculation for CIFAR example (#22292)
Same as #21689 except for cifar
2022-02-10 15:06:31 -08:00
Archit Kulkarni
94f73de23c
Revert "[Serve] [Windows] Unskip all but test_redeploy_single_replica in test_deploy.py (#21391)" (#22299)
This reverts commit 000c56f764.
2022-02-10 14:49:34 -08:00
Edward Oakes
48adb6f7bb
[serve] Introduce DeploymentStatus, poll for statuses instead of using async goals (#22121) 2022-02-10 12:33:04 -08:00
mwtian
9bc6f13515
[Autoscaler] make --redis-address not required (#22083)
`--redis-address` should not be required, since starting autoscaler with `--gcs-address` is supported too.
2022-02-10 11:20:31 -08:00
Liu Bao
824453dd17
[runtime env] Create virtualenv for pip runtime env. (#21801) 2022-02-10 12:25:18 -06:00
mwtian
c9fed9dec2
Revert "[Client] avoid locking in async send" (#22283)
Reverts ray-project/ray#22193, which makes `windows://python/ray/serve:test_ray_client` very flaky (timeouts).
2022-02-10 10:17:07 -08:00
shrekris-anyscale
cc9018c29a
Obtain deployment definitions via import (#22272)
Currently, Serve deployments must store their class or function definitions in the `Deployment` object's `func_or_class` attribute. However, the declarative API must be able to initiate deployments using only their import path. This allows users to separately define their functions or classes, and pull these functions and classes into their clusters via [remote URIs](https://docs.ray.io/en/releases-1.9.2/handling-dependencies.html#remote-uris). With this change, `Deployment` objects can store an import path string as their `func_or_class`. This import path is then used to import the deployment's code definition when the `Deployment`'s replica is created.
2022-02-10 10:20:45 -06:00
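Resolving an import path string to a class or function, as #22272 describes, can be sketched with `importlib`. `import_attr` is a hypothetical helper written for illustration, and the `module.attr` path format is an assumption; the commit message does not show Serve's actual resolution code.

```python
import importlib


def import_attr(import_path):
    """Resolve a 'package.module.attr' string to the object it names."""
    module_name, _, attr_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated on the standard library; a deployment would name user code instead.
ordered_dict_cls = import_attr("collections.OrderedDict")
print(ordered_dict_cls)
```

Deferring the import until the replica is created means the definition only has to be importable where the replica runs, not where the deployment is declared.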
mwtian
2cee219250
[Core] avoid warning when receiving too much logs from a different job (#22102)
When logs are not intended for the current driver, skip logging the warning about too many logs being generated, and clear the counters for log rates.

Ideally the log subscriber should only subscribe to logs from the current job, and system logs. But the change has risk and we can do it in another PR.
2022-02-10 15:17:26 +09:00
Balaji Veeramani
abad268549
Comment fmt: off annotations (#21984)
Code formatting is disabled in several modules with the explanation
> [The module] ignores yapf because yapf doesn't allow comments right after code blocks,
but we put comments right after code blocks to prevent large white spaces
in the documentation.

Since we no longer use YAPF, it may be possible to re-enable code formatting on 
these modules. I've added "FIXME" comments requesting developers to check
whether code formatter appeasements are still necessary.
2022-02-09 22:12:11 -08:00
Jiajun Yao
07a1ba8e34
Update local object store usage (#22157)
* Update local object store usage

* fix

* test
2022-02-09 22:08:25 -08:00
Dmitri Gekhtman
f51566e622
Prep K8s operator for the Ray 1.11.0 release. (#22264)
For consistency and safety, we fix an explicit 6379 port for all default and example configs for Ray on K8s.
Documentation is updated to recommend matching Ray versions in operator and Ray cluster.
2022-02-09 18:59:50 -08:00
Stephanie Wang
495eb14179
[core] Recover spilled objects that are lost during node failure (#21485)
* Failing test

* trigger recovery from ref counter

* x

* update

* lint

* stress test

* update

* format

* x
2022-02-09 18:22:16 -08:00
Jiajun Yao
d295a9d545
Revert "[Datasets] Support ignoring NaNs in aggregations. (#20787)" (#22258)
This reverts commit f264cf800a.
2022-02-09 16:25:19 -08:00
SangBin Cho
30000ff8ae
Fix a bug from many drivers. (#22248)
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded with ASCII. Using UTF-8 seems to solve the problem.
2022-02-09 15:17:15 -08:00
SangBin Cho
e5cab878b8
[Core] Disable runtime env logs (#22198)
Disable runtime env logs streamed to the driver by default and improve the documentation.
2022-02-09 14:43:25 -08:00
Alex Wu
35028182f0
[Autoscaler] No infeasible warning for placement groups (#22235)
Ensures that we don't log a warning message about an infeasible resource demand when that custom resource is a placement group (since the placement group resource likely just needs to be created by the raylet).

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-09 12:47:36 -08:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Jiao
293e45c527
[Doc] [Serve] Fix README's quick_start and add to test suite (#22228) 2022-02-09 11:49:47 -08:00
Jiao
54a71e6c4f
[Ray DAG] Add execute() interface to take user inputs with ENTRY_POINT tag. (#22196)
## Diff Summary

The current implementation of DAGNode pre-binds inputs, and the signature of `def execute(self)` doesn't take user input yet. This PR extends the interface to take user input and marks DAG entrypoint methods as the first stop for all user requests in a DAG. It's needed to unblock the next step of the serve pipeline implementation, which serves user requests.

Closes #22196 #22197

Notable changes:
- Added a `DAG_ENTRY_POINT` flag in ray dag API to annotate DAG entrypoint functions. Function or class method only. All marked functions will receive identical input from user as first layer of DAG. 
- Changed implementations of ClassNode and FunctionNode accordingly to handle different execution for a node marked as entrypoint or not.
- Added a `kwargs_to_resolve` kwarg in the interface of `DAGNode` to handle args that subclasses need in order to resolve their implementation without exposing all terms at the parent class level.
  - This is particularly important for ClassMethodNode binding, so we can have implementations to track method name, parent ClassNode as well as previous class method call without existing 
  - Changed implementation of `_copy()` to handle execution of `kwargs_to_resolve`.
  - Changed implementation of `_apply_and_replace_all_child_nodes()` to fetch DAGNode type in `kwargs_to_resolve`.
- Added pretty printed lines for `kwargs_to_resolve`
2022-02-09 13:29:28 -06:00
Jiajun Yao
673ecd1241
Isolate ray configs for each job (#22206)
If we run multiple jobs in the same process (this is basically the behavior of python tests), they should be isolated in the sense that system config for job 1 shouldn't affect config for job 2.
```python
import ray

ray.init(_system_config={})
# job 1
ray.shutdown()

ray.init(_system_config={})
# job 2
ray.shutdown()
```

Currently it's not the case, since RayConfig is a static variable and it's shared across drivers in the same process. This PR resets the configs to default value before applying job specific _system_config.

Note: this is a backward-incompatible change if users depend on the current behavior, but I'm not aware of any such case.
2022-02-09 10:18:46 -08:00
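The reset-then-apply approach from #22206 can be sketched with a class-level config. `Config`, its keys, and its default values are hypothetical stand-ins for the static `RayConfig`, chosen only to show why resetting before applying overrides isolates consecutive jobs in one process.

```python
class Config:
    """Hypothetical process-wide config, standing in for the static RayConfig."""

    _DEFAULTS = {"max_retries": 3, "timeout_s": 30}
    _values = dict(_DEFAULTS)

    @classmethod
    def initialize(cls, overrides):
        # Reset to defaults first so a previous job's overrides cannot leak.
        cls._values = dict(cls._DEFAULTS)
        cls._values.update(overrides)

    @classmethod
    def get(cls, key):
        return cls._values[key]


Config.initialize({"timeout_s": 5})  # job 1
job1_timeout = Config.get("timeout_s")
Config.initialize({})  # job 2 sees defaults, not job 1's override
job2_timeout = Config.get("timeout_s")
print(job1_timeout, job2_timeout)
```

Without the reset line, the second `initialize({})` would leave `timeout_s` at job 1's value, which is the leak the commit message describes.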
Alex Wu
c9a419ac76
[Autoscaler] Remove staroid node provider (#22236)
The Staroid node provider has been abandoned and unmaintained for quite some time now. Due to the fact that there are no active maintainers, the original contributors cannot be reached, and there is no clear interest, we are no longer officially endorsing or supporting the node provider.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-09 09:18:18 -08:00
xwjiang2010
323511b716
[tune] Single wait refactor. (#21852)
This is a down-scoped change. For the full picture of the Tune control loop, see [`Tune control loop refactoring`](https://docs.google.com/document/d/1RDsW7SVzwMPZfA0WLOPA4YTqbRyXIHGYmBenJk33HaE/edit#heading=h.2za3bbxbs5gn)

1. Previously there were separate waits on PG readiness and other events. As a result, there were quite a few timing tweaks that were inefficient and hard to understand and unit test. This PR consolidates them into a single wait that is handled by TrialRunner in each step.
- A few event types are introduced, mapped to scenarios as follows:
  * PG_READY --> Should place a trial onto it. If somehow there is no trial to be placed there, the pg will be put in _ready momentarily. This is because resources have historically been conceptualized as a pull-based model.
  * NO_RUNNING_TRIALS_TIME_OUT --> possibly not sufficient resources case
  * TRAINING_RESULT
  * SAVING_RESULT
  * RESTORING_RESULT
  * YIELD --> This just means that training is simply taking very long. We need to punt back to the main loop to print out status info etc.

2. Previously TrialCleanup was not very efficient and could race between Trainable.stop() and `return_placement_group`. This PR streamlines the trial cleanup process by explicitly letting Trainable.stop() finish, followed by `return_placement_group(pg)`. Note that graceful shutdown is needed in cases like `pause_trial`, where checkpointing to memory needs to be given time to happen before the actor is gone.

3. Quite a few env variables (timing tweaks) are removed, which I consider OK to do without a deprecation cycle.
2022-02-09 15:31:17 +00:00
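The single-wait idea in #21852 can be sketched with `concurrent.futures`. This is not Tune's implementation: `wait_for_next_event`, the future-to-event mapping, and the choice to report a timeout as `YIELD` are all assumptions made for illustration, and only three of the event types are modeled.

```python
import enum
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class EventType(enum.Enum):
    PG_READY = "pg_ready"
    TRAINING_RESULT = "training_result"
    YIELD = "yield"


def wait_for_next_event(pending, timeout_s):
    """Block once on all pending futures; return (event_type, result)."""
    done, _ = wait(list(pending), timeout=timeout_s, return_when=FIRST_COMPLETED)
    if not done:
        # Nothing finished in time: hand control back to the main loop.
        return EventType.YIELD, None
    future = next(iter(done))
    return pending.pop(future), future.result()


with ThreadPoolExecutor(max_workers=2) as pool:
    # `pending` maps each in-flight future to the kind of event it represents.
    pending = {pool.submit(lambda: {"loss": 0.1}): EventType.TRAINING_RESULT}
    first_event = wait_for_next_event(pending, timeout_s=5)
    slow = {pool.submit(time.sleep, 0.5): EventType.PG_READY}
    second_event = wait_for_next_event(slow, timeout_s=0.01)
print(first_event, second_event)
```

With one blocking call per step, the loop has a single timeout to reason about instead of one per event source.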
Clark Zinzow
f264cf800a
[Datasets] Support ignoring NaNs in aggregations. (#20787)
Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in `ds.mean("A", ignore_nulls=False)` if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics:
1. Mix of values and nulls - `ignore_nulls`=True: Ignore the nulls, return aggregation of values
2. Mix of values and nulls - `ignore_nulls`=False: Return `None`
3. All nulls: Return `None`
4. Empty dataset: Return `None`

This all-null and empty-dataset handling matches the semantics of NumPy and Pandas.
2022-02-09 00:07:58 -08:00
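The four null-handling cases listed in #20787 can be written out as a plain-Python sketch. The `mean` function below is a hypothetical illustration of those semantics, not the Datasets aggregation code; it treats both `None` and float NaN as nulls, which is an assumption.

```python
import math


def _is_null(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean(values, ignore_nulls=True):
    """Mean following the four null-handling cases listed in the commit message."""
    non_null = [v for v in values if not _is_null(v)]
    has_null = len(non_null) != len(values)
    if not non_null:
        return None  # cases 3 and 4: all nulls, or empty dataset
    if has_null and not ignore_nulls:
        return None  # case 2: nulls present and not ignored
    return sum(non_null) / len(non_null)  # case 1, or no nulls at all


print(mean([1.0, None, 3.0]), mean([1.0, None, 3.0], ignore_nulls=False), mean([]))
```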
mwtian
71f63593f4
[Client] avoid locking in async send (#22193)
As @iycheng discovered in https://github.com/ray-project/ray/issues/22082#issuecomment-1031821631, when `ClientObjectRef` is being GC'ed, `DataClient.lock` is acquired which may cause deadlock. This change avoids acquiring lock in `DataClient._async_send()`.
2022-02-08 22:14:15 -08:00
SangBin Cho
20ab9188c6
[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170)
This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files.

The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. 

After this PR, we will add code to enable usage report "off by default".
2022-02-08 22:12:36 -08:00
SangBin Cho
d7cead7519
[Core] Improve ray stop (#22159)
I’d like to make a small proposal to change the behavior of ray stop

### Status quo
It basically just sends a SIGTERM and finishes asynchronously. This is vulnerable to leaked processes or port conflicts when you run `ray stop && ray start`. I feel like the right behavior is as follows:

### New
- Send sigterm and wait for processes to terminate
- Display the progress to users
- If procs are not terminated within X seconds, send SIGKILL.

### API change
We will add a `--grace-period` flag: `ray stop --grace-period=X` (10 seconds by default).

If users don’t want to be blocked, they can use `ray stop --force` (which already exists; it just sends SIGKILL, so procs are guaranteed to be terminated).
2022-02-08 22:07:44 -08:00
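The stop sequence proposed in #22159 (SIGTERM, wait, then SIGKILL) can be sketched with the standard library. `stop_process` is a hypothetical helper for a single child process, not the `ray stop` implementation, and it omits the progress display.

```python
import subprocess
import sys


def stop_process(proc, grace_period_s=10.0):
    """Ask a process to exit, wait up to grace_period_s, then force-kill it."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_period_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the process so it is not left as a zombie
        return "killed"


child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_process(child, grace_period_s=5.0)
print(outcome, child.returncode is not None)
```

Waiting for the process to exit before returning is what prevents the port conflicts mentioned in the commit message when a start follows immediately.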
Gagandeep Singh
000c56f764
[Serve] [Windows] Unskip all but test_redeploy_single_replica in test_deploy.py (#21391) 2022-02-08 16:30:25 -08:00
Balaji Veeramani
31ed9e5d02
[CI] Replace YAPF disables with Black disables (#21982) 2022-02-08 16:29:25 -08:00
Stephanie Wang
dcd96ca348
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
2022-02-08 14:50:50 -08:00
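The race described in #22120 comes from registering a new object with a count of 0 and incrementing it later. A minimal sketch of the fix, using a hypothetical `RefCounter` class rather than Ray's C++ reference counter, registers the object with its first reference already counted:

```python
import threading


class RefCounter:
    """Hypothetical reference counter; a stand-in for the C++ backend's table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def add_owned_object(self, object_id):
        # Start at 1 rather than 0: the new ObjectRef already holds this
        # reference, so the object is never observed as out of scope.
        with self._lock:
            self._counts[object_id] = 1

    def remove_local_ref(self, object_id):
        with self._lock:
            self._counts[object_id] -= 1
            if self._counts[object_id] == 0:
                del self._counts[object_id]

    def in_scope(self, object_id):
        with self._lock:
            return self._counts.get(object_id, 0) > 0


counter = RefCounter()
counter.add_owned_object("obj-1")
in_scope_after_create = counter.in_scope("obj-1")
counter.remove_local_ref("obj-1")
in_scope_after_release = counter.in_scope("obj-1")
print(in_scope_after_create, in_scope_after_release)
```

In this sketch the caller must not increment again when it wraps the ID in a reference object, which mirrors the frontend skipping its initial increment.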