Commit graph

13052 commits

Author SHA1 Message Date
Stephanie Wang
d699351748
[datasets] Use generators for merge stage in push-based shuffle (#25336)
This uses the generators introduced in #25247 to reduce memory usage during the merge stage in push-based shuffle. These tasks merge groups of map outputs, so it fits a generator pattern where we want to return merged outputs one at a time. Verified that this allows for merging more/larger objects at a time than the current list-based version.

I also tried this for the map stage in random_shuffle, but it didn't seem to make a difference in memory usage for Arrow blocks. I think this is probably because Arrow is already doing some zero-copy optimizations when selecting rows?

Also adds a new line to Dataset stats for memory usage. Unfortunately it's hard to get an accurate reading of physical memory usage in Python and this value will probably be an overestimate in a lot of cases. I didn't see a difference before and after this PR for the merge stage, for example. Arguably this field should be opt-in. For 100MB partitions, for example:
```
        Substage 0 read->random_shuffle_map: 10/10 blocks executed
        * Remote wall time: 1.44s min, 3.32s max, 2.57s mean, 25.74s total
        * Remote cpu time: 1.42s min, 2.53s max, 2.03s mean, 20.25s total
        * Worker memory usage (MB): 462 min, 864 max, 552 mean
        * Output num rows: 12500000 min, 12500000 max, 12500000 mean, 125000000 total
        * Output size bytes: 101562500 min, 101562500 max, 101562500 mean, 1015625000 total
        * Tasks per node: 10 min, 10 max, 10 mean; 1 nodes used

        Substage 1 random_shuffle_reduce: 10/10 blocks executed
        * Remote wall time: 1.47s min, 2.94s max, 2.17s mean, 21.69s total
        * Remote cpu time: 1.45s min, 1.88s max, 1.71s mean, 17.09s total
        * Worker memory usage (MB): 462 min, 1047 max, 831 mean
        * Output num rows: 12500000 min, 12500000 max, 12500000 mean, 125000000 total
        * Output size bytes: 101562500 min, 101562500 max, 101562500 mean, 1015625000 total
        * Tasks per node: 10 min, 10 max, 10 mean; 1 nodes used
```


## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-06-17 12:29:24 -07:00
Stephanie Wang
293c122302
[dataset] Use polars for sorting (#25454) 2022-06-17 12:26:46 -07:00
Clark Zinzow
c2ab73fc40
[Datasets] Add ray_remote_args to read_text. (#23764) 2022-06-17 12:24:11 -07:00
Archit Kulkarni
85be093a84
[runtime env] Make all plugins return a List of URIs (#25825)
Followup from #24622.  This is another step towards pluggability for runtime_env.  Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list.  This PR makes all plugins use `get_uris`, which simplifies the code overall.

Most of the lines in the diff just come from the new `format.sh` which sorts the imports.
2022-06-17 14:13:44 -05:00
Sven Mika
d90c6cfbd6
[RLlib] SimpleQ PolicyV2 (sub-classing). (#25871) 2022-06-17 20:12:16 +02:00
Simon Mo
438b6c78c8
[Release Tests] Add memory monitoring for Serve release test (#25868) 2022-06-17 11:11:56 -07:00
Stephanie Wang
09857907b7
[data] Fix bug in computing merge partitions in push-based shuffle (#25865)
Fixes a bug in push-based shuffle in computing the merge task <> reduce tasks mapping when the number of reduce tasks does not divide evenly by the number of merge tasks. Previously, if there were N reduce tasks for one merge task, we would do:
[N + 1, N + 1, ..., N + 1, all left over tasks]
which could lead to negative many reduce tasks n the last merge partition.

This PR changes it to:
[N + 1, N + 1, ..., N + 1, N, N, N, ...]
Related issue number

Closes #25863.
2022-06-17 10:19:00 -07:00
Alex Wu
187c21ce20
[gcs] Preserve job driver info for dashboard (#25880)
This PR ensures that GCS keeps the IP and PID information about a job so that it can be used to find the job's logs in the dashboard after the job terminates.

@alanwguo will handle any dashboard work in a separate PR.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-06-17 09:03:20 -07:00
Archit Kulkarni
b24c736bb8
[Doc] [runtime env] Add note that excludes paths are relative to working_dir (#25874)
Users' intuition might lead them to fill out `excludes` with absolute paths, e.g. `/Users/working_dir/subdir/`.  However, the `excludes` field uses `gitignore` syntax.  In `gitignore` syntax, paths that start with `/` are interpreted relative to the level of the directory where the `gitignore` file resides, and in our case this is the `working_dir` directory (morally speaking, since there's no actual `.gitignore` file.)  So the correct thing to put in `excludes` would be `/subdir/`.  As long as we support `gitignore` syntax, we should have a note in the docs for this.  This PR adds the note.
2022-06-17 10:50:04 -05:00
sychen52
edf16b8e2c
[docs] Edit the output of the script to match the code (#25855) 2022-06-17 10:48:28 -05:00
Fabian Witter
fcdf710574
[RLlib] Move offline input into replay buffer using rollout ops in CQL. (#25629) 2022-06-17 17:08:55 +02:00
matthewdeng
5c6b91d375
[Release] fix Horovod release tests (#25873)
Error message suggests:

Wait timeout after 30 seconds for key(s): 0. You may want to increase the timeout via HOROVOD_GLOO_TIMEOUT_SECONDS

Bumped up to 120 seconds.

Tests run successfully: https://buildkite.com/ray-project/release-tests-pr/builds/6906
2022-06-17 14:52:54 +01:00
Artur Niederfahrenhorst
a322cc5765
[RLlib] IMPALA/APPO multi-agent mix-in-buffer fixes (plus MA learning tests). (#25848) 2022-06-17 14:10:36 +02:00
Simon Mo
1c27469b6d
[macOS] Only cleanup directory after upload (#25835)
Missed it in previous enablement of uploading bazel log, we should no longer clean the directory anymore.
2022-06-17 12:46:37 +01:00
Kai Fricke
40a9fdcb0f
[tune/air] Fix checkpoint conversion for objects (#25885)
Converting Tracked memory checkpoints was faulty and untested.
2022-06-17 10:41:52 +01:00
Artur Niederfahrenhorst
e5740946b8
[RLlib] Fixes logging of all of RLlib's Algorithm names as warning messages. (#25840) 2022-06-17 08:41:18 +02:00
Avnish Narayan
393cf4d8f7
[RLlib] Fix action_sampler_fn call in TorchPolicyV2 (obs_batch instead of input_dict arg). (#25877) 2022-06-17 08:39:39 +02:00
Siyuan (Ryans) Zhuang
fea8dd08fc
[workflow] Enhance dataset tests (#25876) 2022-06-16 22:50:31 -07:00
sychen52
ce02ac0311
[docs] Fix example actor indentation (#25882) 2022-06-16 22:06:21 -07:00
yuduber
26b2faf869
[data] add retry logic to ray.data parquet file reading (#25673) 2022-06-16 21:49:41 -07:00
Guyang Song
974bbc0f43
[C++ worker] move xlang test to separate test file (#25756) 2022-06-17 11:05:24 +08:00
Jiao
f6735f90c7
[Ray DAG] Move dag project folder out of experimental (#25532) 2022-06-16 19:15:39 -07:00
Clark Zinzow
e111b173e9
[Datasets] Workaround for unserializable Arrow JSON ReadOptions. (#25821)
pyarrow.json.ReadOptions are not picklable until Arrow 8.0.0, which we do not yet support. This PR adds a custom serializer for this type and ensures that said serializer is registered before each Ray task submission.
2022-06-16 18:33:59 -07:00
Clark Zinzow
3dda4e1d46
[Docs] Add a py:obj default role to Sphinx builds. (#25765)
By setting the [Sphinx `default_role`](https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-default_role) to [`py:obj`](https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html#role-py-obj), we can concisely cross-reference other Python APIs (classes or functions) in API docstrings while still maintaining the editor/IDE/terminal readability of the docstrings.

Before this PR, when referencing a class or a function, the relevant role specification is required: :class:`Dataset`, :meth:`Dataset.map`, :func:`.read_parquet`.

After this PR, the raw cross reference will work in most cases: `Dataset`, `Dataset.map`, `read_parquet`.

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2022-06-16 16:33:20 -07:00
Stephanie Wang
977fff16a6
Set object spill config from env var (#25794)
Previously it was not possible to set the object spill config from the RAY_xxx environment variable, unlike other system configs. This is because the config is initialized in Python first, before the C++ config is parsed.
2022-06-16 16:08:19 -07:00
Chen Shen
8e7e89a178
[Data] fix broken link (#25867)
update the broken spark link.
2022-06-16 14:01:38 -07:00
Antoni Baum
2120d3ea09
[AIR] Change the name of GBDTTrainable dynamically (#25804) 2022-06-16 13:54:09 -07:00
Simon Mo
d83773b2f1
[Serve][AIR] Support mixed input and output type, with batching (#25688) 2022-06-16 13:00:29 -07:00
Clark Zinzow
04280d6e4e
[Datasets] Preserve cached block metadata on LazyBlockList splits. (#25745)
Preserves cached block metadata on LazyBlockList splits. Before this PR, after these splits, all block metadata would have to be re-fetched.
2022-06-16 12:36:25 -07:00
Clark Zinzow
d98adbc448
[Datasets] Fix tensor extension string formatting (repr). (#25768)
Fixes tensor extension string formatting, e.g. when invoking the DataFrame repr.
2022-06-16 12:35:11 -07:00
Kai Fricke
9b052d220e
[tune] Fix checkpoint deletion for custom syncers (#25859)
Deleting checkpoints with custom syncers was faulty and untested before this PR.
2022-06-16 19:53:43 +01:00
Simon Mo
e560bce3a4
[Serve] bind to 0.0.0.0 in serve_head (#25862) 2022-06-16 11:45:11 -07:00
Kai Fricke
c4590f3ab5
[air] Convert _TrackedCheckpoint to ray.air.Checkpoint (#25849)
We often need to convert our internal `_TrackedCheckpoint` objects to `ray.air.Checkpoint`s - we should do this in a utility method in the `_TrackedCheckpoint` class.
This PR also fixes cases where we haven't resolved the saved checkpoint futures, yet.
2022-06-16 19:31:18 +01:00
Jimmy Yao
b2e9aea908
[Ray dataset] detect dataframe dtype as object (#25811)
* fix ci

* not break master
2022-06-16 11:23:03 -07:00
Artur Niederfahrenhorst
f34cd2fd8f
[RLlib] Take replay buffer api example out of GPU examples. (#25841) 2022-06-16 19:12:38 +02:00
Clark Zinzow
56c001884d
[CI] [Core] Fix C++ lint on node_manager.cc (#25854)
Master CI lint is currently broken, this fixes the C++ lint on node_manager.cc.
2022-06-16 09:43:13 -07:00
Matti Picus
e275e8b0e7
WINDOWS: replace ':' with '$' for filename (#25767)
On windows, creating a file with a ':' in the name will fail. However '$' is fine.
2022-06-16 09:38:45 -07:00
Edward Oakes
e4352305dd
Revert "[serve] Use soft constraint for pinning controller on head node (#25091)" (#25857)
This reverts commit 0f600362dd.
2022-06-16 11:16:20 -05:00
shrekris-anyscale
d944f7469c
[Serve] [Docs] Remove references to namespaces in the Serve documentation (#25830)
#25575 starts all Serve actors in the `"serve"` namespace. This change updates the Serve documentation to remove now-outdated explanations about namespaces and to specify that all Serve actors start in the `"serve"` namespace.
2022-06-16 10:50:49 -05:00
Yi Cheng
4c5c5763ef
[ci][core] Add option for parallel ci for ray core tests (#25801)
This is the first step to enable the parallel tests for ray core ci. To reduce the noise this test only add the option and not enable them. Parallel CI can be 40%-60% faster compared with running them one-by-one.

We'll enable them by bk jobs one-by-one.

Prototype here #25612
2022-06-15 22:46:50 -07:00
Tao Wang
2d9af5028e
[Cpp worker]Support cpp call java task (#25757) 2022-06-16 10:02:46 +08:00
Stephanie Wang
1bd5b93ef4
[core] Clear backlogged deletions for spilled objects (#25742)
When objects that are spilled or being spilled are freed, we queue a request to delete them. We only clear this queue when additional objects get freed, so GC for spilled objects can fall behind when there is a lot of concurrent spilling and deletion (as in Datasets push-based shuffle). This PR fixes this to clear the deletion queue if the backlog is greater than the configured batch size. Also fixes some accounting bugs for how many bytes are pinned vs pending spill.
Related issue number

Closes #25467.
2022-06-15 18:23:21 -07:00
SangBin Cho
5fb61abba3
[Usage Stats][Hotfix] Import usage reported from workers. (#25785)
## Why are these changes needed?

We currently only record usage stats from drivers. This can lose some of information when libraries are imported from workers (e.g., doing some rllib import from trainable).

@jjyao just for the future reference.
2022-06-15 18:20:12 -07:00
matthewdeng
383954fb15
[CI] upgrade to go 1.18 (#25829)
upgrade go 1.18 to fix ci linter issue.
2022-06-15 17:31:07 -07:00
Antoni Baum
91dd360f9d
[AIR/train] Move predictors to ray.train (#25769) 2022-06-15 17:02:15 -07:00
Yi Cheng
5d77d2b160
[core][gcs] Fix the issue when gcs restarts, actor is destroyed due to bundle index equals -1 (#25789)
When GCS restarts, it'll recover the placement group and make sure no resource is leaking. The protocol now is like:

- Sending the committed PGs to raylets
- Raylets will check whether any worker is using resources from the PG not in this group
- If there is any, it'll kill that worker.

Right now there is a bug, which will kill the worker using bundle index equals -1.
2022-06-15 16:57:22 -07:00
Yi Cheng
bcb8ae9fbd
[core] Enable test_get_locations.py (#25814)
This test was skipped accidentally. This PR enabled it.
2022-06-15 16:47:09 -07:00
Antoni Baum
b5fd02af4f
[CI] Print linkcheck summary only in linkcheck (#25781) 2022-06-15 16:21:08 -07:00
Edward Oakes
0f600362dd
[serve] Use soft constraint for pinning controller on head node (#25091)
Un-reverting https://github.com/ray-project/ray/pull/24934 which caused `test_cluster` to become flaky. This was due to an oversight: we need to update the `HTTPState` logic to account for the controller not necessarily running on the head node.

This will require using the new `SchedulingPolicy` API, but I'm not quite sure the best way to do it. Context here: https://github.com/ray-project/ray/issues/25090.
2022-06-15 17:52:20 -05:00
Clark Zinzow
b51b777aae
[CI] [Datasets] [RayDP] Skip failing RayDP integration tests. (#25818)
Current causing the master Datasets CI job to fail due to a hard dependency on MLDataset, which has been deleted in Ray master. See #25816.
2022-06-15 15:20:52 -07:00