Commit graph

5269 commits

Author SHA1 Message Date
Linsong Chu
ce64e6dc45
[workflow] add metadata put in workflow (#19195)
## Why are these changes needed?

Add metadata to workflows. Currently there is no way for a user to attach metadata to a step or workflow run, and workflow run metrics (except status) are neither captured nor checkpointed.

We are adding several kinds of metadata:

1. Step-level user metadata. Can be set with `step.options(metadata={})`.
2. Step-level pre-run metadata. This captures pre-run metadata such as step_start_time; more metrics can be added later.
3. Step-level post-run metadata. This captures post-run metadata such as step_end_time; more metrics can be added later.
4. Workflow-level user metadata. Can be set with `workflow.run(metadata={})`.
5. Workflow-level pre-run metadata. This captures pre-run metadata such as workflow_start_time; more metrics can be added later.
6. Workflow-level post-run metadata. This captures post-run metadata such as workflow_end_time; more metrics can be added later.
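
The six kinds of metadata listed above can be pictured with a small toy model in plain Python. This is only a sketch and does not use Ray: `run_step`, `run_workflow`, and the layout of the record dictionaries are hypothetical names chosen for illustration. Only the `metadata={}` option and the metric names (`step_start_time`, `step_end_time`, `workflow_start_time`, `workflow_end_time`) come from the description.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Metadata = Dict[str, Any]
Step = Tuple[str, Callable[[], Any], Metadata]


def run_step(func: Callable[[], Any], user_metadata: Metadata) -> Tuple[Any, Metadata]:
    """Run one step and record its user, pre-run, and post-run metadata."""
    record: Metadata = {"user_metadata": dict(user_metadata)}
    record["pre_run"] = {"step_start_time": time.time()}
    result = func()
    record["post_run"] = {"step_end_time": time.time()}
    return result, record


def run_workflow(steps: List[Step], user_metadata: Metadata) -> Tuple[List[Any], Metadata]:
    """Run steps in order and record workflow-level and step-level metadata."""
    record: Metadata = {"user_metadata": dict(user_metadata), "steps": {}}
    record["pre_run"] = {"workflow_start_time": time.time()}
    results = []
    for name, func, step_metadata in steps:
        result, step_record = run_step(func, step_metadata)
        record["steps"][name] = step_record
        results.append(result)
    record["post_run"] = {"workflow_end_time": time.time()}
    return results, record


# Hypothetical usage: two steps, each with its own user metadata.
results, record = run_workflow(
    [
        ("add", lambda: 1 + 2, {"team": "a"}),
        ("mul", lambda: 2 * 3, {}),
    ],
    {"owner": "alice"},
)
```

In this model the user metadata is copied at submission time, while the pre-run and post-run entries are filled in around the actual execution, which mirrors the split between items 1 to 3 (step level) and 4 to 6 (workflow level).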

## Related issue number

https://github.com/ray-project/ray/issues/17090

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-10-12 21:01:24 -07:00
Clark Zinzow
1b179adfa1
[Core] [Hotfix] Handle logging redirected to stdout when configuring log file (#19301) 2021-10-12 19:03:21 -07:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Clark Zinzow
df6d06bd41
Fix for LazyBlockList refactor. (#19333) 2021-10-12 18:18:45 -07:00
Amog Kamsetty
09d8049584
[SGD] Make actor creation async (#19325)
* fix

* fix

* fix
2021-10-12 16:15:59 -07:00
Eric Liang
9f1cd9e867
[docs] Document fake multi-node autoscaler (#19329) 2021-10-12 15:59:07 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Carlo Grisetti
7651cc782a
Change prometheus warning filename source (#19275)
* Change prometheus warning filename source

* Fix linting
2021-10-12 14:02:51 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Lixin Wei
f2f9c749cb
[Build] Add an Option to Skip Bazel Build (#19265) 2021-10-12 12:01:58 -07:00
Eric Liang
0ab6749602
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group (#19277) 2021-10-12 11:04:14 -07:00
Clark Zinzow
6ca3c02041
[Datasets] Parallelize Parquet metadata fetches. (#19211) 2021-10-12 11:02:30 -07:00
dependabot[bot]
74ee99ff99
[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)
* [RLlib](deps): Bump tensorflow in /python/requirements/rllib

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.6.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.6.0)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* wip.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-12 17:56:36 +02:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] (#19220)
* done

* Fix all unit tests

* done

* .

* Fix the build issue

* fix the compilation bug
2021-10-12 07:41:51 -07:00
Wansoo Kim
0f6d4661d7
[tune] Port all MNIST examples to specify data_dir (#19033)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-12 15:36:06 +01:00
gjoliver
5d14904b9b
[Tune] catch HTTPError when logging to wandb. (#19314) 2021-10-12 14:38:17 +01:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags (#19287) 2021-10-12 10:05:46 +01:00
Chris K. W
35230ea9fa
[client] deflake test_stdout_log_stream (#19232)
* deflake test_stdout_log_stream

* add assert message
2021-10-11 22:22:39 -07:00
architkulkarni
cc16e8f8c5
[runtime env] Validate "excludes" field (#19302) 2021-10-11 20:05:22 -07:00
Jiao
85b8a6de5f
[Serve] Add nightly test for Serve failure recovery (#19125) 2021-10-11 18:33:20 -07:00
Carlo Grisetti
c2377fb725
[Serve] Call without loop parameter if python 3.10+ (#19298) 2021-10-11 18:31:13 -07:00
Eric Liang
6cacc54774
[RFC] Fake multi-node mode for autoscaler (#18987) 2021-10-11 18:27:29 -07:00
SangBin Cho
0d7a7a06c0
[Placement group] Warm up the cluster before running the unit test #19286 (#19286) 2021-10-11 16:26:52 -07:00
Carlo Grisetti
2d0355548e
[Dashboard] Try to work around aiohttp 4.0.0 breaking changes (#19120) 2021-10-11 16:25:52 -07:00
Patrick Ames
a43193b9e5
[data] Add support for Arrow open input/output stream kwargs. (#19197) 2021-10-11 15:38:15 -07:00
Chen Shen
c740aae54c
[Core][Dataset] adding example for large scale data ingestion (#18998) 2021-10-11 15:37:09 -07:00
Jiajun Yao
92516981ea
[core] Increase worker lease parallelism (#18647) 2021-10-11 15:34:32 -07:00
Amog Kamsetty
b3ad72643c
[Tune] Call on_trial_complete after final checkpoint (#19243) 2021-10-11 09:47:39 -07:00
Kai Fricke
6252a6c1f9
[tune] Force no result buffering for hyperband schedulers (#19140) 2021-10-11 16:56:11 +01:00
Guyang Song
ab55b808c5
[runtime env] move worker env to runtime env in Java (#19060) 2021-10-11 17:25:09 +08:00
Shantanu
0c4603f836
[core] nicer error message for unpickleable exceptions (#17936)
* [core] nicer error message for unpickleable exceptions

I ran into a case where we had an exception that could not be unpickled:
```
pickle.loads(pickle.dumps(filelock.Timeout()))
```

When a filelock.Timeout is raised on a worker, it gets surfaced in a way
that makes ray look like it was responsible:
```
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
```

This PR turns the following stacktrace:
```
    return ray.get(refs, timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
    raise value
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 28, in from_bytes
    return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'
```

into this:
```
  ...
    return ray.get(refs, timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
    raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 29, in from_bytes
    return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 31, in from_bytes
    raise RuntimeError("Failed to unpickle serialized exception") from e
RuntimeError: Failed to unpickle serialized exception
```
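
The behaviour described above can be reproduced without Ray. The following is a minimal sketch under stated assumptions: `LockTimeout` is a hypothetical stand-in for an exception such as `filelock.Timeout`, and this `from_bytes` is a simplified free function, not Ray's `RayError.from_bytes`. It shows why unpickling fails and how `raise ... from e` keeps the original `TypeError` as the direct cause.

```python
import pickle


class LockTimeout(Exception):
    """Hypothetical stand-in for an exception such as filelock.Timeout."""

    def __init__(self, lock_file):
        # The argument is not forwarded to Exception.__init__, so self.args
        # stays empty and pickle later calls LockTimeout() with no arguments.
        super().__init__()
        self.lock_file = lock_file


def from_bytes(serialized: bytes) -> BaseException:
    """Unpickle an exception, wrapping any failure in a clearer error."""
    try:
        return pickle.loads(serialized)
    except Exception as e:
        raise RuntimeError("Failed to unpickle serialized exception") from e


# Pickling succeeds; only the round trip back fails.
payload = pickle.dumps(LockTimeout("/tmp/example.lock"))

try:
    from_bytes(payload)
    caught = None
except RuntimeError as err:
    caught = err  # caught.__cause__ holds the original TypeError
```

Because the wrapper uses exception chaining, the traceback prints the `TypeError` first, followed by "The above exception was the direct cause of the following exception", as in the second stack trace above.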

* lint

* test_unpickleable_stacktrace

* lint

* .

* .

Co-authored-by: hauntsaninja <>
2021-10-11 01:19:19 -07:00
SangBin Cho
3b865b463a
[Core] Fix GPU first scheduling that is not working with placement group (#19141)
* done

* Revert "done"

This reverts commit 56b18f0a7d14c5466d726c3ed1264f3e1506771e.

* ip

* Revert "Revert "done""

This reverts commit a34c90b0920893f4efbf171b8159f0d08a10dca0.

* Done

* Remove unnecessary log message

* skip test on windows

* Handle the code review.
2021-10-11 00:12:25 -07:00
Sasha Sobol
e8d1fc36cb
Make binpacking prioritize nodes better (#19212)
* Make binpacking prioritize nodes better

Make binpacking prefer nodes that match multiple
resource types.

* spelling

* order demands when binpacking, starting from complex ones

* add stability to resource demand ordering

* lint

* logging

* add comments

* +comment

* use set
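
The ideas in the bullets above (prefer nodes that match several resource types, and order demands starting from the complex ones with a stable tie-break) can be illustrated with a simplified sketch. This is not the autoscaler's code: `order_demands`, `matching_types`, and `pick_node` are hypothetical helpers, and the scoring rule is an assumption made for illustration.

```python
from typing import Dict, List, Optional

Resources = Dict[str, float]


def order_demands(demands: List[Resources]) -> List[Resources]:
    """Put demands with more resource types first.

    sorted() is stable, so demands of equal complexity keep their input order.
    """
    return sorted(demands, key=lambda demand: -len(demand))


def matching_types(node: Resources, demands: List[Resources]) -> int:
    """Count the node's resource types that at least one demand asks for."""
    requested = set()  # a set, so each resource type is counted once
    for demand in demands:
        requested.update(demand)
    return sum(1 for resource in node if resource in requested)


def pick_node(nodes: Dict[str, Resources], demands: List[Resources]) -> Optional[str]:
    """Pick the node matching the most requested resource types (first wins ties)."""
    if not nodes:
        return None
    return max(nodes, key=lambda name: matching_types(nodes[name], demands))


# Hypothetical node types and demands.
nodes = {"cpu_only": {"CPU": 8}, "gpu": {"CPU": 8, "GPU": 1}}
demands = [{"CPU": 1}, {"CPU": 1, "GPU": 1}, {"GPU": 1}]

ordered = order_demands(demands)
chosen = pick_node(nodes, ordered)
```

With these inputs the mixed CPU and GPU demand is considered first, and the `gpu` node is chosen because it matches two requested resource types rather than one.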
2021-10-10 14:56:47 -04:00
Guyang Song
bae543c956
[runtime env] support eager_install in runtime env (#17949) 2021-10-09 17:59:57 +08:00
Eric Liang
a92f1fedf4
Revert "[tune/wip] Exclude trial checkpoints in experiment sync (#19185)" (#19245)
This reverts commit 44b0b6eb20.
2021-10-08 19:47:12 -07:00
Eric Liang
b59317520d
Revert "[Workflow] workflow.delete (#19178)" (#19247)
This reverts commit 7ea512f343.
2021-10-08 19:12:55 -07:00
Alex Wu
7ea512f343
[Workflow] workflow.delete (#19178)
Why are these changes needed?
This PR implements workflow.delete which allows users to delete the information in storage related to a workflow. (This assumes the workflow isn't currently running).

Related issue number
Closes #18848
2021-10-08 16:11:59 -07:00
Jiajun Yao
c31f0e17e6
Replace ray.__commit__ with the actual commit SHA when we build the windows wheel (#19213)
2021-10-08 16:06:52 -07:00
Sven Mika
d439fd7f17
[RLlib] TF2/eager memory leak fixes. (#19198) 2021-10-09 00:11:53 +02:00
Edward Oakes
47447c71e0
[serve] Remove excessive backend_state.update() calls in unit tests (#19225)
These extra update cycles are no longer needed now that we removed the SHOULD_START and SHOULD_STOP states.
2021-10-08 16:36:44 -05:00
mwtian
b066627539
[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 (#19216) 2021-10-08 12:58:18 -07:00
Patrick Ames
fa047c050b
[data] Make directory creation in dataset output path optional. (#19202) 2021-10-08 12:36:10 -07:00
Edward Oakes
9cf19b67cc
[serve] Remove long poll client from replicas (#19145)
In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.
2021-10-08 12:32:42 -05:00
Edward Oakes
86d1a5bfc6
[serve] Catch ConnectionError during shutdown in LongPollClient (#19224) 2021-10-08 12:31:35 -05:00
Edward Oakes
93bcea7bdd
[serve] Clean up kv store file, skip on windows (#19194) 2021-10-08 12:30:48 -05:00
Kai Fricke
44b0b6eb20
[tune/wip] Exclude trial checkpoints in experiment sync (#19185) 2021-10-08 18:26:03 +01:00
Kai Fricke
e5e1ba93d9
[tune] Use queue to display JupyterNotebookReporter updates in Ray client (#19137) 2021-10-08 18:23:20 +01:00
Antoni Baum
c7d6f838f6
[tune] Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144) 2021-10-08 18:16:26 +01:00
Eric Liang
8beabb283b
Force disable placement_group for all dataset tasks (#19208) 2021-10-08 10:16:09 -07:00