hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Stephanie Wang	cb218d03b9	[core] Enable lineage reconstruction by default (#22816 ) Enables lineage reconstruction, which allows automatic recovery of task outputs, by default. Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).	2022-03-07 17:40:30 -05:00
Stephanie Wang	fa14120f93	Move tests out of test_object_spilling to de-flake (#22831 ) This test is timing out often in debug_mode, so moved some tests to test_object_spilling_3.	2022-03-07 17:39:55 -05:00
SangBin Cho	529911ee78	[Nightly tests] Add missing patches (#22862 ) These changes are added to the old e2e.py, but not to the new infra	2022-03-07 19:48:43 +00:00
SangBin Cho	79e8405fda	Revert "[GCS] refactor the resource related data structures on the GCS (#22817 )" (#22863 ) This reverts commit `549466a42f`.	2022-03-07 08:48:17 -08:00
shrekris-anyscale	15d97a1021	[serve] Support `init_args` and `init_kwargs` in `serve run` (#22805 ) #22714 added `serve run` to the Serve CLI. This change allows the user to specify `init_args` and `init_kwargs` in `serve run` if they are deploying via import path.	2022-03-07 09:45:17 -06:00
Jiajun Yao	1b5efb588e	[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844 ) This's easier to sort and compare timestamp and avoid timezone issue.	2022-03-07 04:54:19 -08:00
Akash Patel	5ebc32d7c2	[Core] Update grpc to 1.44.0 (#22384 ) Updates grpc to 1.44.0 to remove local patch needed for grpc to build. EDIT: there have been changes to how python is found (mostly removing python2 support) and as such the local python-patch we have for grpc needs to be modified. This time contributing it to upstream (grpc/grpc#28895) so that it'll get added in a newer version! For anyone that comes across this: Here is the error itself for why we need the grpc-python.patch file: https://buildkite.com/ray-project/ray-builders-pr/builds/24659#d293616f-225d-41f9-8de2-03780f12b13f/2386-2416	2022-03-07 04:53:48 -08:00
ZhuSenlin	549466a42f	[GCS] refactor the resource related data structures on the GCS (#22817 )	2022-03-07 18:43:33 +08:00
SangBin Cho	9d0148dbbe	[Test] Migrate the first test to the new infra (#22770 ) This migrate the simplest nightly test to the new infra. I will also explore k8s migration with this test	2022-03-06 18:24:54 -08:00
shrekris-anyscale	2490b3e383	[serve] Enable serve-decorated deployment via import path (#22839 ) Currently, classes and functions can be deployed by setting `Deployment`'s`func_or_class` to their import path. However, if these classes or functions are already decorated with `@serve.deployment`, the import path deployment will error. This change instead ignores the settings in a class or function's `@serve.deployment` decorator when deploying via import path. It takes the code definition and deploys it without erroring. It also logs a warning about the ignored settings.	2022-03-06 20:03:57 -06:00
shrekris-anyscale	521298e093	[serve] Make route prefix the deployment name by default (#22840 ) The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which make `route_prefix` the deployment's name when omitted. This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access. This change also fixes a bug in Serve where when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.	2022-03-06 20:03:31 -06:00
Jiao	2d2b5745ae	[5/X][Pipeline][Ray DAG] Make Ray InputNode more powerful with attr accessor (#22793 ) - Enhanced ray dag InputNode to take arbitrary user input via `.execute()`. - If only one value is provided, like `dag.execute(1)`, return raw value; - Otherwise wrap user input into an `DAGInputData` object that can be accessed via index or key. - User can also pass list / dict object and just access them via index [0] or key ["key"] - Introduced `InputAttrNode` that helps to connect partial attribute of user input to the DAG. - Added context manager syntax for `InputNode`. - Add InputNode enforcements with tests, such as DAG level singleton, exception with messages, etc. - Enforce only simple int or str key - Take care of JSON serialization for InputNode that carried original context manager info, ensure it's preserved. - DAGNode UUID is also preserved in JSON serde. ## Next steps On ray dag level we're proceeding with ``` with InputNode() as input: # Probably better to rename it to DAGInput() a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` But also enforces 1) InputNode is always used in context manager as opposed to directly created 2) There should be one and only one InputNode instance for each dag. 3) No args passed by user to InputNode at ray dag level. Then in serve we subclass a ServeInputNode() to enhance it like the following to support HTTP input validation and conversion: ``` with ServeInputNode(schema=MySchemaCls) as input: a = Model.bind(input[0]) b = Model.bind(input.x) dag = combine.bind(a, b) ``` ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2022-03-06 20:02:42 -06:00
mwtian	f67ff312a8	run mac c++ tests with static linking (#22829 ) There are problems with running C++ tests in MacOS 10.15 Catalina, when upgrading to the newest grpc due to dynamic linking: #22384 (comment). The problem does not exist for Python tests in Catalina, or in C++ tests of other systems. Upgrading MacOS CI from Catalina is also blocked in the short term: ray-project/buildkite-ci-stack#24 (comment) So working around the issue by using static linking for C++ tests on Mac.	2022-03-05 10:39:32 +09:00
Clark Zinzow	3d63313265	[Datasets] Batch across windows in DatasetPipelines. (#22830 ) This PR allows `DatasetPipeline.iter_batches()` to batch data across windows in the pipeline. This prevents partial batches from popping up in the middle of consuming a dataset pipeline due to window boundaries, and now allows us to provide the following guarantee to the user: `pipe.iter_batches()` will yield `len(pipe) // batch_size` full batches, with a partial batch occurring only (1) as the final batch and (2) only if `len(pipe) % batch_size > 0`, and if it exists, will have size `len(pipe) % batch_size`. The crux of this PR takes the block batching implementation from `Dataset.iter_batches()`, refactors it to operate on an iterator of blocks instead of a `Dataset` and pulls it out into a shared `batch_blocks()` utility, and have `DatasetPipeline.iter_batches()` use it to batch over windows by providing an iterator over all blocks in all windows.	2022-03-04 16:26:44 -08:00
Jiajun Yao	23f2862067	[Release Test] Send release test result to db pipeline for new test infra (#22813 ) * Send release test result to db pipeline for new test infra * address comment	2022-03-05 07:34:40 +09:00
Yi Cheng	5bbbfac5e8	[gcs] Fix resource updating incorrectly (#22644 ) When there is no scheduling task of scheduling class in local raylet, the backlog resource will not be reported. It usually will happen when core worker try to schedule the task on other node and report backlog to local node. This will lead to the wrong demands.	2022-03-04 14:32:54 -08:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.	2022-03-04 12:32:17 -08:00
Sven Mika	3fe6f3b3eb	[RLlib] 2 bug fixes: Bandit registration not working if torch not installed. Env checker for MA envs. (#22821 )	2022-03-04 19:16:30 +01:00
Jun Gong	e765915ded	[RLlib] Make sure SlateQ works with GPU. (#22738 )	2022-03-04 17:49:51 +01:00
Max Pumperla	b609bdf898	[docs] Improve connection between library references and their APIs (#22800 ) Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-04 16:48:03 +01:00
Eric Liang	80aac655ca	Fix flaky metric test (#22809 )	2022-03-03 20:44:50 -08:00
Siyuan (Ryans) Zhuang	d72350bfe6	[workflow] Fix different step directories are used for "workflow.wait" during recovery (#22782 ) * add test	2022-03-03 16:37:50 -08:00
Jian Xiao	b933587597	Support map_groups in dataset (#22709 ) Make Dataset capable of running map_groups(), i.e. apply a UDF on each group after a groupby() operation.	2022-03-03 15:14:00 -08:00
mwtian	55166f0780	Revert "Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 )" (#22808 ) This reverts commit `b98c9c77f1`.	2022-03-03 12:32:28 -08:00
Chen Shen	f0ba0a3d3d	[LocalResourceManager] unify (Add/Subtract)(CPU/GPU)ResourceInstances (#22777 ) * add * more	2022-03-03 09:15:49 -08:00
Antoni Baum	283666fe02	[docs] Update XGBoost/LightGBM-Ray docs (#22783 ) Brings the docs up to date with XGBoost/LightGBM-Ray readmes.	2022-03-03 18:02:43 +01:00
xwjiang2010	ee7a458762	[release test] fix horovod release test. (#22781 ) horovod_user_test_master is failing with recent horovod release[[link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)]. Error message is saying: ``` AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'> ``` The horovod test is set up in such a way that it has the "driver" (a.k.a. client) part (which is the code that runs in a buildkite agent) and the "cluster" (a.k.a. server) part (which runs in Anyscale cluster). Driver's dependency is specified by `release/ml_user_tests/horovod/driver_setup_master.sh` while cluster's dependency is specified by `release/horovod_tests/app_config_master.yaml`. The two communicate via Anyscale client. The above error message is complaining that while client's horovod version has _ExecutorDriver in runner.py, the server's horovod doesn't. This is due to the version mismatch of the above two files. This PR brings the two horovod dependency to both point to horovod master.	2022-03-03 08:24:26 -08:00
Kai Fricke	84a163a2c4	[RLlib] Remove atari rom install script (#22797 )	2022-03-03 16:55:56 +01:00
shrekris-anyscale	71a493cf1f	[serve] Add run, delete, and status to Serve CLI (#22714 ) This change adds `run`, `delete`, and `status` commands to the CLI introduced in #22648. * `serve run`: Blocking command that allows users to deploy a YAML configuration or a class/function via import path. When terminated, the deployment(s) is torn down. Prints status info while running. Supports interactive development. * `serve delete`: Shuts down a Serve application and deletes all its running deployments. * `serve status`: Displays the status of a Serve application's deployments.	2022-03-03 09:50:36 -06:00
Jiao	76dc4ccbfd	[4/X][Pipeline] JSON serialization for serve dag nodes (#22710 ) Added JSON serde for all DAGNode types needed with tests on ray core dag as well as serve dag. See code inline comments for behavior and assumption for each.	2022-03-03 09:49:43 -06:00
Dmitri Gekhtman	991a62dd47	Operator does not retry monitor on failure. (#22792 )	2022-03-02 23:37:03 -08:00
Jialing He	207d93a52c	[runtime env] Make env_vars take effect when pip install packages (#22730 ) Previously, for the stability of pip installation, we set env to empty, but when pip installs some gzip package, maybe need env_vars. like this issue: https://github.com/ray-project/ray/issues/22610	2022-03-02 21:47:34 -06:00
mwtian	b98c9c77f1	Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 ) This reverts commit `88d2e21585`.	2022-03-02 18:29:31 -08:00
Chen Shen	e8c823791b	[scheduling-ids] enforce thread-private #22775	2022-03-02 16:27:49 -08:00
mwtian	02d09da7b4	[Core] remove verbose logs (#22785 ) IIUC, these log statements added in #22612 do not seem intended.	2022-03-02 16:00:26 -08:00
Clark Zinzow	fa44ec82f3	Add Parquet metadata resolution nightly test to test set. (#22787 )	2022-03-02 14:56:00 -08:00
Archit Kulkarni	e937f1a3c4	[runtime env] [Doc] add more details about runtime env logs (#22480 ) Clarifies the logging behavior for runtime envs, and adds the runtime env logs fileto the list of log files in the main logging page.	2022-03-02 14:27:28 -08:00
Dmitri Gekhtman	a8d8d0e1a6	Fix K8s API (#22756 ) This PR fixes K8s support by updating the api client used for ingresses.	2022-03-02 09:59:16 -08:00
Jiajun Yao	440732f267	Fix mac osx worker process not being killed by ray stop (#22758 ) For mac osx, setproctitle doesn't change the process name returned by psutil (I think it's this issue https://github.com/dvarrazzo/py-setproctitle/issues/10) but only cmdline so we need to filter by cmdline instead.	2022-03-02 09:02:48 -08:00
Kai Fricke	7425fa6212	[ci/release] Add support for concurrency groups (#22728 ) This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources. Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)	2022-03-02 16:35:54 +01:00
Jiajun Yao	04a1a19f6b	[Release Test] Send release test result to db pipeline (#22667 ) Send release test result to db pipeline Add perf metrics for microbenchmark so that we can alert on them	2022-03-02 06:19:31 -08:00
Max Pumperla	d53d0e0f50	[docs] Typo - fixes #22761 (#22763 ) Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-02 10:34:46 +01:00
Kai Fricke	a9bf5e9e2f	[ci] Update GPU docker image to Ubuntu 20.04 (#22759 ) This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.	2022-03-02 10:28:26 +01:00
Max Pumperla	7d4296c72f	run code in browser (#22727 ) Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand. If this seems useful we can extend this to other examples gradually. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-02 10:27:00 +01:00
Chen Shen	3e3db8e9cd	[scheduler] hide StringIDMap under BaseSchedulingID (#22722 ) * add * address comments	2022-03-01 22:50:53 -08:00
Yi Cheng	271ed44143	[2][resource reporting] Encapsulate poller and broadcaster into syncer in gcs (#22464 ) This PR move the poller and broadcaster from gcs server to ray syncer. TODO in next PR: deprecate the code path of placement group resource reporting and move the broadcaster out of gcs cluster resource manager.	2022-03-01 21:51:14 -08:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Stephanie Wang	d97afb9e60	[data] Pin pipeline executor actors to the driver node (#22715 ) DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure. This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.	2022-03-01 18:06:14 -08:00
Dmitri Gekhtman	4acbf36453	[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688 ) Closes #21353 and fixes an issue that causes dashboard to read K8s CPU requests rather than resources when determining CPUs available.	2022-03-01 17:15:59 -08:00
Eric Liang	06d4444b4a	Never re-use task workers for actors or GPU tasks (#22482 ) Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit. This adds a slight performance penalty for actor startup.	2022-03-01 16:46:18 -08:00

1 2 3 4 5 ...

11525 commits