Interface for DataParallelTrainer and updates to ScalingConfig definition.
Depends on #22986
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
The include of content for Markdown files, such as our central getting started page, didn't render. Fixed here.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
Using Ray on a SLURM system is documented, but some networking pitfalls are missing. This PR adds information about port binding and address binding (I will open a feature request with more details and link it here later).
I did not give a real recommendation on this last point since `--address` did not work: I ran into a "cannot resolve" issue after setting an internal IP, even though it is reachable.
Currently, when a spill/restore worker fails while it is marked idle in the worker pool, the worker pool does not clean up the worker's metadata. Subsequent spill/restore requests then reuse this dead worker, and the RPC requests cannot succeed. This breaks object spilling functionality.
This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.
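A minimal Python sketch of this cleanup, purely for illustration; the actual worker pool lives in the C++ core and the class below is hypothetical:

```python
# Hypothetical sketch: drop a disconnected IO worker from both bookkeeping
# sets so later spill/restore requests cannot be routed to a dead worker.
class IOWorkerPoolSketch:
    def __init__(self):
        self.registered_io_workers = set()
        self.idle_io_workers = set()

    def on_worker_disconnected(self, worker_id):
        # Previously the dead worker stayed in these sets while idle and
        # could be handed out again, so its RPCs would always fail.
        self.registered_io_workers.discard(worker_id)
        self.idle_io_workers.discard(worker_id)
```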
`PATH` is easily changed within a terminal session, and different `$PATH` values lead to Bazel cache misses. For example, `pip install -e python` and `bazel build //:all` don't share a cache because Python modifies `PATH`.
`LC_ALL`, `LANG`, and Python-related environment variables are only used by C++ worker tests, which invoke the `ray start` command when running tests with `bazel test`. The Java worker is not affected because we don't use `bazel test` to run Java tests. So these env variables should stay in `test_env`, not `action_env`.
This PR can greatly improve the cache hit rate of Bazel builds and tests.
Previously, DatasetPipeline stages were each executed by their own actor, which compromised fault tolerance through lineage reconstruction. This PR centralizes all task submission at the pipeline coordinator to improve fault tolerance. To preserve pipeline parallelism, the stages are executed by a threadpool. To clean up the threadpool, the pipeline coordinator adds any running threads to a global set that the threads check during `ray.wait`.
Note that this only provides fault tolerance for split pipes if all pipeline consumers stay alive. It will not work if one of the consumers dies and restarts, because `next_dataset_if_ready` is not idempotent.
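A minimal sketch of this coordination pattern, with hypothetical names and simplified logic; it is not the actual DatasetPipeline implementation:

```python
import threading

import ray

# Global set of thread idents that should stop, re-checked between ray.wait calls.
_threads_to_stop = set()
_lock = threading.Lock()

def stage_thread(object_refs):
    remaining = list(object_refs)
    while remaining:
        with _lock:
            if threading.get_ident() in _threads_to_stop:
                return  # the coordinator requested cleanup
        # Use a short timeout so the stop flag is re-checked periodically.
        _, remaining = ray.wait(remaining, timeout=1.0)

def request_stop(thread: threading.Thread):
    # Called by the pipeline coordinator during cleanup.
    with _lock:
        _threads_to_stop.add(thread.ident)
```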
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Adds the following interfaces (without implementation, for discussion / approval; a hypothetical usage sketch follows the CLI command list below):
- `serve.Application`
- `serve.DeploymentNode`
- `serve.DeploymentMethodNode`, `serve.DAGHandle`, and `serve.drivers.PipelineDriver`
- `serve.run` & `serve.build`
In addition to these Python APIs, we will also support the following CLI commands:
- `serve run [--blocking=true] my_file:my_node_or_app # Uses Ray client, blocking by default.`
- `serve build my_file:my_node output_path.yaml`
- `serve deploy [--blocking=false] # Uses REST API, non-blocking by default.`
- `serve status [--watch=false] # Uses REST API, non-blocking by default.`
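A hypothetical usage sketch of the proposed Python APIs, for discussion only; the decorator/`bind` composition shown here is an illustrative assumption, and only `serve.run`, `serve.build`, and `DeploymentNode` are names taken from this proposal:

```python
from ray import serve

@serve.deployment
class Doubler:
    def __call__(self, inp: int) -> int:
        return inp * 2

# Composing deployments would yield a DeploymentNode that serve.run deploys
# (blocking when invoked via `serve run`) and serve.build turns into a static
# application/config consumable by `serve deploy`.
node = Doubler.bind()  # hypothetical composition API
app = serve.run(node)
```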
Adds the ability for users to specify a custom results post-processing function that is applied to metrics before they are reported to Tune in the XGBoost/LightGBM integration callbacks, enabling support for `xgb.cv`/`lgbm.cv`. Updates the example to show it in action and runs it in CI.
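A minimal sketch of the idea, assuming a `results_postprocessing_fn`-style parameter on the Tune integration callback; the parameter name and the shape of the results dict are illustrative assumptions:

```python
from ray.tune.integration.xgboost import TuneReportCheckpointCallback

def postprocess(results: dict) -> dict:
    # For xgb.cv-style output, reduce per-fold lists to a single mean value
    # before the metrics are reported to Tune (illustrative only).
    return {
        name: sum(value) / len(value) if isinstance(value, list) else value
        for name, value in results.items()
    }

callback = TuneReportCheckpointCallback(
    metrics={"error": "test-error"},
    results_postprocessing_fn=postprocess,  # assumed parameter name
)
```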
Currently, all Buildkite runs report by default. Instead, we only want to report for scheduled builds or when this behavior is specifically overridden.
Infra errors are now tackled with concurrency groups, so we can disable old mitigation methods like automatic infra retry for now.
We keep the script since it performs other logic (e.g. checking out the local test branch), and infra retry can still be enabled via an environment variable if needed.
This PR improves the broken k8s tests.
Use exponential backoff on the unstable HTTP path (getting the job status sometimes results in a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this happens). A sketch of the retry approach follows below.
Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means `cluster_resources` can never equal the available resources. I fixed the issue by excluding node IP resources from the check.
K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. This will increase the cost a bit, but not by much.
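A minimal sketch of the retry idea, with a hypothetical client and error type; it is not the actual test code:

```python
import time

def get_job_status_with_backoff(client, job_id, max_attempts=5):
    # Retry transient connection failures with exponential backoff
    # (1s, 2s, 4s, ...) before giving up.
    for attempt in range(max_attempts):
        try:
            return client.get_job_status(job_id)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)
```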
Fixes a potential error when a function is not found in the Azure SDK while deploying a Ray cluster on Azure.
Adds to the docs the additional Python package needed to deploy a Ray cluster on Azure.
Co-authored-by: Scott Graham <scgraham@microsoft.com>
Of all smoke test arguments, `frequency` is the only required one, so we should check for it. Additionally, not all fields should be overridable (e.g. `legacy` or `name`), so we enforce this as well.
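A minimal sketch of this validation, with hypothetical helper and field names beyond `frequency`, `name`, and `legacy`:

```python
NON_OVERRIDABLE_FIELDS = {"name", "legacy"}

def validate_smoke_test(smoke_test: dict) -> None:
    # `frequency` is the only required smoke test argument.
    if "frequency" not in smoke_test:
        raise ValueError("Smoke test is missing the required field `frequency`.")
    # Certain fields must not be overwritten by the smoke test definition.
    illegal = NON_OVERRIDABLE_FIELDS & set(smoke_test)
    if illegal:
        raise ValueError(f"Smoke test must not override fields: {sorted(illegal)}")
```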
It's really annoying to deal with parameter/argument conflicts. This is even more frustrating when we merge code from the community into Ant's internal code base, with hundreds of conflicts caused by parameters/arguments.
In this PR, I updated the clang-format style so that parameters/arguments are placed on separate lines if they can't fit on a single line.
There are several benefits:
* Conflict resolving is easier.
* Fewer potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
* Fix the normal task resources at GCS
* Fix comments
* Leave a TODO
* Bring back a UT
* Consider object memory
* Fix
Co-authored-by: Chong-Li <lc300133@antgroup.com>
Previously, placement group had suboptimal bin-packing resulting in unexpected placement group stalls for users.
The root cause is the lack of sorting of placement group bundles by resource priority and size.
This PR implements a naive priority mechanism for bundles in the GCS resource scheduler that can be improved upon (and even made user-configurable) in the future.
The behaviour is to schedule "GPU" first, then custom resources in int64_t order, then memory, and finally "CPU" last.
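A rough Python sketch of the priority order described above, purely for illustration; the actual implementation lives in the C++ GCS resource scheduler and orders custom resources by their int64_t resource IDs:

```python
def bundle_priority(bundle: dict) -> int:
    # Lower values are scheduled first: GPU bundles, then bundles requesting
    # custom resources, then memory, then plain CPU bundles.
    if bundle.get("GPU", 0) > 0:
        return 0
    if any(name not in ("CPU", "GPU", "memory") for name in bundle):
        return 1
    if bundle.get("memory", 0) > 0:
        return 2
    return 3

bundles = [{"CPU": 4}, {"GPU": 1, "CPU": 2}, {"accelerator": 1}, {"memory": 2**30}]
bundles.sort(key=bundle_priority)  # GPU -> custom -> memory -> CPU
```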
Run benchmark tests on k8s as well.
Note that until k8s cluster stability is confirmed, we will run the same tests twice, on both AWS and k8s. Once all benchmark tests look stable, we will start the full migration.