Currently, all Buildkite runs report by default. Instead, we only want to report when running scheduled builds or when this behavior is specifically overridden.
Infra errors are tackled with concurrency groups, so we can disable old mitigation methods like automatic infra retry for now.
We keep the script since it performs other logic (e.g. checking out the local test branch), and infra retry can still be enabled via an environment variable if needed.
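A minimal sketch of such an environment variable toggle (the variable name `BUILDKITE_INFRA_RETRY`, the exception type, and the retry count are illustrative assumptions, not the actual implementation):

```python
import os

class InfraError(Exception):
    """Assumed marker exception for infrastructure failures."""

# Hypothetical toggle: infra retry stays off unless explicitly requested.
INFRA_RETRY_ENABLED = os.environ.get("BUILDKITE_INFRA_RETRY", "0") == "1"

def maybe_retry_on_infra_error(run_test, max_retries=2):
    # Only retry when the env variable opts in; otherwise run exactly once.
    attempts = (max_retries + 1) if INFRA_RETRY_ENABLED else 1
    for attempt in range(attempts):
        try:
            return run_test()
        except InfraError:
            if attempt == attempts - 1:
                raise
```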
This PR fixes broken k8s tests.
Use exponential backoff on the unstable HTTP path (fetching the job status sometimes gets the connection broken by the server; unfortunately, I couldn't find the relevant logs to figure out why this happens).
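A minimal sketch of the backoff pattern, assuming a `requests`-based job status call (the endpoint and parameter names are illustrative):

```python
import time
import requests

def get_job_status(url, max_retries=5, base_delay=1.0):
    # Retry transient connection failures with exponentially growing delays.
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise
            # Sleep 1s, 2s, 4s, 8s, ... between attempts.
            time.sleep(base_delay * (2 ** attempt))
```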
Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal available_resources. I fixed the issue by excluding node IP resources from the check.
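A sketch of the fixed check, assuming the usual Ray resource dicts where per-node IP resources appear under keys like `node:172.31.0.1`:

```python
import ray

def check_no_resource_leak():
    # Drop per-node IP entries (e.g. "node:172.31.0.1"), which job
    # submission partially consumes (0.001), before comparing.
    cluster = {k: v for k, v in ray.cluster_resources().items()
               if not k.startswith("node:")}
    available = {k: v for k, v in ray.available_resources().items()
                 if not k.startswith("node:")}
    assert cluster == available, (cluster, available)
```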
K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but not by much.
Fixes a potential error when a function is not found in the Azure SDK while deploying a Ray cluster on Azure.
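A hedged sketch of the kind of guard this adds (the helper name and error message are illustrative, not the actual SDK call):

```python
# Illustrative guard: fail with a clear message instead of an AttributeError
# if the installed Azure SDK version lacks the expected function.
def get_sdk_function(module, name):
    fn = getattr(module, name, None)
    if fn is None:
        raise RuntimeError(
            f"{name} not found in {module.__name__}; "
            "please check your installed Azure SDK version.")
    return fn
```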
Adds to the docs the additional Python package needed to deploy a Ray cluster on Azure.
Co-authored-by: Scott Graham <scgraham@microsoft.com>
Of all smoke test arguments, `frequency` is the only required one, so we should check for it. Additionally, not all fields should be overridable (e.g. `legacy` or `name`), so we enforce this as well.
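A sketch of this validation, assuming smoke test overrides arrive as a dict (the field names `frequency`, `legacy`, and `name` come from the description; everything else is illustrative):

```python
NON_OVERRIDABLE_FIELDS = {"legacy", "name"}

def validate_smoke_test(smoke_test_config: dict):
    # frequency is the only required smoke test argument.
    if "frequency" not in smoke_test_config:
        raise ValueError("Smoke test config must specify a frequency.")
    # Some fields must never be overwritten by smoke test overrides.
    forbidden = NON_OVERRIDABLE_FIELDS & set(smoke_test_config)
    if forbidden:
        raise ValueError(f"Smoke test config must not override: {forbidden}")
```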
It's really annoying to deal with parameter/argument conflicts. This is especially frustrating when we merge code from the community into Ant's internal code base, with hundreds of conflicts caused by parameters/arguments.
In this PR, I updated the clang-format style so that parameters/arguments are placed on separate lines if they can't fit on a single line (see the illustration after the list).
There are several benefits:
* Conflict resolution is easier.
* Fewer potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
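For illustration, this mirrors the Python (Black) convention the last point refersms to: when a call doesn't fit on one line, each argument gets its own line, so a change to one argument touches only that line (the function below is a dummy, purely for formatting):

```python
def schedule_task(task_id, resources, placement_group, retries, timeout):
    # Dummy function purely to illustrate argument formatting.
    return (task_id, resources, placement_group, retries, timeout)

# Before: all arguments on one line; any change rewrites the whole line.
result = schedule_task("t1", {"CPU": 1}, None, 3, 60)

# After (one argument per line): diffs, merges, and git blame touch
# only the line that actually changed.
result = schedule_task(
    "t1",
    {"CPU": 1},
    None,
    3,
    60,
)
```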
* Fix the normal task resources at GCS
* Fix comments
* Leave a TODO
* Bring back a UT
* Consider object memory
* Fix
Co-authored-by: Chong-Li <lc300133@antgroup.com>
Previously, placement groups had suboptimal bin-packing, resulting in unexpected placement group stalls for users.
The root cause is the missing implementation of sorting placement group bundles by resource priority and size.
This PR implements a naive priority mechanism for bundles that can be improved upon (and even config by user in the future) in the GCS resource scheduler.
The behavior is to schedule "GPU" first, custom resources in int64_t order next, then memory, and finally "CPU" last.
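An illustrative Python sketch of that ordering (the real implementation lives in the C++ GCS resource scheduler; the resource names and the simplified custom-resource ordering here are assumptions):

```python
def bundle_sort_key(bundle: dict):
    # Priority classes: GPU first, custom resources next, memory, then CPU.
    # Within a class, larger demands pack first (hence the negated amount).
    def resource_priority(name):
        if name == "GPU":
            return 0
        if name not in ("CPU", "memory"):
            return 1  # custom resources
        return 2 if name == "memory" else 3
    return sorted((resource_priority(k), -v) for k, v in bundle.items())

bundles = [{"CPU": 4}, {"GPU": 1, "CPU": 2}, {"memory": 1e9}, {"custom": 2}]
bundles.sort(key=bundle_sort_key)
# Result: GPU bundle first, custom next, memory next, plain CPU last.
```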
Run benchmark tests on k8s as well.
Note that until the k8s cluster's stability is confirmed, we will run the same tests twice, on both AWS and k8s. Once all benchmark tests look stable, we will start the full migration.
These changes add Dataset Read API support for (1) specifying custom block metadata provider callbacks, and (2) skipping path expansion. When paired with a custom block metadata provider that maintains an in-memory cache of BlockMetadata for each input file path, these changes reduced average S3-based dataset read times for production [Redshift Manifests](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) stored in Amazon's internal data catalog by over 90%. A simple ParquetDatasource benchmark reading 144MM records across 100 ~70MiB (on-disk) Parquet files stored in S3 showed an ~75% reduction in read latency (from 4.62 seconds to 1.18 seconds on 2 r5n.8xlarge EC2 nodes).
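A hedged sketch of how a cached metadata provider might be wired in (the provider class and the `meta_provider` parameter are illustrative assumptions about the new API surface, not its confirmed shape):

```python
import ray

# Hypothetical provider that returns cached BlockMetadata per file path,
# avoiding a per-file metadata fetch from S3 on every read.
class CachedMetadataProvider:
    def __init__(self, metadata_cache):
        self._cache = metadata_cache  # path -> precomputed metadata

    def __call__(self, path, *args, **kwargs):
        return self._cache.get(path)

# Assumed usage: pass concrete file paths (so no path expansion is needed)
# together with the caching provider.
ds = ray.data.read_parquet(
    ["s3://bucket/file-0.parquet", "s3://bucket/file-1.parquet"],
    meta_provider=CachedMetadataProvider(metadata_cache={}),
)
```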
For example, long-running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs now count as small-gpu and 9+ GPUs as large-gpu, instead of 7 for small-gpu and 8 for large-gpu).
The `schema_to_deployment()` function preserves unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.
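A sketch of the pattern, assuming Pydantic models as Ray Serve's schemas use (the field names here are illustrative):

```python
from pydantic import BaseModel, Field

class DeploymentSchema(BaseModel):
    # Illustrative fields; defaults are empty dicts rather than None.
    ray_actor_options: dict = Field(default_factory=dict)
    user_config: dict = Field(default_factory=dict)

def schema_to_deployment_options(schema: DeploymentSchema) -> dict:
    # exclude_unset=True drops fields the user never set, so unexpected
    # defaults are not forwarded to the deployment.
    return schema.dict(exclude_unset=True)
```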
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storage as effective as spilling locally.
In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to an S3 bucket resulted in the following runtimes.
Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513
Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with more memory available, it is worth configuring a larger buffer size.
For processes that reach the throughput limits imposed by S3, we can remove that bottleneck by supporting more prefixes/buckets. The impact is less noticeable here because the performance gains from using a large buffer already keep us away from the bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and a varying number of prefixes (a configuration sketch follows the table).
Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742
Together these changes enable faster large-scale object spilling.
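For reference, a sketch of how the buffer size and multiple prefixes can be configured through Ray's smart_open-based spilling config (exact keys may vary by Ray version; the bucket names are placeholders):

```python
import json
import ray

ray.init(_system_config={
    "max_io_workers": 4,
    "object_spilling_config": json.dumps({
        "type": "smart_open",
        "params": {
            # Multiple prefixes spread spill load across S3 partitions.
            "uri": [
                "s3://bucket/spill-prefix-0",
                "s3://bucket/spill-prefix-1",
                "s3://bucket/spill-prefix-2",
            ],
            # 1MB buffer (the new default) reduces remote I/O requests.
            "buffer_size": 1024 * 1024,
        },
    }),
})
```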
Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
Next move of #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util & common modules.
The placement group part, which exposes user interfaces in Java/Python, is excluded as it's a bit more complicated.
Follow-up PRs will migrate the core worker, placement group, and other modules.
This PR is part of the resource reporting refactoring. In this PR, the Ray syncer is moved from gcs_resource_manager to gcs_placement_group_scheduler. With this change, gcs_resource_manager is fully decoupled from resource broadcasting.
This PR mainly does the following:
- Deletes the proto cache on RuntimeEnv, ensuring that the user's modifications to RuntimeEnv take effect in the Proto message.
- Validates the whole runtime env when serializing runtime_env.
- Overloads the `__setitem__` method to parse and validate a field whenever it is modified (see the sketch below).
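A minimal sketch of the `__setitem__` pattern (the per-field validator registry is illustrative; the real set of fields and parsers is larger):

```python
class RuntimeEnv(dict):
    # Illustrative per-field validators/parsers.
    _validators = {
        "pip": lambda v: list(v),
        "working_dir": str,
    }

    def __setitem__(self, key, value):
        # Parse and validate the field at modification time so the
        # serialized proto always reflects a valid value.
        validator = self._validators.get(key)
        if validator is not None:
            value = validator(value)
        super().__setitem__(key, value)
```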
Separate out the conversion of a pandas DataFrame to a torch tensor into a utility function so that the same logic can be reused in other places in Ray ML (for example, during inference).
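A sketch of what such a utility might look like (the function name and signature are illustrative):

```python
import pandas as pd
import torch

def convert_pandas_to_torch_tensor(df: pd.DataFrame, columns=None, dtype=None):
    # Reusable conversion: select columns (if given) and build one tensor,
    # so training and inference share the exact same logic.
    if columns is not None:
        df = df[columns]
    return torch.as_tensor(df.to_numpy(), dtype=dtype)

# Example usage:
batch = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})
features = convert_pandas_to_torch_tensor(batch, dtype=torch.float32)
```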