This fixes the mysterious error where all cluster env builds fail when `pip uninstall` / `pip install` are written as two separate lines. The root cause will be fixed later.
OSS release tests currently run with a hardcoded Python 3.7 base. In the future we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The `python` field determines both the base image used in the Buildkite runner Docker container (for Ray Client compatibility) and the base image for the Anyscale cluster environments.
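As a rough illustration, here is a hedged sketch of how a runner might map the `python` field to a base image tag; the helper name and the image naming scheme below are assumptions, not the actual release-test tooling.

```python
# Hypothetical sketch only: helper name and image naming scheme are illustrative.
DEFAULT_PYTHON = "3.7"

def base_image_for_test(test_config: dict) -> str:
    python_version = test_config.get("python", DEFAULT_PYTHON)
    # e.g. "3.8" -> "py38"
    tag = "py" + python_version.replace(".", "")
    return f"anyscale/ray:nightly-{tag}"
```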
Note that in Buildkite we will still wait only for the Python 3.7 base image before kicking off tests. That is acceptable: we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait perhaps 5-10 more minutes.
Use a separate compute config with smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for the `dataset_shuffle_*` tests.
Related issue number
#24176: the added tests verify the instance types that support the new shuffle implementation.
Copied from #23784.
Adding a large-scale nightly test for Datasets `random_shuffle` and `sort`. The test script generates random blocks and reports the total run time and peak driver memory.
Modified to fix lint.
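A minimal, hedged sketch of the benchmark pattern described above; the data size is a placeholder, not the actual nightly test parameters.

```python
import resource
import time

import ray

ray.init()

start = time.time()
# Placeholder scale; the real nightly test uses much larger data.
ds = ray.data.range(1_000_000)
ds.random_shuffle()
ds.sort()
print(f"Total run time: {time.time() - start:.1f} s")

# ru_maxrss is reported in KiB on Linux.
peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"Peak driver memory: {peak_mib:.0f} MiB")
```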
Use spot instances for chaos tests.
We can also experiment with other tests that aren't supposed to have dead nodes, but let's do that once the nightly infra has stabilized.
This PR adds experimental support for random access to datasets. A Dataset can be random-access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`, which creates a `RandomAccessDataset`.
RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.
Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers.
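A minimal usage sketch, assuming an Arrow-format dataset keyed by an `id` column; the data and worker count are illustrative.

```python
import pandas as pd
import ray

ray.init()

# Build a small keyed dataset; a real workload would be much larger.
df = pd.DataFrame({"id": range(1000), "value": range(1000)})
ds = ray.data.from_pandas(df)

# Partition and sort by "id", creating 4 worker actors for lookups.
rad = ds.to_random_access_dataset(key="id", num_workers=4)

# Single-record lookup; get_async() returns a future.
record = ray.get(rad.get_async(42))

# Batched lookups amortize per-call overhead for higher throughput.
records = rad.multiget([7, 11, 13])
```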
This PR adds support for a job-based file manager and runner, which will be the backbone of the k8s migration. The PR handles edge cases that existed in the old `e2e.py` job-based runners.
This PR enables stage fusion for dataset pipelines (see the sketch after the list below). This also requires:
1. Removing the `num_cpus=0.5` default for the read stage, to enable fusion of the read stage.
2. Removing `spread_resource_prefix` (not supported for now).
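For illustration, a hedged sketch of a pipeline where the read stage can now fuse with the following map stage; the data size and window setting are placeholders.

```python
import ray

# With the num_cpus=0.5 default removed from the read stage, the read and map
# stages of each window can be fused into a single stage.
pipe = (
    ray.data.range(100_000)
    .window(blocks_per_window=10)
    .map(lambda x: x * 2)
)
pipe.show(5)
```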
- Separate spread scheduling and default hybrid scheduling (i.e. `SpreadScheduling != HybridScheduling(threshold=0)`): they are already separated in the API layer and have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round-robin for spread scheduling: this is just a starting implementation and can be optimized later (a toy sketch follows this list).
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
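A toy sketch of the round-robin idea mentioned above; the real implementation lives in the C++ scheduler, so this is only an illustration of the policy.

```python
import itertools

class RoundRobinSpread:
    """Illustrative only: cycle through candidate nodes for SPREAD tasks."""

    def __init__(self, node_ids):
        self._cycle = itertools.cycle(node_ids)

    def select_node(self):
        return next(self._cycle)

# Usage sketch:
policy = RoundRobinSpread(["node-1", "node-2", "node-3"])
assignments = [policy.select_node() for _ in range(6)]
# -> ["node-1", "node-2", "node-3", "node-1", "node-2", "node-3"]
```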
It seems the S3 read sometimes fails (#22214). I found that the file actually does exist in S3, so this is very likely a transient error. This PR adds a retry mechanism to avoid the issue.
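A hedged sketch of the kind of retry wrapper this refers to; the function name, exception type, and backoff parameters are illustrative assumptions, not the actual change.

```python
import time

def read_with_retry(read_fn, max_attempts=5, initial_backoff_s=1.0):
    """Retry a flaky read with exponential backoff (illustrative only)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except OSError:
            # Transient S3 errors often surface as OSError; re-raise on the last try.
            if attempt == max_attempts:
                raise
            time.sleep(initial_backoff_s * 2 ** (attempt - 1))
```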
Fix the dask_on_ray large-scale test on K8s. Basically, chmod requires root access, which we don't have by default in the k8s cluster. I don't think we need chmod (I verified that the test passes without it).
The first migration of tests to k8s. We are adopting a conservative approach (migrate slowly while keeping the existing test suites). Once things are confirmed to be stable, we will migrate faster.
This fixes the previous problems from the team column revert. It has 2 additional changes:
The alert handler now receives the team argument, which was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289
Previously, tests without a team column raised an exception, but I made the condition weaker (warning logs). I will eventually change it back to raising an exception, but for a smoother transition we will log a warning instead for a short time.
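A minimal sketch of the relaxed check, assuming a dict-like test definition; the function and field names are illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def check_team(test: dict) -> None:
    # Illustrative: warn (rather than raise) when the "team" field is missing,
    # to allow a smoother transition; this may become an exception later.
    if not test.get("team"):
        logger.warning("Test %s has no team specified.", test.get("name", "<unknown>"))
```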
RAY_GCS_ACTOR_SCHEDULING_ENABLED is wrong; it should be RAY_gcs_actor_scheduling_enabled. Since GCS-based actor scheduling is not enabled yet, I just removed this flag.