Malinda
1d789aee63
[RLlib/Serve/Release tests] A few code refactorings for better use of efficient NumPy functions. ( #26284 )
2022-07-27 22:38:35 +02:00
Jian Xiao
923209895d
Pipelined training test: change num of windows; log the ingestion perf ( #26429 )
...
Why are these changes needed?
Improve test perf
Log the perf stats
With 2 windows there is a lot of spilling, which slows down the throughput.
2022-07-11 11:03:35 -07:00
Stephanie Wang
a90e53b76f
[core] Add weekly test for 100TB random shuffle ( #25908 )
...
Adds a CI test for 100TB shuffle.
There is a custom config for this nightly test to: (1) make sure each node gets 4TB of storage, (2) head node has 0 CPUs, (3) worker nodes have half their actual vCPU count.
Related issue number
Closes #24480 .
2022-07-01 13:30:07 -07:00
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule ( #25695 )
...
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
Stephanie Wang
293c122302
[dataset] Use polars for sorting ( #25454 )
2022-06-17 12:26:46 -07:00
Eric Liang
94dec83a60
[data] Rename data.impl to data._internal ( #25486 )
2022-06-06 11:39:53 -07:00
SangBin Cho
ca75570f51
Revert "Revert "Revert "[dataset] Use polars for sorting ( #24523 )" ( #24781 )" ( #25173 )" ( #25341 )
...
This reverts commit 61676f26d3.
2022-06-01 10:49:12 -07:00
Stephanie Wang
61676f26d3
Revert "Revert "[dataset] Use polars for sorting ( #24523 )" ( #24781 )" ( #25173 )
...
Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed.
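The "use polars if available" pattern can be sketched roughly as below. This is a minimal illustration, not the code from the PR: `sort_rows` is a hypothetical helper, and the fallback here uses the Python standard library rather than the pyarrow-based sort mentioned above.

```python
from typing import Dict, List

try:
    import polars as pl  # optional dependency; only used when installed
except ImportError:
    pl = None


def sort_rows(columns: Dict[str, List[int]], key: str) -> Dict[str, List[int]]:
    """Sort a column-oriented table by `key` (hypothetical helper)."""
    if pl is not None:
        # Fast path: let polars do the sort.
        return pl.DataFrame(columns).sort(key).to_dict(as_series=False)
    # Fallback path (assumption: stdlib sort stands in for the pyarrow sort).
    order = sorted(range(len(columns[key])), key=lambda i: columns[key][i])
    return {name: [values[i] for i in order] for name, values in columns.items()}
```

Callers see the same result on either path, which is why no API change is needed.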
On my laptop, this makes sorting 1GB about 2x faster:
without polars
$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 50.23415923118591
...
Stage 2 sort: executed in 38.59s
Substage 0 sort_map: 100/100 blocks executed
* Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total
* Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total
* Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
* Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Substage 1 sort_reduce: 100/100 blocks executed
* Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total
* Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total
* Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total
* Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
with polars
$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 24.097432136535645
...
Stage 2 sort: executed in 14.02s
Substage 0 sort_map: 100/100 blocks executed
* Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total
* Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total
* Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
* Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Substage 1 sort_reduce: 100/100 blocks executed
* Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total
* Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total
* Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total
* Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Related issue number
Closes #23612 .
2022-05-27 10:43:51 -07:00
SangBin Cho
ec653e3196
[Nightly test] Move two line downloads to one line. ( #25061 )
...
Fixes a mysterious error in which every cluster env build fails when pip uninstall / pip install is written on 2 lines. The root cause will be fixed later.
2022-05-22 00:07:03 -07:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions ( #24843 )
...
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions.
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait for perhaps 5-10 more minutes.
2022-05-17 17:03:12 +01:00
Clark Zinzow
ef870e936c
[Datasets] Change range_arrow() API to range_table() ( #24704 )
...
This PR renames ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.
2022-05-17 01:09:45 -07:00
Jiajun Yao
863943a540
Add perf alert for shuffle tests ( #24798 )
...
Add perf alert for shuffle tests so we can catch #24740 earlier.
2022-05-15 21:50:18 -07:00
Chen Shen
2be45fed5e
Revert "[dataset] Use polars for sorting ( #24523 )" ( #24781 )
...
This reverts commit c62e00e.
See if reverting this resolves the linux://python/ray/tests:test_actor_advanced failure.
2022-05-13 12:09:12 -07:00
Stephanie Wang
c62e00ed6d
[dataset] Use polars for sorting ( #24523 )
...
Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed.
On my laptop, this makes sorting 1GB about 2x faster:
without polars
$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 50.23415923118591
...
Stage 2 sort: executed in 38.59s
Substage 0 sort_map: 100/100 blocks executed
* Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total
* Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total
* Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
* Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Substage 1 sort_reduce: 100/100 blocks executed
* Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total
* Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total
* Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total
* Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
with polars
$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 24.097432136535645
...
Stage 2 sort: executed in 14.02s
Substage 0 sort_map: 100/100 blocks executed
* Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total
* Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total
* Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
* Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Substage 1 sort_reduce: 100/100 blocks executed
* Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total
* Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total
* Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total
* Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total
* Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used
Related issue number
Closes #23612 .
2022-05-12 18:35:50 -07:00
Stephanie Wang
fbbc9c33d6
Add nightly tests for push-based shuffle ( #24352 )
...
Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.
2022-05-02 11:35:14 -07:00
mwtian
afdfd20a5b
[Release tests] Create compute config for new dataset shuffle tests ( #24239 )
...
Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests.
Related issue number
#24176 : the added tests would verify the instance types which support the new shuffle implementations.
2022-04-27 11:50:12 -07:00
Chen Shen
5c461519f3
Revert "[core] Use cheaper AWS m5 instances for shuffle tests ( #23781 )"
...
This reverts commits 717e60c and 4aa854a.
2022-04-25 17:56:08 -07:00
Chen Shen
717e60cb4d
[Core][nightly-test] fix shuffle 5000 partition OOM #23997
...
closes #23992
#23781 changed the machine type, dropping the memory capacity from 128GB to 64GB, so shuffle_1tb_5000_partitions started OOMing.
2022-04-18 23:49:51 -07:00
Stephanie Wang
71e142b1fa
[core][tests] Add nightly test for datasets random_shuffle and sort ( #23807 )
...
Copied from #23784 .
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
Modified to fix lint.
2022-04-12 12:53:57 -07:00
Archit Kulkarni
7a1a7e1844
Revert "[core][tests] Add nightly test for datasets random_shuffle and sort ( #23784 )" ( #23805 )
...
This reverts commit ba484feac0.
Broke lint.
2022-04-08 13:18:13 -07:00
Stephanie Wang
ba484feac0
[core][tests] Add nightly test for datasets random_shuffle and sort ( #23784 )
...
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
2022-04-08 11:31:10 -07:00
Stephanie Wang
4aa854aa23
[core] Use cheaper AWS m5 instances for shuffle tests ( #23781 )
2022-04-07 19:05:42 -07:00
SangBin Cho
47ff1241f9
[Test] Use spot instances for chaos tests. ( #23679 )
...
Use spot instances for chaos tests.
We can also experiment with other tests that aren't supposed to have dead nodes, but let's do it once the nightly infra is stabilized.
2022-04-06 15:56:31 -07:00
Jiajun Yao
a668e5d8db
Add perf metrics for stress tests ( #23648 )
...
Added perf metrics for stress tests so they can be alerted on.
2022-04-05 08:09:27 +09:00
Chen Shen
c3e04ab275
[nightly-test] try out spot instances for chaos test #23507
2022-03-27 20:10:21 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) ( #22749 )
...
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.
RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.
Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
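The lookup idea (partition by sort key, then binary search) can be sketched without Ray as follows. This is an illustration only: `SortedBlockIndex` and its methods are hypothetical names, not the RandomAccessDataset implementation, and the blocks are assumed to be non-empty, sorted, and non-overlapping.

```python
import bisect
from typing import Any, List, Optional, Tuple

Block = List[Tuple[int, Any]]  # (key, record) pairs, sorted by key


class SortedBlockIndex:
    """Hypothetical two-level index over sorted, non-overlapping blocks."""

    def __init__(self, blocks: List[Block]):
        self._keys = [[k for k, _ in block] for block in blocks]
        self._records = [[r for _, r in block] for block in blocks]
        # Largest key of each block, used to find the block that may hold a key.
        self._upper_bounds = [keys[-1] for keys in self._keys]

    def get(self, key: int) -> Optional[Any]:
        # First binary search: which block could contain the key?
        i = bisect.bisect_left(self._upper_bounds, key)
        if i == len(self._keys):
            return None
        # Second binary search: position of the key inside that block.
        j = bisect.bisect_left(self._keys[i], key)
        if j < len(self._keys[i]) and self._keys[i][j] == key:
            return self._records[i][j]
        return None

    def multiget(self, keys: List[int]) -> List[Optional[Any]]:
        return [self.get(k) for k in keys]
```

Each lookup costs two binary searches, one over block boundaries and one inside the chosen block.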
2022-03-17 15:01:12 -07:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure ( #23134 )
...
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address ( #23067 )
...
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner ( #22860 )
...
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.
The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
Stephanie Wang
1b45582e43
[tests] Enable chaos testing for Dask-on-Ray ( #22927 )
...
Turns on failures for Dask-on-Ray chaos tests.
2022-03-09 18:08:41 -05:00
SangBin Cho
9d0148dbbe
[Test] Migrate the first test to the new infra ( #22770 )
...
This migrates the simplest nightly test to the new infra. I will also explore k8s migration with this test.
2022-03-06 18:24:54 -08:00
SangBin Cho
2c1184592e
mark threaded actor test unstable ( #22696 )
2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. ( #22665 )
2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702
2022-02-28 14:00:18 -08:00
Eric Liang
e15a419028
Enable stage fusion by default for dataset pipelines ( #22476 )
...
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
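As a rough illustration of what fusing stages means, the sketch below composes consecutive per-block stages into one callable, so a block passes through all of them in a single call. It assumes stage fusion amounts to running compatible stages back to back in one task; `fuse` and the `Stage` type are hypothetical and not part of the Datasets API.

```python
from functools import reduce
from typing import Callable, List

Block = List[int]
Stage = Callable[[Block], Block]


def fuse(stages: List[Stage]) -> Stage:
    """Combine per-block stages into a single stage (hypothetical helper)."""

    def fused(block: Block) -> Block:
        # Apply each stage in order without handing off intermediate blocks.
        return reduce(lambda data, stage: stage(data), stages, block)

    return fused


def double(block: Block) -> Block:
    return [x * 2 for x in block]


def increment(block: Block) -> Block:
    return [x + 1 for x in block]


fused_stage = fuse([double, increment])
```

Calling `fused_stage(block)` gives the same result as `increment(double(block))`, but as one unit of work.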
2022-02-23 17:34:05 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling ( #21303 )
...
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
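A simple round robin over nodes, as described in the second bullet, might look like the sketch below. It is a toy model in Python, not the scheduler code from the PR: the class name, the CPU-only feasibility check, and the node IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class RoundRobinSpreadScheduler:
    """Hypothetical round-robin picker that skips infeasible nodes."""

    def __init__(self, node_ids: List[str]):
        self._node_ids = list(node_ids)
        self._next = 0  # index of the node to try first on the next call

    def pick_node(self, available_cpus: Dict[str, float], required_cpus: float) -> Optional[str]:
        n = len(self._node_ids)
        for offset in range(n):
            index = (self._next + offset) % n
            node = self._node_ids[index]
            if available_cpus.get(node, 0.0) >= required_cpus:
                # Resume after the chosen node so consecutive tasks spread out.
                self._next = (index + 1) % n
                return node
        return None  # no node can currently fit the task
```

Starting each search one past the last chosen node is what spreads consecutive tasks across the cluster.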
2022-02-18 15:05:35 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479
2022-02-17 15:20:39 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression #22441
...
We added a warmup time in timeit, which affects the pg benchmark time accounting. This adds an option to skip the warmup.
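A timing helper with an optional warmup could be sketched as follows. This is a standalone illustration; the function signature and the `warmup_time_sec` parameter name are assumptions and may not match the actual helper.

```python
import time
from typing import Callable


def timeit(fn: Callable[[], None], duration_sec: float = 1.0, warmup_time_sec: float = 1.0) -> float:
    """Return calls per second for `fn`, measured after an optional warmup."""
    # Warmup: run fn without counting. With warmup_time_sec <= 0 this is skipped.
    warmup_end = time.perf_counter() + warmup_time_sec
    while time.perf_counter() < warmup_end:
        fn()

    # Measurement: count calls until duration_sec has elapsed.
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_sec:
        fn()
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed
```

Passing `warmup_time_sec=0` disables the warmup, which matters when the benchmark itself accounts for every call made.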
2022-02-16 16:24:51 -08:00
SangBin Cho
42361a1801
[Test] Fix Dask on Ray 1 TB bug #22431
...
Fixes a bug. It seems that `not df` does not work with a dataframe.
2022-02-17 02:44:36 +09:00
SangBin Cho
640d92c385
It seems like the S3 read sometimes fails; #22214 . I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
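A retry wrapper for transient read failures, in the spirit of the fix described above, could look like this sketch. The helper name, the exponential backoff, and the choice of `OSError` as the retryable exception are assumptions; the PR's actual retry logic may differ.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_sec: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call `fn`, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait longer after each failure: base, 2x base, 4x base, ...
            time.sleep(base_delay_sec * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

A read would then be wrapped as `retry(lambda: read_file(path))`, where `read_file` stands in for whatever function performs the S3 read.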
2022-02-12 11:58:58 +09:00
Chen Shen
0866a5558f
[Dataset][nightly-test] pin pyarrow==4.0.1 for dataset related tests ( #22277 )
...
* pin pyarrow==4.0.1
* address comments
2022-02-10 14:22:41 -08:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader ( #22185 )
2022-02-07 15:23:34 -08:00
Jiajun Yao
355ee4a02c
Fix nightly shuffle_data_loader by pinning down dependencies versions ( #22183 )
2022-02-07 11:25:30 -08:00
Chen Shen
13819304d4
[Core][nightly-test] better way of calculating num features ( #22158 )
...
* better filter of column length
* address comments
* more
2022-02-07 02:13:40 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black ( #21975 )
...
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Chen Shen
bfe3e5f4a8
add check on shape ( #21947 )
2022-01-28 12:27:43 -08:00
Jiajun Yao
cea80b1a5b
Don't advertise cpus on gpu nodes for pipelined ingestion tests ( #21899 )
...
* Don't advertise cpus on gpu nodes for pipelined ingestion tests
2022-01-27 09:17:01 -08:00
SangBin Cho
ac5f38d7fd
[Test] Fix dask on ray test on K8s ( #21816 )
...
Fix the Dask on Ray large-scale test on K8s. chmod requires root access, which we don't have by default in the k8s cluster. I don't think we need chmod (I verified the test passes without it).
2022-01-24 15:09:22 -08:00
SangBin Cho
6b4aac7a08
Promote unstable tests to stable ( #21811 )
...
Promote tests that have passed 100% of the time over the last week to stable.
2022-01-24 02:10:37 -08:00
SangBin Cho
babc03edf2
Add a threaded actor k8s test ( #21739 )
...
Add threaded actor flaky test to k8s.
2022-01-23 20:12:57 -08:00