Commit graph

11694 commits

Author SHA1 Message Date
Jialing He
207d93a52c
[runtime env] Make env_vars take effect when pip install packages (#22730)
Previously, for the stability of pip installation, we set the env to empty, but when pip installs some gzip packages, env_vars may be needed, as in this issue: https://github.com/ray-project/ray/issues/22610
2022-03-02 21:47:34 -06:00
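A minimal sketch of the idea behind #22730, not the code from the PR: user-supplied `env_vars` are layered onto the environment handed to the install subprocess. The helper names `build_pip_env` and `run_with_env` are hypothetical, and starting from the parent environment (rather than an empty one) is an assumption made so the example runs anywhere.

```python
import os
import subprocess
import sys


def build_pip_env(env_vars, base_env=None):
    """Return the environment for an install subprocess (hypothetical helper)."""
    # Assumption: start from the parent environment unless a base is given.
    env = dict(os.environ if base_env is None else base_env)
    # User-supplied env_vars win over inherited values.
    env.update(env_vars)
    return env


def run_with_env(args, env_vars):
    """Run a command with env_vars applied and return its stdout."""
    result = subprocess.run(
        args,
        env=build_pip_env(env_vars),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for "pip install": the child process reads the variable.
    code = "import os; print(os.environ['EXAMPLE_FLAG'])"
    print(run_with_env([sys.executable, "-c", code], {"EXAMPLE_FLAG": "1"}))
```

In a real install the `args` list would be a pip command line; the child process here only demonstrates that the variable is visible to it.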
mwtian
b98c9c77f1
Revert "Disable scheduler_report_pinned_bytes_only (#22132)" (#22786)
This reverts commit 88d2e21585.
2022-03-02 18:29:31 -08:00
Chen Shen
e8c823791b
[scheduling-ids] enforce thread-private #22775 2022-03-02 16:27:49 -08:00
mwtian
02d09da7b4
[Core] remove verbose logs (#22785)
IIUC, these log statements added in #22612 do not seem intended.
2022-03-02 16:00:26 -08:00
Clark Zinzow
fa44ec82f3
Add Parquet metadata resolution nightly test to test set. (#22787) 2022-03-02 14:56:00 -08:00
Archit Kulkarni
e937f1a3c4
[runtime env] [Doc] add more details about runtime env logs (#22480)
Clarifies the logging behavior for runtime envs, and adds the runtime env logs file to the list of log files in the main logging page.
2022-03-02 14:27:28 -08:00
Dmitri Gekhtman
a8d8d0e1a6
Fix K8s API (#22756)
This PR fixes K8s support by updating the api client used for ingresses.
2022-03-02 09:59:16 -08:00
Jiajun Yao
440732f267
Fix mac osx worker process not being killed by ray stop (#22758)
On macOS, setproctitle changes only the cmdline, not the process name returned by psutil (I think it's this issue: https://github.com/dvarrazzo/py-setproctitle/issues/10), so we need to filter by cmdline instead.
2022-03-02 09:02:48 -08:00
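The filtering change in #22758 can be illustrated roughly as follows. This is a sketch under assumptions, not the code from the PR: `ProcInfo` is a hypothetical stand-in for the name and cmdline fields a process listing exposes, and the sample process titles are illustrative values only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcInfo:
    """Hypothetical stand-in for one entry of a process listing."""
    pid: int
    name: str
    cmdline: List[str]


def matches_by_cmdline(proc: ProcInfo, keyword: str) -> bool:
    # Match on the first cmdline element; an empty cmdline never matches.
    return bool(proc.cmdline) and keyword in proc.cmdline[0]


def select_pids(procs: List[ProcInfo], keyword: str) -> List[int]:
    """Return the pids whose cmdline (not name) contains the keyword."""
    return [p.pid for p in procs if matches_by_cmdline(p, keyword)]


if __name__ == "__main__":
    procs = [
        # The name is unchanged while the cmdline carries the new title.
        ProcInfo(pid=101, name="python3.9", cmdline=["ray::IDLE"]),
        ProcInfo(pid=102, name="python3.9", cmdline=["python", "app.py"]),
    ]
    print(select_pids(procs, "ray::"))
```

Filtering on `name` would miss process 101 in this sample, because only its cmdline reflects the title that was set.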
Kai Fricke
7425fa6212
[ci/release] Add support for concurrency groups (#22728)
This PR adds concurrency groups to Buildkite release test runs with the new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from the cluster resources used.

Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)
2022-03-02 16:35:54 +01:00
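How a concurrency group might be inferred from cluster resources can be sketched as below. Only the five group names come from the commit message for #22728; the thresholds, the function names, and the `concurrency_group` config key are assumptions made for illustration.

```python
# Illustrative thresholds; the real values used by the release test package
# are not stated in the commit message.
GPU_LARGE_THRESHOLD = 8
CPU_LARGE_THRESHOLD = 128
CPU_MEDIUM_THRESHOLD = 32


def infer_concurrency_group(num_cpus: int, num_gpus: int = 0) -> str:
    """Map cluster resources to one of the five concurrency groups."""
    if num_gpus > 0:
        return "large-gpu" if num_gpus >= GPU_LARGE_THRESHOLD else "small-gpu"
    if num_cpus >= CPU_LARGE_THRESHOLD:
        return "large"
    if num_cpus >= CPU_MEDIUM_THRESHOLD:
        return "medium"
    return "small"


def resolve_concurrency_group(test_config: dict, num_cpus: int, num_gpus: int = 0) -> str:
    """Prefer a manually specified group, else infer one from resources."""
    # "concurrency_group" is a hypothetical config key.
    manual = test_config.get("concurrency_group")
    return manual or infer_concurrency_group(num_cpus, num_gpus)


if __name__ == "__main__":
    print(resolve_concurrency_group({}, num_cpus=64))
    print(resolve_concurrency_group({"concurrency_group": "large"}, num_cpus=4))
```

A manual setting takes precedence here, which mirrors the "if not specified manually" wording in the commit message.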
Jiajun Yao
04a1a19f6b
[Release Test] Send release test result to db pipeline (#22667)
Send release test result to db pipeline
Add perf metrics for microbenchmark so that we can alert on them
2022-03-02 06:19:31 -08:00
Max Pumperla
d53d0e0f50
[docs] Typo - fixes #22761 (#22763)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:34:46 +01:00
Kai Fricke
a9bf5e9e2f
[ci] Update GPU docker image to Ubuntu 20.04 (#22759)
This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.
2022-03-02 10:28:26 +01:00
Max Pumperla
7d4296c72f
run code in browser (#22727)
Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand.
If this seems useful we can extend this to other examples gradually.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:27:00 +01:00
Chen Shen
3e3db8e9cd
[scheduler] hide StringIDMap under BaseSchedulingID (#22722)
* add

* address comments
2022-03-01 22:50:53 -08:00
Yi Cheng
271ed44143
[2][resource reporting] Encapsulate poller and broadcaster into syncer in gcs (#22464)
This PR moves the poller and broadcaster from gcs server to ray syncer.

TODO in next PR: deprecate the code path of placement group resource reporting and move the broadcaster out of gcs cluster resource manager.
2022-03-01 21:51:14 -08:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Stephanie Wang
d97afb9e60
[data] Pin pipeline executor actors to the driver node (#22715)
DatasetPipeline execution is coordinated by a pool of actors and optionally the driver process. To recover from failures with lineage reconstruction, we need to keep these actors alive as long as the driver is alive. Currently, they are spread randomly throughout the cluster, so they can be killed during a node failure.

This PR pins the actors to the same node as the driver so that they will survive any other node failures. It's also okay if the driver node dies, since the driver itself will also die.
2022-03-01 18:06:14 -08:00
Dmitri Gekhtman
4acbf36453
[dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688)
Closes #21353 and fixes an issue that causes the dashboard to read K8s CPU requests rather than resources when determining CPUs available.
2022-03-01 17:15:59 -08:00
Eric Liang
06d4444b4a
Never re-use task workers for actors or GPU tasks (#22482)
Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit.

This adds a slight performance penalty for actor startup.
2022-03-01 16:46:18 -08:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Eric Liang
1a170f7234
[RFC] Disable actor queueing warning for concurrent actors (#22720)
The warning was not implemented properly for out of order actors. Disable it for now.
2022-03-01 14:28:19 -08:00
Sven Mika
0af100ffae
[RLlib] Fix tree.flatten dict ordering bug: flatten_space([obs_space]) should produce same struct as tree.flatten([obs]). (#22731) 2022-03-01 21:24:24 +01:00
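The ordering issue named in the title of #22731 can be illustrated with a small pure-Python sketch. It assumes the mismatch arises because `tree.flatten` visits dict keys in sorted order while a hand-written flatten follows insertion order; the two functions below are hypothetical stand-ins and not RLlib or dm-tree code.

```python
def flatten_sorted(struct):
    """Flatten nested dicts/lists/tuples, visiting dict keys in sorted order."""
    if isinstance(struct, dict):
        return [leaf for key in sorted(struct) for leaf in flatten_sorted(struct[key])]
    if isinstance(struct, (list, tuple)):
        return [leaf for item in struct for leaf in flatten_sorted(item)]
    return [struct]


def flatten_insertion(struct):
    """Same traversal, but dict keys are visited in insertion order."""
    if isinstance(struct, dict):
        return [leaf for value in struct.values() for leaf in flatten_insertion(value)]
    if isinstance(struct, (list, tuple)):
        return [leaf for item in struct for leaf in flatten_insertion(item)]
    return [struct]


if __name__ == "__main__":
    obs = {"b": 1, "a": 2}
    obs_space = {"b": "space_b", "a": "space_a"}
    # Mixed orderings pair each observation with the wrong sub-space.
    print(list(zip(flatten_insertion([obs_space]), flatten_sorted([obs]))))
    # A shared ordering keeps spaces and observations aligned.
    print(list(zip(flatten_sorted([obs_space]), flatten_sorted([obs]))))
```

With both sides flattened by the same rule, each leaf of the space lines up with the matching leaf of the observation.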
Eric Liang
e228544d39
Undo revert of windowing dataset by bytes (#22735) 2022-03-01 12:24:04 -08:00
Archit Kulkarni
127b69bc21
[runtime env] Fix protobuf serialization/deserialization (#22672)
This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath.  The test checks that `to_dict` and `from_dict` are inverses.  This PR contains all fixes required to make the test pass.
2022-03-01 12:34:50 -06:00
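The inverse property that the test in #22672 checks can be sketched as a round-trip assertion. `RuntimeEnvSketch` and its fields are hypothetical and do not reproduce the runtime env protobuf; the sketch only shows the shape of such a test.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RuntimeEnvSketch:
    """Hypothetical stand-in for a serializable runtime env."""
    working_dir: str = ""
    pip: List[str] = field(default_factory=list)
    env_vars: Dict[str, str] = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Copy containers so the dict does not alias the object's state.
        return {
            "working_dir": self.working_dir,
            "pip": list(self.pip),
            "env_vars": dict(self.env_vars),
        }

    @classmethod
    def from_dict(cls, data: dict) -> "RuntimeEnvSketch":
        return cls(
            working_dir=data.get("working_dir", ""),
            pip=list(data.get("pip", [])),
            env_vars=dict(data.get("env_vars", {})),
        )


def is_round_trip_stable(env: RuntimeEnvSketch) -> bool:
    """Check that to_dict and from_dict are inverses in both directions."""
    as_dict = env.to_dict()
    restored = RuntimeEnvSketch.from_dict(as_dict)
    return restored == env and restored.to_dict() == as_dict


if __name__ == "__main__":
    env = RuntimeEnvSketch(working_dir="/tmp/app", pip=["requests"], env_vars={"A": "1"})
    print(is_round_trip_stable(env))
```

Checking both directions catches a field that is written by `to_dict` but dropped or renamed by `from_dict`, and the reverse.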
Kenneth
9b67cb5a6f
Add buffering to object spilling (#22618)
This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several orders of magnitude.

### Performance Changes
A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling.

| Run | Produce (s) | Consume (s) | Total (s) |
| -- | -- | -- | -- |
| Baseline (original) | 347.332281 | 355.611272 | 705.560750 |
| Baseline (w/ fix) | 181.815852 | 347.692850 | 532.847759 |
| No fusing (original) | 453.574554 | 525.047998 | 981.620108 |
| No fusing (w/ fix) | 452.614848 | 519.787698 | 975.412639 |

The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to cut produce time by 48% (347.3 s to 181.8 s) and total time by about 24%, with negligible impact on runtime when fusing is disabled.

See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
2022-03-01 10:13:10 -08:00
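The buffering described in #22618 can be illustrated with Python's built-in `open`, whose `buffering` argument sets the buffer size for a binary file. This is a sketch of the general technique, not Ray's spilling code; the function name `spill_objects` and the 1 MiB buffer size are assumptions.

```python
import os
import tempfile

# Assumed buffer size for illustration; the commit message does not state one.
BUFFER_SIZE = 1024 * 1024


def spill_objects(path, payloads, buffer_size=BUFFER_SIZE):
    """Write many small payloads through one buffered file handle."""
    # With buffering > 1 in binary mode, small writes are collected in memory
    # and flushed to disk in larger chunks.
    with open(path, "wb", buffering=buffer_size) as f:
        for payload in payloads:
            f.write(payload)
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "spilled.bin")
        payloads = [b"x" * 512 for _ in range(1000)]
        print(spill_objects(target, payloads))
```

Fusing many objects into one file only pays off if the individual writes are not each sent to the disk on their own, which is what the buffer provides.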
Eric Liang
482b0117e8
Basic log observability for spilling (#22612) 2022-03-01 09:40:51 -08:00
Edward Oakes
2a09561edf
[serve] Enable REST API tests with main clause (#22706)
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-03-01 11:21:22 -06:00
Sven Mika
e50bd212a1
[RLlib] Disable flakey Pendulum-v1 tests (until further investigation). (#22686) 2022-03-01 16:44:17 +01:00
Daniel
8d1f1b0a64
[RLlib] Update pettingzoo==1.15.0 supersuit==3.3.3 (#22519) 2022-03-01 11:23:27 +01:00
simonsays1980
568cf28dd4
[RLlib] Example script custom_metrics_and_callbacks.py should work for batch_mode=complete_episodes. (#22684) 2022-03-01 09:00:38 +01:00
Jun Gong
e8be45065e
[RLlib] Restore policies on eval_workers as well. (#22641) 2022-03-01 08:38:14 +01:00
Simon Mo
0bab8dbfe0
[Serve] Add test for controller managing Java Replica (#22628) 2022-02-28 23:13:56 -08:00
Kai Fricke
d06c3ffd6f
[release] Migrate Tune + XGBoost tests to new infrastructure (#22705)
Migrate XGBoost and Tune tests to new release testing infrastructure.

https://buildkite.com/ray-project/release-tests-branch/builds/50
2022-03-01 08:10:06 +01:00
Chen Shen
7b22d662df
[clean up ClusterResourceScheduler 2/n] Introduce random policy in the scheduling policy #22712 2022-02-28 20:38:55 -08:00
Chen Shen
dfcb0f5de5
[clean up ClusterResourceScheduler 1/n] move IsSchedulable logic into ClusterResourceManager #22711 2022-02-28 20:37:56 -08:00
Jian Xiao
aeb0a0dcbe
Add a static factory method to BlockBuilder to instantiate concrete builders (#22634)
This is useful for combining multiple applied groups produced by groupby().map_groups() into a single one. For example, builder = BlockBuilder.for_block(type(batch)), and then for each applied group, builder.add_block(applied_group).
2022-02-28 19:00:24 -08:00
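The static factory pattern from #22634 can be sketched with stand-in classes. Everything below is hypothetical: `BlockBuilderSketch` and its two concrete builders are not Ray's classes and only mirror the `for_block` and `add_block` usage quoted in the commit message.

```python
class BlockBuilderSketch:
    """Hypothetical base class with a static factory for concrete builders."""

    def add_block(self, block):
        raise NotImplementedError

    def build(self):
        raise NotImplementedError

    @staticmethod
    def for_block(block_type):
        # Pick the concrete builder from the type of an existing block.
        if block_type is list:
            return ListBuilderSketch()
        if block_type is dict:
            return ColumnBuilderSketch()
        raise TypeError(f"Unsupported block type: {block_type!r}")


class ListBuilderSketch(BlockBuilderSketch):
    """Concatenates list blocks."""

    def __init__(self):
        self._rows = []

    def add_block(self, block):
        self._rows.extend(block)

    def build(self):
        return list(self._rows)


class ColumnBuilderSketch(BlockBuilderSketch):
    """Concatenates dict-of-lists blocks column by column."""

    def __init__(self):
        self._columns = {}

    def add_block(self, block):
        for name, values in block.items():
            self._columns.setdefault(name, []).extend(values)

    def build(self):
        return {name: list(values) for name, values in self._columns.items()}


if __name__ == "__main__":
    applied_groups = [[1, 2], [3], [4, 5]]
    builder = BlockBuilderSketch.for_block(type(applied_groups[0]))
    for group in applied_groups:
        builder.add_block(group)
    print(builder.build())
```

The caller never names a concrete builder, so combining groups works the same way whichever block format the batch happens to use.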
Simon Mo
00935275ae
[Serve] Autoscaling: basic intelligent scale down (#22669) 2022-02-28 20:46:06 -06:00
shrekris-anyscale
49ee443231
[serve] Add Serve CLI commands for REST API (#22648) 2022-02-28 20:45:46 -06:00
Stephanie Wang
73f078236f
[doc] Update docs about actor garbage collection (#20763)
Update outdated actor docs about when actors are GCed.
2022-02-28 18:45:29 -08:00
Jian Xiao
7597f1590b
[Dataset] fix some comments (#22700) 2022-02-28 17:13:43 -08:00
Jiaxin Shan
32829ff9ad
[KubeRay] Provide a new Dockerfile for fast build (#22689)
Adds a new Dockerfile for fast build and development of KubeRay.
2022-02-28 17:09:16 -08:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00
Chris K. W
fa6b3c7c89
[aws][autoscaler] fix regional default AMI's (#22506)
The AMI's for ray.head.default and ray.worker.default in defaults.yaml supersede the default AMI for the region (defaults get merged in before _check_ami is called, which causes problems if the region isn't us-west-2). Removes the default AMI from defaults.yaml, and aborts if the user doesn't specify an AMI in a region without a default.
2022-02-28 15:52:57 -08:00
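The lookup order that #22506 describes can be sketched as follows: a user-specified AMI wins, then the regional default, and otherwise the launch aborts. This is an illustration under assumptions, not the autoscaler's code; the table contents are placeholders and `resolve_ami` is a hypothetical name.

```python
# Placeholder table; real AMI ids are deliberately not reproduced here.
DEFAULT_AMIS = {
    "us-west-2": "ami-PLACEHOLDER-usw2",
    "us-east-1": "ami-PLACEHOLDER-use1",
}


def resolve_ami(node_config: dict, region: str) -> str:
    """Return the AMI for a node type, or abort if none can be determined."""
    # A user-specified image always takes precedence.
    user_ami = node_config.get("ImageId")
    if user_ami:
        return user_ami
    # Otherwise fall back to the default for this region, if one exists.
    if region in DEFAULT_AMIS:
        return DEFAULT_AMIS[region]
    raise ValueError(
        f"No default AMI for region {region!r}; specify ImageId in the node config."
    )


if __name__ == "__main__":
    print(resolve_ami({}, "us-east-1"))
    print(resolve_ami({"ImageId": "ami-user"}, "eu-west-1"))
```

Because no AMI is baked into the merged defaults in this sketch, the regional table is actually consulted instead of being shadowed.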
jon-chuang
3bc0858a4f
[Core/GCS] remove default 100 concurrent rate limit for heartbeat (#22613)
better scalability

Closes https://github.com/ray-project/ray/issues/20773
2022-02-28 15:26:05 -08:00
SangBin Cho
2c1184592e
mark threaded actor test unstable (#22696) 2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702 2022-02-28 14:00:18 -08:00
Simon Mo
fe3d501d68
[Core] Include java worker log with log monitor (#22629) 2022-02-28 12:30:04 -08:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
SangBin Cho
ba4f1423c7
Revert "Support creating a DatasetPipeline windowed by bytes (#22577)" (#22695)
This reverts commit b5b4460932.
2022-02-28 11:56:12 -08:00