hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-04 17:41:43 -05:00

Author	SHA1	Message	Date
Amog Kamsetty	ea6d53dbf3	[CI/AIR] Cleanup Deprecated ml_utils (#28278 ) ray.util.ml_utils was deprecated in Ray 2.0. This PR does some final cleanup of our CI pipeline Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2022-09-06 08:39:21 +01:00
Philipp Moritz	2a0ff1b4d8	[docs] Document using a different separator for read_csv (#27850 ) See discussion in https://github.com/ray-project/ray/issues/27738 Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2022-09-05 16:47:22 -07:00
Yi Cheng	10e9422f8f	[core] Rename `OVERRIDE_NODE_ID_FOR_TESTING` to `RAYLET_NODE_ID` to make it a feature (#28275 ) This PR changed the OVERRIDE_NODE_ID_FOR_TESTING to RAYLET_NODE_ID so that this is a feature which can be used to start raylet with a given raylet id by setting os env RAY_RAYLET_NODE_ID.	2022-09-05 14:22:06 -07:00
Jialing He	ce70b8b96e	[Job Submission][refactor 2/N] introduce job agent (#28203 )	2022-09-03 18:42:02 +08:00
XiaodongLv	a31be7cef1	[Ray][xlang]Setting async flag for Python actor actor in Java (#28149 ) It's important that setting async flag for Python actor in Java for us. So we added the API which is named "PyActorCreator setAsync(boolean enabled)" based on PyActorCreator, To avoid misuse for user， we check the flag before the ActorCreationTask is executed.	2022-09-03 11:09:19 +08:00
shrekris-anyscale	3b7346ab50	[Runtime Environment] Parse special characters in private Git URIs (#28250 )	2022-09-02 16:37:04 -07:00
Justin Yu	9cf5df2c81	Add ray.widgets to be linked in setup dev script (#27984 )	2022-09-02 11:44:06 -07:00
Kai Fricke	5d31f2d4bc	[tune] Run SigOpt tests in CI (#28225 )	2022-09-02 09:10:01 -07:00
Kai Fricke	3590a86db0	[tune] Add timeout ro retry_fn to catch hanging syncs (#28155 ) Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-09-02 12:52:26 +01:00
Kilian Lieret	77722b86fd	[AIR] Fix deprecated import of MLflowLoggerCallback (#28247 ) Signed-off-by: Kilian Lieret <kilian.lieret@posteo.de>	2022-09-01 17:55:59 -07:00
Amog Kamsetty	b83f10dbde	[Docs] [Train] Update Train API reference and docs (#28192 ) Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com Adds back more Ray Train APIs to Ray Train docs. Also makes updates to the user guide for better references.	2022-09-01 17:47:42 -07:00
zcin	4c970cc882	[serve] Visualize Deployment Graph with Gradio (#27897 )	2022-09-01 10:46:15 -07:00
Antoni Baum	48898aa03d	[AIR][CI] Speed up HF CI by ~20% (#28208 ) Speeds up HuggingFaceTrainer/Predictor tests in CI by around ~20% by switching to a different GPT model. This is the same model Hugging Face team uses for their own CI. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>	2022-09-01 18:18:10 +01:00
Philipp Moritz	1bba65705a	[doc] Convert custom datetime column when reading a CSV file (#27854 ) Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>	2022-08-31 21:25:28 -07:00
Yi Cheng	d0b879cdb1	[workflow] Change name in step to task_id (#28151 ) We've deprecated the name options and use task_id. This is the cleanup to fix everything left.	2022-08-31 20:27:32 -07:00
clarng	65fdd720f9	[core] memory monitor observability improvements: add metrics and log message (#27716 ) Add more observability and record events when the raylet kills a task or actor due to memory usage going above threshold.	2022-08-31 13:50:40 -07:00
Artur Niederfahrenhorst	f420407b0d	[ML] Pin Pydantic <= 1.9.2 (#28205 ) CI is red because of a dependency issue around dataclass_transform . Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-08-31 13:35:18 -07:00
xwjiang2010	958c22a0b0	[tune] Update GPU warning message in tune. (#28167 ) Mention scaling config / with resources Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-08-31 12:29:09 -07:00
Jiajun Yao	5e2437923d	[Core] Remove unused args for default_worker.py (#28177 ) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-30 21:43:02 -07:00
Peyton Murray	ffe12a5f10	[Tune] Add rich output for ray tune progress updates in notebooks (#26263 ) These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook. Previously, part of the trial progress was reported in an HTML table before; now, all progress is displayed in an organized HTML template. Signed-off-by: pdmurray <peynmurray@gmail.com>	2022-08-30 15:09:40 -07:00
Balaji Veeramani	dad98dcabd	[AIR] Add `TorchCheckpoint.from_state_dict` (#27970 ) PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this. Signed-off-by: Balaji Veeramani balaji@anyscale.com	2022-08-30 13:05:30 -07:00
Antoni Baum	d7f712d202	[AIR] Split train dataset in `HuggingFaceTrainer` (#28170 ) https://github.com/ray-project/ray/pull/25428 inadvertently turned off train dataset splitting for the `HuggingFaceTrainer`, which meant it wasn't actually running in a data parallel fashion. This PR fixes that. Signed-off-by: Antoni Baum antoni.baum@protonmail.com	2022-08-30 12:35:44 -07:00
SangBin Cho	f74f155af4	Revert "Revert "Revert "[serve][xlang]Support deploying Python deploy… (#28153 ) this starts breaking Mac java build with new errors; I think it is the same issue as before why we reverted this PR …ment from Java. …" (#27945)" This reverts commit `af488e1`.	2022-08-30 12:00:29 -07:00
Kai Fricke	42dc034503	[ci] Pin moto to >= 4.0.0, adjust API (#28099 ) If this passes, it should be preferred over #28098. Adjust moto setup to use new API. Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com>	2022-08-30 11:39:32 -07:00
Antoni Baum	13457dab03	[AIR] Fix HF checkpointing with same-node workers (#28154 ) If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>	2022-08-30 11:24:13 -07:00
Ian Rodney	adf875b4ce	[Cleanup] Update Put error message (#28050 ) We allow tasks to return ObjectRefs. I'm not sure when this support was added, but I think for quite a while.	2022-08-30 08:35:20 -07:00
Jiajun Yao	2c6a960733	Don't include script directory in sys.path if it's started via python -m (#28140 ) Redo #28043 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-30 08:33:27 -07:00
Ian Rodney	8934a8d32b	[Raylet][Cleanup] Remove Extra Indent & Fix Typo (#28073 ) * Rename `is_existing` to `is_exiting` * Redundant `if statement`. This is covered by: `6bedaa5c87/python/ray/_raylet.pyx (L581)`	2022-08-29 15:32:36 -07:00
Amog Kamsetty	acc4903db1	[AIR/Serve] Auto-enable GPU Prediction (#26549 ) Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2022-08-29 13:47:56 -07:00
ZhuSenlin	c7a3bcc232	[Core] fix resource leak when cancel actor in phase of creating (#27742 ) t is quite easy to cause resource/process leak when cancel an actor which constructor is time-consuming.	2022-08-29 09:38:16 -07:00
Jiajun Yao	e6b0d5f95d	Revert "Don't include script directory in sys.path if it's started via python -m (#28043 )" (#28139 ) This reverts commit `b41ee37c3a`.	2022-08-26 21:37:21 -07:00
Kilian Lieret	67a7481972	[docs/tune] Fix loguniform range in tune tutorial (#28131 )	2022-08-26 17:08:00 -07:00
Amog Kamsetty	00f6273775	[Docs] [Tune] `ResultGrid` Docs and API reference (#28068 ) Improve docstring for ResultGrid and show API reference and docstring in Tune API section. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-08-26 16:50:35 -07:00
Kai Fricke	f59dcdc049	[tune] Re-enable progress metric detection (#28130 ) The API cleanup in #27060 introduced a regression when merging latest master - changes from #26967 were effectively disabled, retaining cluttered output in rllib with verbose=2. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-26 16:44:20 -07:00
Kai Fricke	d0678b80ed	[rfc] [air/tune/train] Improve trial/training failure error printing (#27946 ) When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run. The main problems here are: 1. Tracebacks are printed multiple times: In the remote worker and on the driver 2. Tracebacks include many internal wrappers The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured). The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code. ### Deduplicating traceback printing The solution here is to use `logger.error` instead of `logger.exception` in the `function_trainable.py` to avoid printing a traceback in the trainable. Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist). To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs per default. ### Removing internal wrappers from tracebacks The solution here is to introcude a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stacktrace. A utility `shorten_tb` looks for the last occurence of `_ray_start_tb` in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned - this means that if the error does not come up in user code, the full traceback is returned, giving visibility in possible internal bugs. Additionally there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-26 15:02:38 -07:00
Antoni Baum	ea483ecf7a	[AIR][Docs] Clarify how LGBM/XGB trainers work (#28122 )	2022-08-26 14:51:22 -07:00
Kai Fricke	3b3aa80ba3	[tune/ci] Fix link to SigOpt experiment API (#28127 ) Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-26 14:10:53 -07:00
Jiajun Yao	b41ee37c3a	Don't include script directory in sys.path if it's started via python -m (#28043 ) According to https://peps.python.org/pep-0338/ > The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory contain the main module. We should follow this and don't add the driver script directory to worker's sys.path. I couldn't find a way to detect that the driver is run via `python -m` but instead we don't add the script directory to worker's sys.path if it doesn't exist in driver's sys.path.	2022-08-26 13:27:08 -07:00
Jiajie Li	6c69ee9a97	Add actor_id in RayActorError (#27802 ) For people who want to have better control over the node failures, and handle the error such as RayActorError by themselves. I think it's necessary to make things like actor_id as an attributed of the error. Signed-off-by: Jiajie Li <ljjsalt@gmail.com>	2022-08-26 10:46:08 -07:00
Kai Fricke	cf94a31e7a	[CI] Pin moto to < 4.0.0. (#28098 )	2022-08-25 07:55:25 -07:00
Kai Fricke	e0725d1f1d	[docs/ci] Fix (some) broken linkchecks (#28087 ) Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-25 04:41:35 -07:00
Steven Morad	ad2bf69548	[AIR; RLlib] Log histograms in wandb. (#28081 )	2022-08-24 08:21:14 -07:00
Cheng Su	debe0cc91f	[Datasets] Re-enable Parquet sampling and add progress bar (#28021 )	2022-08-22 16:59:26 -07:00
Dmitri Gekhtman	227aef381a	Update Kuberay version in CI. (#27967 ) Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0). Also removes a couple of old files. Will open a corresponding cherry pick in the Ray 2.0.0 branch. The key thing to verify is that the CI autoscaling test passes here and in the PR and in the PR against the 2.0.0 branch.	2022-08-20 14:50:52 -07:00
Alex Wu	f886d9737c	[autoscaler][observability] Provide more detailed events when autoscaler fails to launch a node. (#27891 ) This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things: Less verbose logging for node provider exceptions (printed to logs only, not driver) Don't print to driver "adding 1 node(s) of type ..." when nodes don't launch (still print it if the node launch is successful). Print to driver "Failed to launch ..." Don't log a full exception to the driver. The full driver event looks like this ``` Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c. ``` Co-authored-by: Alex <alex@anyscale.com>	2022-08-19 16:27:02 -07:00
xiaofeng	af488e1cc2	Revert "Revert "[serve][xlang]Support deploying Python deployment from Java. …" (#27945 )	2022-08-18 17:57:37 -07:00
Dmitri Gekhtman	98c90b8488	[clusters][docs] Provide urls to content, fix typos (#27936 )	2022-08-18 11:33:04 -07:00
Jian Xiao	440ae620eb	Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964 ) There is a risk of using too much of memory in StatsActor, because its lifetime is the same as cluster lifetime. This puts a cap on how many stats to keep, and purge the stats in FIFO order if this cap is exceeded.	2022-08-18 10:25:31 -07:00
Cheng Su	45e5e8c6ea	[Datasets] Customized serializer for Arrow JSON ParseOptions in read_json (#27911 ) This PR is to add customized serializer of Arrow JSON ParseOptions for read_json. We found user wanted to read JSON file with ParseOptions, but it's currently not working due to pickle issue (detail of post). So here we add a customized serializer for ParseOptions as a workaround for now, similar to #25821. Signed-off-by: Cheng Su <scnju13@gmail.com>	2022-08-18 10:00:56 -07:00
Chen Shen	6be4bf8be3	[hotfix] Fix pytest dependency in test_utils (#27956 ) import pytest in test_utils breaks a bunch of test.	2022-08-17 12:16:08 -07:00

1 2 3 4 5 ...

7584 commits