hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

No description

Find a file

Jian Xiao 2878119ece Optimize groupby/mapgroups performance (#27805 ) For the following script, it took 75-90 mins to finish the groupby().map_groups() before, and with this PR it finishes in less than 10 seconds. The slowness came from the `get_boundaries` routine which linearly loop over each row in the Pandas DataFrame (note: there's just one block in the script below, which had multiple million rows). We make it 1) operate on numpy arrow, 2) use binary search and 3) use native impl of bsearch from numpy. ``` import argparse import time import ray import numpy as np import pandas as pd from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from pyarrow import fs from pyarrow import dataset as ds from pyarrow import parquet as pq import pyarrow as pa import ray def transform_batch(df: pd.DataFrame): # Drop nulls. df['pickup_at'] = pd.to_datetime(df['pickup_at'], format='%Y-%m-%d %H:%M:%S') df['dropoff_at'] = pd.to_datetime(df['dropoff_at'], format='%Y-%m-%d %H:%M:%S') df['trip_duration'] = (df['dropoff_at'] - df['pickup_at']).dt.seconds df['pickup_location_id'].fillna(-1, inplace = True) df['dropoff_location_id'].fillna(-1, inplace = True) return df def train_test(rows): # if the group is too small, it cannot be split for train/test if len(rows.index) < 4: print(f"Dataframe for LocID: {rows.index} is empty") else: train, test = train_test_split(rows) train_X = train[["dropoff_location_id"]] train_y = train[['trip_duration']] test_X = test[["dropoff_location_id"]] test_y = test[['trip_duration']] reg = LinearRegression().fit(train_X, train_y) reg.score(train_X, train_y) pred_y = reg.predict(test_X) reg.score(test_X, test_y) error = np.mean(pred_y-test_y) # format output in dataframe (the same format as input) data = [[reg.coef_, reg.intercept_, error]] return pd.DataFrame(data, columns=["coef", "intercept", "error"]) start = time.time() rds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/01/", columns=['pickup_at', 'dropoff_at', "pickup_location_id", "dropoff_location_id"]) rds = rds.map_batches(transform_batch, batch_format="pandas") grouped_ds = rds.groupby("pickup_location_id") results = grouped_ds.map_groups(train_test) taken = time.time() - start ```		2022-08-17 11:08:18 -07:00
.buildkite	Revert "Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications"" (#27662 )	2022-08-12 15:12:20 -07:00
.github	[docs] Add codeowners for subdirectories (#27569 )	2022-08-05 11:37:15 -07:00
.gitpod	[CI] Check test files for `if __name__...` snippet (#25322 )	2022-06-02 10:30:00 +01:00
bazel	[runtime env] plugin refactor [7/n]: support runtime env in C++ API (#27010 )	2022-07-27 18:24:31 +08:00
binder	run code in browser (#22727 )	2022-03-02 10:27:00 +01:00
ci	[core] Add stats for the gcs backend for telemetry. (#27876 )	2022-08-16 17:02:04 -07:00
cpp	[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449 )	2022-08-05 16:07:13 -07:00
dashboard	[core] Don't override external dashboard URL in internal KV store (#27901 )	2022-08-16 22:48:05 -07:00
deploy	[K8s/Autoscaler] Added a field for the service account name (#27004 )	2022-07-26 19:47:18 -07:00
doc	[docs][serve] Fix linkcheck for production guide (#27941 )	2022-08-17 07:46:53 -07:00
docker	[Docker] Add Cuda 11.6 support (#26695 )	2022-07-26 10:15:53 -07:00
java	[Serve]Fix classloader bug in Java Deployment (#27899 )	2022-08-16 15:22:00 +08:00
python	Optimize groupby/mapgroups performance (#27805 )	2022-08-17 11:08:18 -07:00
release	[air/benchmarks] Measure local training time in torch/tf benchmarks (#27902 )	2022-08-16 19:16:08 +02:00
rllib	[RLlib] Remove unneeded args from offline learning examples. (#26666 )	2022-08-17 17:59:27 +02:00
scripts	[CI] Add bazel py_test checking for Serve (#25509 )	2022-06-07 10:54:10 -07:00
src	[Core] Suppress gRPC server alerting on too many keep-alive pings (#27769 )	2022-08-17 01:53:47 -07:00
thirdparty	Revert "Revert "[grpc] Upgrade grpc to 1.45.2"" (#24201 )	2022-04-26 10:49:54 -07:00
.bazelrc	[runtime env] plugin refactor[6/n]: java api refactor (#26783 )	2022-07-26 09:00:57 +08:00
.clang-format	[Lint] One parameter/argument per line for C++ code (#22725 )	2022-03-13 17:05:44 +08:00
.clang-tidy	[Lint] Disable `modernize-use-override` (#19368 )	2021-10-13 20:20:08 -07:00
.editorconfig	Improve .editorconfig entries (#7344 )	2020-02-26 19:05:36 -08:00
.flake8	[Streaming]Farewell : remove all of streaming related from ray repo. (#21770 )	2022-01-23 17:53:41 +08:00
.git-blame-ignore-revs	Create `.git-blame-ignore-revs` for black formatting (#25118 )	2022-05-23 21:55:57 -07:00
.gitignore	[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449 )	2022-08-05 16:07:13 -07:00
.gitpod.yml	[dev] Enable gitpod (#15420 )	2021-04-21 13:26:46 -07:00
.isort.cfg	Update import sorting blacklist, enable sorting for experimental dir (#26101 )	2022-07-12 21:25:58 -07:00
build-docker.sh	Bump Ray Version from 2.0.0.dev0 to 3.0.0.dev0 (#24894 )	2022-05-17 19:31:05 -07:00
BUILD.bazel	Replace boost::filesystem with std::filesystem (#27522 )	2022-08-04 21:33:51 -07:00
build.sh	Get rid of build shell scripts and move them to Python (#6082 )	2020-07-16 11:26:47 -05:00
CONTRIBUTING.rst	Link to the documentation on contributing from CONTRIBUTING.rst (#19396 )	2021-11-15 15:34:18 -08:00
LICENSE	[State Observability] Use a table format by default (#26159 )	2022-07-19 00:54:16 -07:00
pylintrc	RLLIB and pylintrc (#8995 )	2020-06-17 18:14:25 +02:00
README.rst	[docs] Minor polish on AIR getting started page (#27696 )	2022-08-09 11:24:18 -07:00
SECURITY.md	Create SECURITY.md (#21521 )	2022-01-11 08:54:51 -08:00
setup_hooks.sh	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 )	2022-04-13 18:11:30 +01:00
WORKSPACE	[CI] Bump Bazel version to 4.2.2 (#24242 )	2022-05-26 17:09:40 -07:00

README.rst

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/ray_header_logo.png

.. image:: https://readthedocs.org/projects/ray/badge/?version=master
    :target: http://docs.ray.io/en/master/?badge=master

.. image:: https://img.shields.io/badge/Ray-Join%20Slack-blue
    :target: https://forms.gle/9TSdDYUgxYs8SA9e8

.. image:: https://img.shields.io/badge/Discuss-Ask%20Questions-blue
    :target: https://discuss.ray.io/

.. image:: https://img.shields.io/twitter/follow/raydistributed.svg?style=social&logo=twitter
    :target: https://twitter.com/raydistributed

|

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute:

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/what-is-ray-padded.svg

..
  https://docs.google.com/drawings/d/1Pl8aCYOsZCo61cmp57c7Sja6HhIygGCvSZLi_AuBuqo/edit

Learn more about `Ray AIR`_ and its libraries:

- `Datasets`_: Distributed Data Preprocessing
- `Train`_: Distributed Training
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Serve`_: Scalable and Programmable Serving

Or more about `Ray Core`_ and its key abstractions:

- `Tasks`_: Stateless functions executed in the cluster.
- `Actors`_: Stateful worker processes created in the cluster.
- `Objects`_: Immutable values accessible across the cluster.

Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing
`ecosystem of community integrations`_.

Install Ray with: ``pip install ray``. For nightly wheels, see the
`Installation page <https://docs.ray.io/en/latest/installation.html>`__.

.. _`Serve`: https://docs.ray.io/en/latest/serve/index.html
.. _`Datasets`: https://docs.ray.io/en/latest/data/dataset.html
.. _`Workflow`: https://docs.ray.io/en/latest/workflows/concepts.html
.. _`Train`: https://docs.ray.io/en/latest/train/train.html
.. _`Tune`: https://docs.ray.io/en/latest/tune/index.html
.. _`RLlib`: https://docs.ray.io/en/latest/rllib/index.html
.. _`ecosystem of community integrations`: https://docs.ray.io/en/latest/ray-overview/ray-libraries.html


Why Ray?
--------

Today's ML workloads are increasingly compute-intensive. As convenient as they are, single-node development environments such as your laptop cannot scale to meet these demands.

Ray is a unified way to scale Python and AI applications from a laptop to a cluster.

With Ray, you can seamlessly scale the same code from a laptop to a cluster. Ray is designed to be general-purpose, meaning that it can performantly run any kind of workload. If your application is written in Python, you can scale it with Ray, no other infrastructure required.

More Information
----------------

- `Documentation`_
- `Ray Architecture whitepaper`_
- `Exoshuffle: large-scale data shuffle in Ray`_
- `Ownership: a distributed futures system for fine-grained tasks`_
- `RLlib paper`_
- `Tune paper`_

*Older documents:*

- `Ray paper`_
- `Ray HotOS paper`_

.. _`Ray AIR`: https://docs.ray.io/en/latest/ray-air/getting-started.html
.. _`Ray Core`: https://docs.ray.io/en/latest/ray-core/walkthrough.html
.. _`Tasks`: https://docs.ray.io/en/latest/ray-core/tasks.html
.. _`Actors`: https://docs.ray.io/en/latest/ray-core/actors.html
.. _`Objects`: https://docs.ray.io/en/latest/ray-core/objects.html
.. _`Documentation`: http://docs.ray.io/en/latest/index.html
.. _`Ray Architecture whitepaper`: https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview
.. _`Exoshuffle: large-scale data shuffle in Ray`: https://arxiv.org/abs/2203.05072
.. _`Ownership: a distributed futures system for fine-grained tasks`: https://www.usenix.org/system/files/nsdi21-wang.pdf
.. _`Ray paper`: https://arxiv.org/abs/1712.05889
.. _`Ray HotOS paper`: https://arxiv.org/abs/1703.03924
.. _`RLlib paper`: https://arxiv.org/abs/1712.09381
.. _`Tune paper`: https://arxiv.org/abs/1807.05118

Getting Involved
----------------

.. list-table::
   :widths: 25 50 25 25
   :header-rows: 1

   * - Platform
     - Purpose
     - Estimated Response Time
     - Support Level
   * - `Discourse Forum`_
     - For discussions about development and questions about usage.
     - < 1 day
     - Community
   * - `GitHub Issues`_
     - For reporting bugs and filing feature requests.
     - < 2 days
     - Ray OSS Team
   * - `Slack`_
     - For collaborating with other Ray users.
     - < 2 days
     - Community
   * - `StackOverflow`_
     - For asking questions about how to use Ray.
     - 3-5 days
     - Community
   * - `Meetup Group`_
     - For learning about Ray projects and best practices.
     - Monthly
     - Ray DevRel
   * - `Twitter`_
     - For staying up-to-date on new features.
     - Daily
     - Ray DevRel

.. _`Discourse Forum`: https://discuss.ray.io/
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/
.. _`Twitter`: https://twitter.com/raydistributed
.. _`Slack`: https://forms.gle/9TSdDYUgxYs8SA9e8