mirror of
https://github.com/vale981/ray
synced 2025-03-06 02:21:39 -05:00
[docs] Try to clarify some advantages of bulk ingest in the AIR ingest docs (#25616)
parent
f17ced04dd
commit
a058a98c5d
2 changed files with 8 additions and 3 deletions
2
.github/CODEOWNERS
vendored
@@ -7,7 +7,7 @@

 # ==== Documentation ====

-/doc/ @maxpumperla @pcmoritz @richardliaw @edoakes @simon-mo
+/doc/ @maxpumperla @pcmoritz @richardliaw @edoakes @simon-mo @ericl

 # ==== Ray core ====

@@ -97,7 +97,8 @@ Bulk Ingest
 ~~~~~~~~~~~

 By default, AIR loads all Dataset blocks into the object store at the start of training. This provides the best performance if the
-cluster has enough aggregate memory to fit all the data blocks in object store memory. Note that data often requires more space
+cluster has enough aggregate memory to fit all the data blocks in object store memory, or if your preprocessing step is expensive
+and you don't want it to be re-run on each epoch. Note that data often requires more space
 when loaded uncompressed in memory than when resident in storage.

 If there is insufficient object store memory, blocks may be spilled to disk during reads or preprocessing. Ray will print log messages
@@ -110,13 +111,17 @@ Streaming Ingest

 AIR also supports streaming ingest via the DatasetPipeline feature. Streaming ingest is preferable when you are using large datasets
 that don't fit into memory, and prefer to read *windows* of data from storage to minimize the active memory required for data ingest.
+Note that streaming ingest will re-execute preprocessing on each pass over the data. If preprocessing is a bottleneck, consider
+using bulk ingest instead for better performance.

 To enable streaming ingest, set ``use_stream_api=True`` in the dataset config. By default, this will configure streaming ingest with a window
 size of 1GiB, which means AIR will load ~1 GiB of data at a time from the datasource.
 Performance can be increased with larger window sizes, which can be adjusted using the ``stream_window_size`` config.
 A reasonable stream window size is something like 20% of available object store memory. Note that the data may be larger
 once deserialized in memory, or if individual files are larger than the window size.
-If the window size is set to -1, then an infinite window size (equivalent to bulk loading) will be used.
+If the window size is set to -1, then an infinite window size will be used. This case is equivalent to using bulk loading
+(including the performance advantages of caching preprocessed blocks), but still exposing a DatasetPipeline reader.

 .. warning::

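Editor's note: the window-size guidance in the streaming ingest text can be sketched in plain Python. This is a minimal sketch under stated assumptions, not AIR's API. The helper names ``suggest_stream_window_size`` and ``streaming_ingest_settings`` are hypothetical, and the 20% default is taken from the "something like 20%" guidance in the docs. Only the option names ``use_stream_api`` and ``stream_window_size`` and the ``-1`` value come from the diff; they are shown here as plain dictionary keys rather than as the real dataset config object.

```python
GIB = 1024 ** 3  # bytes in one GiB


def suggest_stream_window_size(object_store_bytes: int, percent: int = 20) -> int:
    """Hypothetical helper: window size in bytes as a share of object store memory.

    Integer arithmetic is used so the result is exact for large byte counts.
    """
    if object_store_bytes <= 0:
        raise ValueError("object_store_bytes must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return object_store_bytes * percent // 100


def streaming_ingest_settings(object_store_bytes: int, infinite_window: bool = False) -> dict:
    """Hypothetical helper: collect the two options named in the docs in a plain dict.

    With infinite_window=True the window size is -1, which the docs describe as
    equivalent to bulk loading while still exposing a DatasetPipeline reader.
    """
    window = -1 if infinite_window else suggest_stream_window_size(object_store_bytes)
    return {"use_stream_api": True, "stream_window_size": window}


if __name__ == "__main__":
    # Example: a cluster assumed to have 10 GiB of object store memory.
    print(streaming_ingest_settings(10 * GIB))
    print(streaming_ingest_settings(10 * GIB, infinite_window=True))
```

Because data may be larger once deserialized in memory, a value computed this way is only a starting point for tuning.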