1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -45,6 +45,7 @@ Please use the following BibTeX for citing our project if you find it useful.
installation
getting_started/index
async/index
io/index
optimization_guide/index
case_studies/index
migration/index
2 changes: 2 additions & 0 deletions docs/source/io/advanced.rst
@@ -0,0 +1,2 @@
Customizing Decoding Process
============================
134 changes: 134 additions & 0 deletions docs/source/io/augmentation.rst
@@ -0,0 +1,134 @@
.. _augmentation:

Using filters for augmentation
==============================

When using FFmpeg-based decoders, FFmpeg can also apply preprocessing via
`filters <https://ffmpeg.org/ffmpeg-filters.html>`_.

Client code can pass a custom filter string to the decoding functions.
There are helper functions that facilitate the construction of filter expressions
for common use cases.

- :py:func:`spdl.io.get_audio_filter_desc`
- :py:func:`spdl.io.get_video_filter_desc`
- :py:func:`spdl.io.get_filter_desc`
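
For example, the following sketch builds a filter description from keyword
arguments. (The parameters shown here also appear in later examples in this
document; the exact string returned depends on the arguments and the SPDL
version.)

.. code-block::

>>> import spdl.io
>>>
>>> # Build a filter description that rescales frames to 256x256
>>> # and converts the pixel format to interleaved RGB.
>>> filter_desc = spdl.io.get_video_filter_desc(
...     scale_width=256,
...     scale_height=256,
...     pix_fmt="rgb24",
... )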

.. note::

For image/video data, it is often more efficient to apply augmentation
before converting the frames to RGB format, rather than applying the same
augmentation to an RGB array afterward.

Many image/video files are encoded in YUV420 format, which has half the number
of pixels compared to RGB24.

The :py:func:`~spdl.io.get_video_filter_desc` function applies custom
filtering before changing the pixel format.

Custom Filtering
----------------

A filter description is a plain ``str`` object.
You can write a custom filter description yourself, or
pass a partial filter description to the helper functions listed above
(via their ``filter_desc`` argument).

.. note::

Filtering is also used to trim packets to a user-specified timestamp.
When demuxing and decoding audio/video for a specific time window
(i.e. using the ``timestamp`` option with ``get_audio_filter_desc`` /
``get_video_filter_desc``),
the packets returned by demuxers contain frames outside of the window,
because those frames are necessary to correctly decode the frames inside it.
Decoding therefore also produces frames outside of the window.
Filters such as ``trim`` and ``atrim`` are used to remove these frames.

Therefore, when you create a custom filter for audio/video,
make sure that the resulting filter removes frames outside of the specified window.
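
The following is a sketch of a hand-written video filter description for a
hypothetical ``timestamp=(5.0, 10.0)`` window. The leading ``trim`` filter
removes frames outside of the window, mirroring what the helper functions
generate, and ``setpts=PTS-STARTPTS`` (a standard companion of ``trim``)
resets the timestamps afterwards.

.. code-block::

# Keep only frames between 5 and 10 seconds, flip horizontally,
# then convert to interleaved RGB.
filter_desc = "trim=start=5:end=10,setpts=PTS-STARTPTS,hflip,format=pix_fmts=rgb24"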

Image augmentation
------------------

Using filters like ``hflip``, ``vflip``, ``rotate``, ``scale`` and ``crop``,
we can compose an augmentation pipeline.

For the details of each filter, please refer to
`the filter documentation <https://ffmpeg.org/ffmpeg-filters.html>`_.

.. code-block::

>>> import math
>>> import random
>>>
>>> import spdl.io
>>>
>>> def random_augmentation():
...     filters = []
...
...     # random_hflip
...     if bool(random.getrandbits(1)):
...         filters.append("hflip")
...
...     # random_vflip
...     if bool(random.getrandbits(1)):
...         filters.append("vflip")
...
...     # random_rotate +/- 30 deg (the ``rotate`` filter takes radians)
...     angle = math.radians(60 * random.random() - 30)
...     filters.append(f"rotate=angle={angle:.2f}")
...
...     # resize
...     filters.append("scale=256:256")
...
...     # random_crop
...     x_pos, y_pos = random.random(), random.random()
...     filters.append(f"crop=224:224:x={x_pos:.2f}*(iw-ow):y={y_pos:.2f}*(ih-oh)")
...
...     filter_desc = ",".join(filters)
...
...     return filter_desc


>>> def load_with_augmentation(src):
...     packets = spdl.io.demux_image(src)
...
...     filter_desc = random_augmentation()
...     frames = spdl.io.decode_packets(
...         packets,
...         filter_desc=spdl.io.get_video_filter_desc(
...             filter_desc=filter_desc, pix_fmt="rgb24"
...         ),
...     )
...
...     return spdl.io.convert_frames(frames)


This generates filter descriptions like the following:

.. code-block::

"hflip,rotate=angle=-0.05,scale=256:256,crop=224:224:x=0.18*(iw-ow):y=0.17*(ih-oh)"
"hflip,vflip,rotate=angle=-0.37,scale=256:256,crop=224:224:x=0.09*(iw-ow):y=0.96*(ih-oh)"
"rotate=angle=0.33,scale=256:256,crop=224:224:x=0.58*(iw-ow):y=0.57*(ih-oh)"
"hflip,vflip,rotate=angle=0.30,scale=256:256,crop=224:224:x=0.80*(iw-ow):y=0.35*(ih-oh)"
"hflip,vflip,rotate=angle=0.02,scale=256:256,crop=224:224:x=0.01*(iw-ow):y=0.25*(ih-oh)"
"vflip,rotate=angle=0.35,scale=256:256,crop=224:224:x=0.42*(iw-ow):y=0.69*(ih-oh)"
"hflip,rotate=angle=0.22,scale=256:256,crop=224:224:x=0.10*(iw-ow):y=0.03*(ih-oh)"
"hflip,rotate=angle=-0.18,scale=256:256,crop=224:224:x=0.65*(iw-ow):y=0.31*(ih-oh)"
"rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.37*(iw-ow):y=0.75*(ih-oh)"
"hflip,vflip,rotate=angle=0.01,scale=256:256,crop=224:224:x=0.27*(iw-ow):y=0.84*(ih-oh)"
"hflip,rotate=angle=-0.31,scale=256:256,crop=224:224:x=0.43*(iw-ow):y=0.92*(ih-oh)"
"hflip,rotate=angle=-0.27,scale=256:256,crop=224:224:x=0.96*(iw-ow):y=0.92*(ih-oh)"
"vflip,rotate=angle=-0.28,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.04*(ih-oh)"
"hflip,vflip,rotate=angle=0.08,scale=256:256,crop=224:224:x=0.84*(iw-ow):y=0.57*(ih-oh)"
"hflip,vflip,rotate=angle=0.41,scale=256:256,crop=224:224:x=0.24*(iw-ow):y=0.92*(ih-oh)"
"hflip,rotate=angle=-0.02,scale=256:256,crop=224:224:x=0.47*(iw-ow):y=0.87*(ih-oh)"
"hflip,rotate=angle=-0.15,scale=256:256,crop=224:224:x=0.73*(iw-ow):y=0.30*(ih-oh)"
"vflip,rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.91*(iw-ow):y=0.85*(ih-oh)"
"vflip,rotate=angle=0.28,scale=256:256,crop=224:224:x=0.62*(iw-ow):y=0.02*(ih-oh)"
"rotate=angle=0.24,scale=256:256,crop=224:224:x=0.85*(iw-ow):y=0.61*(ih-oh)"
"vflip,rotate=angle=-0.52,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.59*(ih-oh)"
"vflip,rotate=angle=0.06,scale=256:256,crop=224:224:x=0.08*(iw-ow):y=0.04*(ih-oh)"
"hflip,rotate=angle=0.50,scale=256:256,crop=224:224:x=0.23*(iw-ow):y=0.42*(ih-oh)"
"vflip,rotate=angle=0.18,scale=256:256,crop=224:224:x=0.54*(iw-ow):y=0.34*(ih-oh)"

Here are the resulting images.

.. image:: ../../_static/data/io_preprocessing_random_aug.png

116 changes: 116 additions & 0 deletions docs/source/io/basic.rst
@@ -0,0 +1,116 @@
Loading media as arrays
=======================

To simply load audio/video/image data from a file or in-memory buffer, you can use the following functions.

- :py:func:`spdl.io.load_audio`
- :py:func:`spdl.io.load_video`
- :py:func:`spdl.io.load_image`

They return an object of :py:class:`spdl.io.CPUBuffer` or :py:class:`spdl.io.CUDABuffer`.
The buffer object contains the decoded data as contiguous memory.
It implements `the array interface protocol <https://numpy.org/doc/stable/reference/arrays.interface.html>`_ or `the CUDA array interface <https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html>`_,
which allows the buffer to be converted into commonly used array classes without
making a copy.
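
Because a CPU buffer implements the array interface protocol, libraries that
understand the protocol can consume it directly. A minimal sketch, assuming a
decoded CPU buffer:

.. code-block::

import numpy as np
import spdl.io

buffer = spdl.io.load_image("my_image.jpg")

# ``np.asarray`` reads ``__array_interface__`` and wraps the
# buffer's memory without copying it.
array = np.asarray(buffer)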

SPDL has the following functions to cast the buffer to framework-specific classes.

- :py:func:`spdl.io.to_numpy` †
- :py:func:`spdl.io.to_torch`
- :py:func:`spdl.io.to_jax` †
- :py:func:`spdl.io.to_numba`

† :py:func:`~spdl.io.to_numpy` and :py:func:`~spdl.io.to_jax` support only CPU arrays.

.. admonition:: Example - Loading audio from bytes
:class: note

.. code-block::

data: bytes = download_from_remote_storage("my_audio.mp3")
buffer: CPUBuffer = spdl.io.load_audio(data)

# Cast to NumPy NDArray
array = spdl.io.to_numpy(buffer) # shape: (time, channel)

.. admonition:: Example - Loading image from file
:class: note

.. code-block::

buffer: CPUBuffer = spdl.io.load_image("my_image.jpg")

# Cast to PyTorch tensor
tensor = spdl.io.to_torch(buffer) # shape: (height, width, channel=RGB)

.. admonition:: Example - Loading video from remote
:class: note

.. code-block::

buffer: CPUBuffer = spdl.io.load_video("https://example.com/my_video.mp4")

# Cast to PyTorch tensor
tensor = spdl.io.to_torch(buffer) # shape: (time, height, width, channel=RGB)


By default, image/video are converted to interleaved RGB format (i.e. ``NHWC`` where ``C=3``),
and audio is converted to 32-bit floating-point samples with interleaved channels (i.e. channel-last and ``dtype=float32``).

To change the output format, you can customize the conversion behavior by providing
a custom ``filter_desc`` value.

You can use :py:func:`spdl.io.get_audio_filter_desc` and
:py:func:`spdl.io.get_video_filter_desc` (for image and video) to construct
a filter description.

.. admonition:: Example - Customizing audio output format
:class: note

The following code snippet shows how to decode audio into
16 kHz, monaural, 16-bit signed integer in planar format (i.e. channel-first).
It also fixes the duration to 5 seconds (80,000 samples) by truncating the
excess or padding silence at the end.

.. code-block::

buffer: CPUBuffer = spdl.io.load_audio(
"my_audio.wav",
filter_desc=spdl.io.get_audio_filter_desc(
sample_rate=16_000,
num_channels=1,
sample_fmt="s16p", # signed 16-bit, planar format
num_frames=80_000, # 5 seconds
)
)
array = spdl.io.to_numpy(buffer)
array.shape # (1, 80000)
array.dtype # int16

.. admonition:: Example - Customizing the video output format
:class: note


.. code-block::

buffer: CPUBuffer = spdl.io.load_video(
"my_video.mp4",
filter_desc=spdl.io.get_video_filter_desc(
frame_rate=30, # Change the frame rate by duplicating or culling frames.
scale_width=256, # Rescale to 256x256 using bicubic interpolation.
scale_height=256,
scale_algo='bicubic',
scale_mode='pad', # Pad the image if the aspect ratios of the
# original and the rescaled resolutions
# do not match.
crop_width=128, # Perform cropping.
crop_height=128,
num_frames=10, # Stop decoding after 10 frames.
# If the video is shorter than 10 frames,
# insert extra frames.
pad_mode="black" # Make the extra frames black.
)
)
array = spdl.io.to_numpy(buffer)
array.shape # (10, 128, 128, 3)
array.dtype # uint8
120 changes: 120 additions & 0 deletions docs/source/io/decoding_overview.rst
@@ -0,0 +1,120 @@
How media is processed
======================

In the previous section, we looked at how to load multimedia data into array format
and customize its output format.
The output format is customized after the media is decoded.
In the next section, we will look into ways to customize the decoding process.

Before looking into the customization APIs,
let us look at the processes that media data goes through,
so that we have a good understanding of how and what to customize.

These processes are generic and not specific to SPDL.
The underlying implementation is provided by FFmpeg,
which is the industry-standard tool for processing media in
a streaming fashion.

Decoding media comprises two mandatory steps (demuxing and decoding) and
optional filtering steps.
Frames are allocated on the heap during decoding.
They are then combined into one contiguous memory region to form an array.
We call this process the buffer conversion.

The following diagram illustrates this.

.. mermaid::

flowchart TD
subgraph Demuxing
direction LR
b(Byte String) --> |Demuxing| p1(Packet)
p1 --> |Bitstream Filtering
#40;Optional#41;| p2(Packet)
end
subgraph Decoding
direction LR
p3(Packet) --> |Decoding| f1(Frame)
f1 --> |Filtering| f2(Frame)

end
subgraph c["Buffer Conversion"]
subgraph ff[" "]
f3(Frame)
f4(Frame)
f5(Frame)
end
ff --> |Buffer Conversion| b2(Buffer)
end
Demuxing --> Decoding --> c


Demuxing
~~~~~~~~

`Demuxing <https://en.wikipedia.org/wiki/Demultiplexer_(media_file)>`_
(short for demultiplexing) is the process of splitting the input data into smaller chunks.

For example, a video file usually has multiple media streams (typically one audio and one video), and
they are stored as a series of data units called "packets".

Demuxing is the process of finding the data boundaries and extracting these packets one by one.

.. mermaid::

block-beta
columns 1
b["0101010101100101...................................."]
space
block:demuxed
p0[["Header"]]
p1(["Audio 0"])
p2["Video 0"]
p3["Video 1"]
p4["Video 2"]
p5["Video 3"]
p6(["Audio 1"])
p7["Video 4"]
p8["Video 5"]
end
b-- "demuxing" -->demuxed

Decoding
~~~~~~~~

Multimedia files are usually encoded to reduce the file size.
Decoding is the process of recovering the media from the encoded data.

The decoded data are called frames, and they contain waveform samples (audio)
or image samples (image/video).

Buffer Conversion
~~~~~~~~~~~~~~~~~

Buffer conversion is the step that merges multiple frames into one
contiguous memory region so that they can be handled as array data.
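
These steps map directly onto SPDL's low-level APIs. A minimal sketch,
using the functions that appear elsewhere in these documents:

.. code-block::

import spdl.io

# Demuxing: extract the video packets from the container.
packets = spdl.io.demux_video("my_video.mp4")

# Decoding: decode the packets into frames (filtering also happens here).
frames = spdl.io.decode_packets(packets)

# Buffer conversion: merge the frames into one contiguous buffer.
buffer = spdl.io.convert_frames(frames)
array = spdl.io.to_numpy(buffer)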

Bitstream Filtering
~~~~~~~~~~~~~~~~~~~

Bitstream filtering is a process that modifies packets.
You can refer to the
`FFmpeg Bitstream Filters Documentation <https://ffmpeg.org/ffmpeg-bitstream-filters.html>`_
for the available operations.

In SPDL, the most relevant operations are ``h264_mp4toannexb`` and
``hevc_mp4toannexb``, which are necessary when using GPU video decoding.
See :ref:`gpu-video-decoder`.
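
For illustration, the same repackaging can be performed with the FFmpeg
command-line tool. This is plain FFmpeg, not an SPDL API; the sketch invokes
it from Python:

.. code-block::

import subprocess

# Repackage H.264 video from MP4 framing to Annex B framing
# without re-encoding the stream.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "copy",
     "-bsf:v", "h264_mp4toannexb", "output.h264"],
    check=True,
)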

Filtering
~~~~~~~~~

Frame filtering is a versatile process
that can apply many different operations.

Please refer to the `FFmpeg Filters Documentation <https://ffmpeg.org/ffmpeg-filters.html>`_
for the available filters.

By default, filtering is used to change the output format.
You can also apply augmentation using filters.
See :ref:`augmentation` for details.
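
For example, a standard FFmpeg filter such as ``hue=s=0`` (which desaturates
frames to grayscale) can be embedded in the filter description. A sketch
reusing the APIs shown earlier, assuming ``packets`` was obtained from a
demuxer:

.. code-block::

frames = spdl.io.decode_packets(
    packets,
    filter_desc=spdl.io.get_video_filter_desc(
        filter_desc="hue=s=0", pix_fmt="rgb24"
    ),
)
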
4 changes: 4 additions & 0 deletions docs/source/io/gpu_decoder.rst
@@ -0,0 +1,4 @@
.. _gpu-video-decoder:

GPU Video Decoder
=================