1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -45,6 +45,7 @@ Please use the following BibTeX for citing our project if you find it useful.
installation
getting_started/index
async/index
io/index
optimization_guide/index
case_studies/index
migration/index
2 changes: 2 additions & 0 deletions docs/source/io/advanced.rst
@@ -0,0 +1,2 @@
Customizing Decoding Process
============================
134 changes: 134 additions & 0 deletions docs/source/io/augmentation.rst
@@ -0,0 +1,134 @@
.. _augmentation:

Using filters for augmentation
==============================

When using FFmpeg-based decoders, FFmpeg can also apply preprocessing via
`filters <https://ffmpeg.org/ffmpeg-filters.html>`_.

Client code can pass a custom filter string to the decoding functions.
There are helper functions that facilitate the construction of filter expressions
for common use cases.

- :py:func:`spdl.io.get_audio_filter_desc`
- :py:func:`spdl.io.get_video_filter_desc`
- :py:func:`spdl.io.get_filter_desc`
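
For example, the following sketch builds a filter description from keyword
arguments. (The parameters shown here also appear in later examples in this
document; the exact string returned depends on the arguments and the SPDL
version.)

.. code-block::

>>> import spdl.io
>>>
>>> # Build a filter description that rescales frames to 256x256
>>> # and converts the pixel format to interleaved RGB.
>>> filter_desc = spdl.io.get_video_filter_desc(
...     scale_width=256,
...     scale_height=256,
...     pix_fmt="rgb24",
... )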

.. note::

For image/video data, it is often more efficient to apply augmentation
before converting the frames to RGB format, rather than applying the same
augmentation to an RGB array afterward.

Many image/video files are encoded in YUV420 format, which has half the number
of pixels compared to RGB24.

The :py:func:`~spdl.io.get_video_filter_desc` function applies custom
filtering before changing the pixel format.

Custom Filtering
----------------

A filter description is a plain ``str`` object.
You can write a custom filter description yourself, or
pass a partial filter description to the helper functions listed above
(via their ``filter_desc`` argument).

.. note::

Filtering is also used to trim packets to a user-specified timestamp.
When demuxing and decoding audio/video for a specific time window
(i.e. using the ``timestamp`` option with ``get_audio_filter_desc`` /
``get_video_filter_desc``),
the packets returned by demuxers contain frames outside of the window,
because those frames are necessary to correctly decode the frames inside it.
Decoding therefore also produces frames outside of the window.
Filters such as ``trim`` and ``atrim`` are used to remove these frames.

Therefore, when you create a custom filter for audio/video,
make sure that the resulting filter removes frames outside of the specified window.
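
The following is a sketch of a hand-written video filter description for a
hypothetical ``timestamp=(5.0, 10.0)`` window. The leading ``trim`` filter
removes frames outside of the window, mirroring what the helper functions
generate, and ``setpts=PTS-STARTPTS`` (a standard companion of ``trim``)
resets the timestamps afterwards.

.. code-block::

# Keep only frames between 5 and 10 seconds, flip horizontally,
# then convert to interleaved RGB.
filter_desc = "trim=start=5:end=10,setpts=PTS-STARTPTS,hflip,format=pix_fmts=rgb24"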

Image augmentation
------------------

Using filters like ``hflip``, ``vflip``, ``rotate``, ``scale`` and ``crop``,
we can compose an augmentation pipeline.

For the details of each filter, please refer to
`the filter documentation <https://ffmpeg.org/ffmpeg-filters.html>`_.

.. code-block::

>>> import math
>>> import random
>>>
>>> import spdl.io
>>>
>>> def random_augmentation():
...     filters = []
...
...     # random_hflip
...     if bool(random.getrandbits(1)):
...         filters.append("hflip")
...
...     # random_vflip
...     if bool(random.getrandbits(1)):
...         filters.append("vflip")
...
...     # random_rotate +/- 30 deg (the ``rotate`` filter takes radians)
...     angle = math.radians(60 * random.random() - 30)
...     filters.append(f"rotate=angle={angle:.2f}")
...
...     # resize
...     filters.append("scale=256:256")
...
...     # random_crop
...     x_pos, y_pos = random.random(), random.random()
...     filters.append(f"crop=224:224:x={x_pos:.2f}*(iw-ow):y={y_pos:.2f}*(ih-oh)")
...
...     filter_desc = ",".join(filters)
...
...     return filter_desc


>>> def load_with_augmentation(src):
...     packets = spdl.io.demux_image(src)
...
...     filter_desc = random_augmentation()
...     frames = spdl.io.decode_packets(
...         packets,
...         filter_desc=spdl.io.get_video_filter_desc(
...             filter_desc=filter_desc, pix_fmt="rgb24"
...         ),
...     )
...
...     return spdl.io.convert_frames(frames)


This generates filter descriptions like the following:

.. code-block::

"hflip,rotate=angle=-0.05,scale=256:256,crop=224:224:x=0.18*(iw-ow):y=0.17*(ih-oh)"
"hflip,vflip,rotate=angle=-0.37,scale=256:256,crop=224:224:x=0.09*(iw-ow):y=0.96*(ih-oh)"
"rotate=angle=0.33,scale=256:256,crop=224:224:x=0.58*(iw-ow):y=0.57*(ih-oh)"
"hflip,vflip,rotate=angle=0.30,scale=256:256,crop=224:224:x=0.80*(iw-ow):y=0.35*(ih-oh)"
"hflip,vflip,rotate=angle=0.02,scale=256:256,crop=224:224:x=0.01*(iw-ow):y=0.25*(ih-oh)"
"vflip,rotate=angle=0.35,scale=256:256,crop=224:224:x=0.42*(iw-ow):y=0.69*(ih-oh)"
"hflip,rotate=angle=0.22,scale=256:256,crop=224:224:x=0.10*(iw-ow):y=0.03*(ih-oh)"
"hflip,rotate=angle=-0.18,scale=256:256,crop=224:224:x=0.65*(iw-ow):y=0.31*(ih-oh)"
"rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.37*(iw-ow):y=0.75*(ih-oh)"
"hflip,vflip,rotate=angle=0.01,scale=256:256,crop=224:224:x=0.27*(iw-ow):y=0.84*(ih-oh)"
"hflip,rotate=angle=-0.31,scale=256:256,crop=224:224:x=0.43*(iw-ow):y=0.92*(ih-oh)"
"hflip,rotate=angle=-0.27,scale=256:256,crop=224:224:x=0.96*(iw-ow):y=0.92*(ih-oh)"
"vflip,rotate=angle=-0.28,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.04*(ih-oh)"
"hflip,vflip,rotate=angle=0.08,scale=256:256,crop=224:224:x=0.84*(iw-ow):y=0.57*(ih-oh)"
"hflip,vflip,rotate=angle=0.41,scale=256:256,crop=224:224:x=0.24*(iw-ow):y=0.92*(ih-oh)"
"hflip,rotate=angle=-0.02,scale=256:256,crop=224:224:x=0.47*(iw-ow):y=0.87*(ih-oh)"
"hflip,rotate=angle=-0.15,scale=256:256,crop=224:224:x=0.73*(iw-ow):y=0.30*(ih-oh)"
"vflip,rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.91*(iw-ow):y=0.85*(ih-oh)"
"vflip,rotate=angle=0.28,scale=256:256,crop=224:224:x=0.62*(iw-ow):y=0.02*(ih-oh)"
"rotate=angle=0.24,scale=256:256,crop=224:224:x=0.85*(iw-ow):y=0.61*(ih-oh)"
"vflip,rotate=angle=-0.52,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.59*(ih-oh)"
"vflip,rotate=angle=0.06,scale=256:256,crop=224:224:x=0.08*(iw-ow):y=0.04*(ih-oh)"
"hflip,rotate=angle=0.50,scale=256:256,crop=224:224:x=0.23*(iw-ow):y=0.42*(ih-oh)"
"vflip,rotate=angle=0.18,scale=256:256,crop=224:224:x=0.54*(iw-ow):y=0.34*(ih-oh)"

Here are the resulting images.

.. image:: ../../_static/data/io_preprocessing_random_aug.png

116 changes: 116 additions & 0 deletions docs/source/io/basic.rst
@@ -0,0 +1,116 @@
Loading media as arrays
=======================

To simply load audio/video/image data from a file or in-memory buffer, you can use the following functions.

- :py:func:`spdl.io.load_audio`
- :py:func:`spdl.io.load_video`
- :py:func:`spdl.io.load_image`

They return an object of :py:class:`spdl.io.CPUBuffer` or :py:class:`spdl.io.CUDABuffer`.
The buffer object contains the decoded data as contiguous memory.
It implements `the array interface protocol <https://numpy.org/doc/stable/reference/arrays.interface.html>`_ or `the CUDA array interface <https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html>`_,
which allows the buffer to be converted into commonly used array classes without
making a copy.
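
Because a CPU buffer implements the array interface protocol, libraries that
understand the protocol can consume it directly. A minimal sketch, assuming a
decoded CPU buffer:

.. code-block::

import numpy as np
import spdl.io

buffer = spdl.io.load_image("my_image.jpg")

# ``np.asarray`` reads ``__array_interface__`` and wraps the
# buffer's memory without copying it.
array = np.asarray(buffer)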

SPDL has the following functions to cast the buffer to framework-specific classes.

- :py:func:`spdl.io.to_numpy` †
- :py:func:`spdl.io.to_torch`
- :py:func:`spdl.io.to_jax` †
- :py:func:`spdl.io.to_numba`

† :py:func:`~spdl.io.to_numpy` and :py:func:`~spdl.io.to_jax` support only CPU arrays.

.. admonition:: Example - Loading audio from bytes
:class: note

.. code-block::

data: bytes = download_from_remote_storage("my_audio.mp3")
buffer: CPUBuffer = spdl.io.load_audio(data)

# Cast to NumPy NDArray
array = spdl.io.to_numpy(buffer) # shape: (time, channel)

.. admonition:: Example - Loading image from file
:class: note

.. code-block::

buffer: CPUBuffer = spdl.io.load_image("my_image.jpg")

# Cast to PyTorch tensor
tensor = spdl.io.to_torch(buffer) # shape: (height, width, channel=RGB)

.. admonition:: Example - Loading video from remote
:class: note

.. code-block::

buffer: CPUBuffer = spdl.io.load_video("https://example.com/my_video.mp4")

# Cast to PyTorch tensor
tensor = spdl.io.to_torch(buffer) # shape: (time, height, width, channel=RGB)


By default, image/video are converted to interleaved RGB format (i.e. ``NHWC`` where ``C=3``),
and audio is converted to 32-bit floating-point samples with interleaved channels (i.e. channel-last and ``dtype=float32``).

To change the output format, you can customize the conversion behavior by providing
a custom ``filter_desc`` value.

You can use :py:func:`spdl.io.get_audio_filter_desc` and
:py:func:`spdl.io.get_video_filter_desc` (for image and video) to construct
a filter description.

.. admonition:: Example - Customizing audio output format
:class: note

The following code snippet shows how to decode audio into
16 kHz, monaural, 16-bit signed integer in planar format (i.e. channel-first).
It also fixes the duration to 5 seconds (80,000 samples) by truncating the
excess or padding silence at the end.

.. code-block::

buffer: CPUBuffer = spdl.io.load_audio(
"my_audio.wav",
filter_desc=spdl.io.get_audio_filter_desc(
sample_rate=16_000,
num_channels=1,
sample_fmt="s16p", # signed 16-bit, planar format
num_frames=80_000, # 5 seconds
)
)
array = spdl.io.to_numpy(buffer)
array.shape # (1, 80000)
array.dtype # int16

.. admonition:: Example - Customizing the video output format
:class: note


.. code-block::

buffer: CPUBuffer = spdl.io.load_video(
"my_video.mp4",
filter_desc=spdl.io.get_video_filter_desc(
frame_rate=30, # Change the frame rate by duplicating or culling frames.
scale_width=256, # Rescale to 256x256 using bicubic interpolation.
scale_height=256,
scale_algo='bicubic',
scale_mode='pad', # Pad the image if the aspect ratios of the
# original and the rescaled resolutions
# do not match.
crop_width=128, # Perform cropping.
crop_height=128,
num_frames=10, # Stop decoding after 10 frames.
# If the video is shorter than 10 frames,
# insert extra frames.
pad_mode="black" # Make the extra frames black.
)
)
array = spdl.io.to_numpy(buffer)
array.shape # (10, 128, 128, 3)
array.dtype # uint8
120 changes: 120 additions & 0 deletions docs/source/io/decoding_overview.rst
@@ -0,0 +1,120 @@
How media is processed
======================

In the previous section, we looked at how to load multimedia data into array format
and customize its output format.
The output format is customized after the media is decoded.
In the next section, we will look into ways to customize the decoding process.

Before looking into the customization APIs,
let us look at the processes that media data goes through,
so that we have a good understanding of how and what to customize.

These processes are generic and not specific to SPDL.
The underlying implementation is provided by FFmpeg,
which is the industry-standard tool for processing media in
a streaming fashion.

Decoding media comprises two mandatory steps (demuxing and decoding) and
optional filtering steps.
Frames are allocated on the heap during decoding.
They are then combined into one contiguous memory region to form an array.
We call this process the buffer conversion.

The following diagram illustrates this.

.. mermaid::

flowchart TD
subgraph Demuxing
direction LR
b(Byte String) --> |Demuxing| p1(Packet)
p1 --> |Bitstream Filtering
#40;Optional#41;| p2(Packet)
end
subgraph Decoding
direction LR
p3(Packet) --> |Decoding| f1(Frame)
f1 --> |Filtering| f2(Frame)

end
subgraph c["Buffer Conversion"]
subgraph ff[" "]
f3(Frame)
f4(Frame)
f5(Frame)
end
ff --> |Buffer Conversion| b2(Buffer)
end
Demuxing --> Decoding --> c


Demuxing
~~~~~~~~

`Demuxing <https://en.wikipedia.org/wiki/Demultiplexer_(media_file)>`_
(short for demultiplexing) is the process of splitting the input data into smaller chunks.

For example, a video file usually has multiple media streams (typically one audio and one video), and
they are stored as a series of data units called "packets".

Demuxing is the process of finding the data boundaries and extracting these packets one by one.

.. mermaid::

block-beta
columns 1
b["0101010101100101...................................."]
space
block:demuxed
p0[["Header"]]
p1(["Audio 0"])
p2["Video 0"]
p3["Video 1"]
p4["Video 2"]
p5["Video 3"]
p6(["Audio 1"])
p7["Video 4"]
p8["Video 5"]
end
b-- "demuxing" -->demuxed

Decoding
~~~~~~~~

Multimedia files are usually encoded to reduce the file size.
Decoding is the process of recovering the media from the encoded data.

The decoded data are called frames, and they contain waveform samples (audio)
or image samples (image/video).

Buffer Conversion
~~~~~~~~~~~~~~~~~

Buffer conversion is the step that merges multiple frames into one
contiguous memory region so that they can be handled as array data.
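
These steps map directly onto SPDL's low-level APIs. A minimal sketch,
using the functions that appear elsewhere in these documents:

.. code-block::

import spdl.io

# Demuxing: extract the video packets from the container.
packets = spdl.io.demux_video("my_video.mp4")

# Decoding: decode the packets into frames (filtering also happens here).
frames = spdl.io.decode_packets(packets)

# Buffer conversion: merge the frames into one contiguous buffer.
buffer = spdl.io.convert_frames(frames)
array = spdl.io.to_numpy(buffer)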

Bitstream Filtering
~~~~~~~~~~~~~~~~~~~

Bitstream filtering is a process that modifies packets.
You can refer to the
`FFmpeg Bitstream Filters Documentation <https://ffmpeg.org/ffmpeg-bitstream-filters.html>`_
for the available operations.

In SPDL, the most relevant operations are ``h264_mp4toannexb`` and
``hevc_mp4toannexb``, which are necessary when using GPU video decoding.
See :ref:`gpu-video-decoder`.
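
For illustration, the same repackaging can be performed with the FFmpeg
command-line tool. This is plain FFmpeg, not an SPDL API; the sketch invokes
it from Python:

.. code-block::

import subprocess

# Repackage H.264 video from MP4 framing to Annex B framing
# without re-encoding the stream.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "copy",
     "-bsf:v", "h264_mp4toannexb", "output.h264"],
    check=True,
)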

Filtering
~~~~~~~~~

Frame filtering is a versatile process
that can apply many different operations.

Please refer to the `FFmpeg Filters Documentation <https://ffmpeg.org/ffmpeg-filters.html>`_
for the available filters.

By default, filtering is used to change the output format.
You can also apply augmentation using filters.
See :ref:`augmentation` for details.
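
For example, a standard FFmpeg filter such as ``hue=s=0`` (which desaturates
frames to grayscale) can be embedded in the filter description. A sketch
reusing the APIs shown earlier, assuming ``packets`` was obtained from a
demuxer:

.. code-block::

frames = spdl.io.decode_packets(
    packets,
    filter_desc=spdl.io.get_video_filter_desc(
        filter_desc="hue=s=0", pix_fmt="rgb24"
    ),
)
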
4 changes: 4 additions & 0 deletions docs/source/io/gpu_decoder.rst
@@ -0,0 +1,4 @@
.. _gpu-video-decoder:

GPU Video Decoder
=================