Commit 15fc2f6

Add doc for IO

1 parent 530d244 commit 15fc2f6

File tree

12 files changed: +560 −14 lines

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ Please use the following BibTex for citing our project if you find it useful.
    installation
    getting_started/index
    async/index
 +  io/index
    optimization_guide/index
    case_studies/index
    migration/index

docs/source/io/advanced.rst

Lines changed: 2 additions & 0 deletions
Customizing Decoding Process
============================

docs/source/io/augmentation.rst

Lines changed: 134 additions & 0 deletions
.. _augmentation:

Using filters for augmentation
==============================

When using FFmpeg-based decoders, FFmpeg can also apply preprocessing via
`filters <https://ffmpeg.org/ffmpeg-filters.html>`_.

Client code can pass a custom filter string to the decoding functions.
There are helper functions to facilitate the construction of filter expressions
for common use cases.

- :py:func:`spdl.io.get_audio_filter_desc`
- :py:func:`spdl.io.get_video_filter_desc`
- :py:func:`spdl.io.get_filter_desc`

.. note::

   For image/video data, it is often more efficient to apply augmentation
   before converting the frames to RGB format
   (i.e. rather than applying the same augmentation on an RGB array).
   Many image/video files are encoded in YUV420 format, which has half the
   number of pixels compared to RGB24.
   The :py:func:`~spdl.io.get_video_filter_desc` function performs custom
   filtering before changing the pixel format.

Custom Filtering
----------------

A filter description is a simple ``str`` object.
You can write a custom filter description yourself, or
pass a partial filter description to the helper functions above.

.. note::

   Filtering is also used for trimming packets to a user-specified
   timestamp.
   When demuxing and decoding audio/video for a specific time window
   (i.e. using the ``timestamp`` option of ``get_audio_filter_desc`` /
   ``get_video_filter_desc``),
   the packets returned by demuxers contain frames outside of the window,
   because they are necessary to correctly decode the frames within it.
   Decoding these packets also produces frames outside of the window.
   Filters such as ``trim`` and ``atrim`` are used to remove these frames.

   Therefore, when you create a custom filter for audio/video,
   make sure that the resulting filter removes frames outside of the
   specified window.

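To illustrate, here is a minimal sketch of composing such a filter chain. It is plain string manipulation with no SPDL calls; the helper name ``video_filter_with_trim`` is hypothetical, not an SPDL API. It appends ``trim`` to drop frames outside the window, followed by ``setpts`` to reset the timestamps of the remaining frames.

```python
def video_filter_with_trim(custom_filters: list, start: float, end: float) -> str:
    """Compose a video filter description that drops frames outside [start, end)."""
    parts = list(custom_filters)
    # Keep only frames whose timestamps fall inside the window.
    parts.append(f"trim=start={start}:end={end}")
    # Reset timestamps so the output starts at PTS 0.
    parts.append("setpts=PTS-STARTPTS")
    return ",".join(parts)


desc = video_filter_with_trim(["hflip"], 1.0, 3.5)
# desc == "hflip,trim=start=1.0:end=3.5,setpts=PTS-STARTPTS"
```

For audio, the analogous filters are ``atrim`` and ``asetpts``.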
Image augmentation
------------------

Using filters like ``hflip``, ``vflip``, ``rotate``, ``scale`` and ``crop``,
we can compose an augmentation pipeline.

For the details of each filter, please refer to
`the filter documentation <https://ffmpeg.org/ffmpeg-filters.html>`_.

.. code-block::

   >>> import random
   >>>
   >>> def random_augmentation():
   ...     filters = []
   ...
   ...     # random_hflip
   ...     if bool(random.getrandbits(1)):
   ...         filters.append("hflip")
   ...
   ...     # random_vflip
   ...     if bool(random.getrandbits(1)):
   ...         filters.append("vflip")
   ...
   ...     # random_rotate +/- 30 deg
   ...     angle = (60 * random.random() - 30) * 3.14 / 180
   ...     filters.append(f"rotate=angle={angle:.2f}")
   ...
   ...     # resize
   ...     filters.append("scale=256:256")
   ...
   ...     # random_crop
   ...     x_pos, y_pos = random.random(), random.random()
   ...     filters.append(f"crop=224:224:x={x_pos:.2f}*(iw-ow):y={y_pos:.2f}*(ih-oh)")
   ...
   ...     return ",".join(filters)


   >>> def load_with_augmentation(src):
   ...     packets = spdl.io.demux_image(src)
   ...
   ...     filter_desc = random_augmentation()
   ...     frames = spdl.io.decode_packets(
   ...         packets,
   ...         filter_desc=spdl.io.get_video_filter_desc(
   ...             filter_desc=filter_desc, pix_fmt="rgb24"
   ...         ),
   ...     )
   ...
   ...     return spdl.io.convert_frames(frames)

This generates filter descriptions like the following:

.. code-block::

   "hflip,rotate=angle=-0.05,scale=256:256,crop=224:224:x=0.18*(iw-ow):y=0.17*(ih-oh)"
   "hflip,vflip,rotate=angle=-0.37,scale=256:256,crop=224:224:x=0.09*(iw-ow):y=0.96*(ih-oh)"
   "rotate=angle=0.33,scale=256:256,crop=224:224:x=0.58*(iw-ow):y=0.57*(ih-oh)"
   "hflip,vflip,rotate=angle=0.30,scale=256:256,crop=224:224:x=0.80*(iw-ow):y=0.35*(ih-oh)"
   "hflip,vflip,rotate=angle=0.02,scale=256:256,crop=224:224:x=0.01*(iw-ow):y=0.25*(ih-oh)"
   "vflip,rotate=angle=0.35,scale=256:256,crop=224:224:x=0.42*(iw-ow):y=0.69*(ih-oh)"
   "hflip,rotate=angle=0.22,scale=256:256,crop=224:224:x=0.10*(iw-ow):y=0.03*(ih-oh)"
   "hflip,rotate=angle=-0.18,scale=256:256,crop=224:224:x=0.65*(iw-ow):y=0.31*(ih-oh)"
   "rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.37*(iw-ow):y=0.75*(ih-oh)"
   "hflip,vflip,rotate=angle=0.01,scale=256:256,crop=224:224:x=0.27*(iw-ow):y=0.84*(ih-oh)"
   "hflip,rotate=angle=-0.31,scale=256:256,crop=224:224:x=0.43*(iw-ow):y=0.92*(ih-oh)"
   "hflip,rotate=angle=-0.27,scale=256:256,crop=224:224:x=0.96*(iw-ow):y=0.92*(ih-oh)"
   "vflip,rotate=angle=-0.28,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.04*(ih-oh)"
   "hflip,vflip,rotate=angle=0.08,scale=256:256,crop=224:224:x=0.84*(iw-ow):y=0.57*(ih-oh)"
   "hflip,vflip,rotate=angle=0.41,scale=256:256,crop=224:224:x=0.24*(iw-ow):y=0.92*(ih-oh)"
   "hflip,rotate=angle=-0.02,scale=256:256,crop=224:224:x=0.47*(iw-ow):y=0.87*(ih-oh)"
   "hflip,rotate=angle=-0.15,scale=256:256,crop=224:224:x=0.73*(iw-ow):y=0.30*(ih-oh)"
   "vflip,rotate=angle=-0.13,scale=256:256,crop=224:224:x=0.91*(iw-ow):y=0.85*(ih-oh)"
   "vflip,rotate=angle=0.28,scale=256:256,crop=224:224:x=0.62*(iw-ow):y=0.02*(ih-oh)"
   "rotate=angle=0.24,scale=256:256,crop=224:224:x=0.85*(iw-ow):y=0.61*(ih-oh)"
   "vflip,rotate=angle=-0.52,scale=256:256,crop=224:224:x=0.61*(iw-ow):y=0.59*(ih-oh)"
   "vflip,rotate=angle=0.06,scale=256:256,crop=224:224:x=0.08*(iw-ow):y=0.04*(ih-oh)"
   "hflip,rotate=angle=0.50,scale=256:256,crop=224:224:x=0.23*(iw-ow):y=0.42*(ih-oh)"
   "vflip,rotate=angle=0.18,scale=256:256,crop=224:224:x=0.54*(iw-ow):y=0.34*(ih-oh)"

and here are the resulting images.

.. image:: ../../_static/data/io_preprocessing_random_aug.png

docs/source/io/basic.rst

Lines changed: 116 additions & 0 deletions
Loading media as array
======================

To simply load audio/video/image data from a file or an in-memory buffer, you can use the following functions.

- :py:func:`spdl.io.load_audio`
- :py:func:`spdl.io.load_video`
- :py:func:`spdl.io.load_image`

They return an object of :py:class:`spdl.io.CPUBuffer` or :py:class:`spdl.io.CUDABuffer`.
The buffer object holds the decoded data in contiguous memory.
It implements `the array interface protocol <https://numpy.org/doc/stable/reference/arrays.interface.html>`_ or `the CUDA array interface <https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html>`_,
which allows converting the buffer object into commonly used array classes without
making a copy.

SPDL has the following functions to cast the buffer to framework-specific classes.

- :py:func:`spdl.io.to_numpy` †
- :py:func:`spdl.io.to_torch`
- :py:func:`spdl.io.to_jax` †
- :py:func:`spdl.io.to_numba`

† :py:func:`~spdl.io.to_numpy` and :py:func:`~spdl.io.to_jax` support only CPU arrays.

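To see why no copy is needed, here is a minimal sketch of the array interface protocol itself. ``FakeBuffer`` is a stand-in for illustration, not an SPDL class; any object exposing ``__array_interface__`` can be wrapped by NumPy, which shares the underlying memory rather than copying it.

```python
import numpy as np

class FakeBuffer:
    """Stand-in for a buffer object exposing the array interface protocol."""
    def __init__(self):
        self._data = bytearray(12)  # backing storage for 3 uint32 values
        self.__array_interface__ = {
            "shape": (3,),
            "typestr": "<u4",    # little-endian uint32
            "data": self._data,  # any object exposing the buffer protocol
            "version": 3,
        }

buf = FakeBuffer()
arr = np.asarray(buf)  # wraps the existing memory, no copy
arr[0] = 42            # writes through to the backing storage
```

The framework-specific helpers listed above perform the equivalent wrapping for their respective array classes.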
.. admonition:: Example - Loading audio from bytes
   :class: note

   .. code-block::

      data: bytes = download_from_remote_storage("my_audio.mp3")
      buffer: CPUBuffer = spdl.io.load_audio(data)

      # Cast to NumPy NDArray
      array = spdl.io.to_numpy(buffer)  # shape: (time, channel)

.. admonition:: Example - Loading image from file
   :class: note

   .. code-block::

      buffer: CPUBuffer = spdl.io.load_image("my_image.jpg")

      # Cast to PyTorch tensor
      tensor = spdl.io.to_torch(buffer)  # shape: (height, width, channel=RGB)

.. admonition:: Example - Loading video from remote
   :class: note

   .. code-block::

      buffer: CPUBuffer = spdl.io.load_video("https://example.com/my_video.mp4")

      # Cast to PyTorch tensor
      tensor = spdl.io.to_torch(buffer)  # shape: (time, height, width, channel=RGB)

By default, image/video data is converted to interleaved RGB format (i.e. ``NHWC`` where ``C=3``),
and audio is converted to 32-bit floating-point samples with interleaved channels (i.e. channel-last and ``dtype=float32``).

To change the output format, you can customize the conversion behavior by providing
a custom ``filter_desc`` value.

You can use :py:func:`spdl.io.get_audio_filter_desc` and
:py:func:`spdl.io.get_video_filter_desc` (for image and video) to construct
a filter description.

.. admonition:: Example - Customizing the audio output format
   :class: note

   The following code snippet shows how to decode audio into
   16 kHz, monaural, 16-bit signed integer samples in planar format (i.e. channel-first).
   It also fixes the duration to 5 seconds (80,000 samples) by truncating the residual
   or padding silence at the end.

   .. code-block::

      buffer: CPUBuffer = spdl.io.load_audio(
          "my_audio.wav",
          filter_desc=spdl.io.get_audio_filter_desc(
              sample_rate=16_000,
              num_channels=1,
              sample_fmt="s16p",  # signed 16-bit, planar format
              num_frames=80_000,  # 5 seconds
          )
      )
      array = spdl.io.to_numpy(buffer)
      array.shape  # (1, 80000)
      array.dtype  # int16

.. admonition:: Example - Customizing the video output format
   :class: note

   .. code-block::

      buffer: CPUBuffer = spdl.io.load_video(
          "my_video.mp4",
          filter_desc=spdl.io.get_video_filter_desc(
              frame_rate=30,       # Change the frame rate by duplicating
                                   # or culling frames.
              scale_width=256,     # Rescale the image to 256x256
              scale_height=256,    # with bicubic interpolation.
              scale_algo='bicubic',
              scale_mode='pad',    # Pad the image if the aspect ratios of the
                                   # original and rescaled resolutions
                                   # do not match.
              crop_width=128,      # Perform cropping.
              crop_height=128,
              num_frames=10,       # Stop decoding after 10 frames.
                                   # If the video is shorter than 10 frames,
                                   # insert extra frames.
              pad_mode="black"     # Make the extra frames black.
          )
      )
      array = spdl.io.to_numpy(buffer)
      array.shape  # (10, 128, 128, 3)
      array.dtype  # uint8
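The default ``NHWC`` layout can be rearranged without copying when a framework expects channel-first (``NCHW``) data. A minimal NumPy sketch, where the zero-filled array stands in for a decoded video buffer:

```python
import numpy as np

# Stand-in for a decoded video buffer: (time, height, width, channel).
nhwc = np.zeros((10, 128, 128, 3), dtype=np.uint8)

# Move channels ahead of height/width: (time, channel, height, width).
# np.transpose returns a view, so no pixel data is copied.
nchw = np.transpose(nhwc, (0, 3, 1, 2))
```

The same permutation applies after ``spdl.io.to_torch`` via the framework's own transpose/permute functions.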
Lines changed: 120 additions & 0 deletions

How media is processed
======================

In the previous section, we looked at how to load multi-media data into array format
and customize its output format.
The output format is customized after the media is decoded.
In the next section, we will look into ways to customize the decoding process.

Before looking into the customization APIs,
let's look at the processes that media data goes through,
so that we have a good understanding of how and what to customize.

These processes are generic and not specific to SPDL.
The underlying implementation is provided by FFmpeg,
which is the industry-standard tool for processing media in
a streaming fashion.

Decoding media comprises two mandatory steps (demuxing and decoding) and
optional filtering steps.
Frames are allocated on the heap when decoding.
They are combined into one contiguous memory region to form an array.
We call this process buffer conversion.

The following diagram illustrates this.

.. mermaid::

   flowchart TD
       subgraph Demuxing
           direction LR
           b(Byte String) --> |Demuxing| p1(Packet)
           p1 --> |Bitstream Filtering
           &#40Optional&#41| p2(Packet)
       end
       subgraph Decoding
           direction LR
           p3(Packet) --> |Decoding| f1(Frame)
           f1 --> |Filtering| f2(Frame)
       end
       subgraph c["Buffer Conversion"]
           subgraph ff[" "]
               f3(Frame)
               f4(Frame)
               f5(Frame)
           end
           ff --> |Buffer Conversion| b2(Buffer)
       end
       Demuxing --> Decoding --> c

Demuxing
~~~~~~~~

`Demuxing <https://en.wikipedia.org/wiki/Demultiplexer_(media_file)>`_
(short for demultiplexing) is the process of splitting the input data into smaller chunks.

For example, a video file contains multiple media streams (typically one audio and one video),
and they are stored as a series of data units called "packets".

Demuxing is the process of finding the data boundaries and extracting these packets one by one.

.. mermaid::

   block-beta
       columns 1
       b["0101010101100101...................................."]
       space
       block:demuxed
           p0[["Header"]]
           p1(["Audio 0"])
           p2["Video 0"]
           p3["Video 1"]
           p4["Video 2"]
           p5["Video 3"]
           p6(["Audio 1"])
           p7["Video 4"]
           p8["Video 5"]
       end
       b-- "demuxing" -->demuxed

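As a toy illustration of the idea (not how real containers work — actual formats locate packets via headers and indexes rather than marker bytes), demuxing can be pictured as splitting a byte string into packets:

```python
def demux(data: bytes, marker: bytes = b"|") -> list:
    """Toy demuxer: split a byte string at packet boundaries."""
    return [packet for packet in data.split(marker) if packet]


# A pretend container: packets separated by a marker byte.
stream = b"|Audio 0|Video 0|Video 1|Audio 1"
packets = demux(stream)
# packets == [b"Audio 0", b"Video 0", b"Video 1", b"Audio 1"]
```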
Decoding
~~~~~~~~

Multi-media files are usually encoded to reduce the file size.
Decoding is the process of recovering the media from the encoded data.

The decoded data are called frames, and they contain waveform samples (audio)
or image samples (image/video).

Buffer Conversion
~~~~~~~~~~~~~~~~~

Buffer conversion is the step that merges multiple frames into one
contiguous memory region so that we can handle them as array data.

Bitstream Filtering
~~~~~~~~~~~~~~~~~~~

Bitstream filtering is a process that modifies packets.
You can refer to the
`FFmpeg Bitstream Filters Documentation <https://ffmpeg.org/ffmpeg-bitstream-filters.html>`_
for the available operations.

In SPDL, the most relevant operations are ``h264_mp4toannexb`` and
``hevc_mp4toannexb``, which are necessary when using GPU video decoding.
See :ref:`gpu-video-decoder`.

Filtering
~~~~~~~~~

Frame filtering is a versatile process.
It can apply many different operations.

Please refer to the `FFmpeg Filters Documentation <https://ffmpeg.org/ffmpeg-filters.html>`_
for the available filters.

By default, filtering is used to change the output format.
You can also apply augmentation using filters.
See :ref:`augmentation` for the details.
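For instance, the default RGB conversion corresponds to a filter description ending in FFmpeg's ``format`` filter. A hedged sketch of composing one — the exact string SPDL generates may differ, and ``default_video_filter`` is a hypothetical helper, not an SPDL API:

```python
def default_video_filter(pix_fmt: str = "rgb24", extra: str = "") -> str:
    """Compose a filter description ending in a pixel-format conversion."""
    parts = []
    if extra:
        parts.append(extra)  # user-supplied filtering/augmentation runs first
    parts.append(f"format=pix_fmts={pix_fmt}")  # convert pixel format last
    return ",".join(parts)


default_video_filter()
# "format=pix_fmts=rgb24"
default_video_filter(extra="hflip")
# "hflip,format=pix_fmts=rgb24"
```

Running custom filters before the ``format`` stage is what makes YUV-space augmentation (discussed in :ref:`augmentation`) cheaper than RGB-space augmentation.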

docs/source/io/gpu_decoder.rst

Lines changed: 4 additions & 0 deletions
.. _gpu-video-decoder:

GPU Video Decoder
=================
