Resize

The resizing feature relies on three subcomponents:

Speaker diarization with Pyannote
Scene change detection with PySceneDetect
Face detection with MTCNN and MediaPipe

Together, these libraries dynamically resize a video so that the frame stays focused on whoever is speaking at any given moment. For a detailed explanation of the algorithm, read here.
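To make "focus on whoever is speaking" concrete, here is a minimal sketch of one way a crop window could be centered on a detected face and kept inside the frame. It is not taken from the library's source: the function name center_crop_on_face and the (x1, y1, x2, y2) box layout are assumptions made for illustration.

```python
def center_crop_on_face(face_box, frame_size, crop_size):
    """Return the top-left (x, y) of a crop window centered on a face.

    face_box   -- (x1, y1, x2, y2) of the detected face, in pixels (assumed layout)
    frame_size -- (width, height) of the original frame
    crop_size  -- (width, height) of the crop window
    """
    x1, y1, x2, y2 = face_box
    frame_w, frame_h = frame_size
    crop_w, crop_h = crop_size

    # Center of the face bounding box.
    face_cx = (x1 + x2) // 2
    face_cy = (y1 + y2) // 2

    # Top-left corner that would center the crop on the face.
    x = face_cx - crop_w // 2
    y = face_cy - crop_h // 2

    # Clamp so the crop window never leaves the frame.
    x = max(0, min(x, frame_w - crop_w))
    y = max(0, min(y, frame_h - crop_h))
    return x, y


# A face near the right edge of a 1920x1080 frame, with a 607x1080 crop.
print(center_crop_on_face((1700, 300, 1860, 500), (1920, 1080), (607, 1080)))
```

The clamping step is what keeps a speaker near the edge of the frame from pushing the crop outside the original video.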

Usage

The following call returns the crop information needed to resize the video.

from clipsai import resize

crops = resize(
    video_file_path="/abs/path/to/video.mp4",
    pyannote_auth_token="pyannote_token",
    aspect_ratio=(9, 16)
)

print("Crops: ", crops.segments)

To resize the video using the returned crops, run the following code.

import clipsai

media_editor = clipsai.MediaEditor()

# Pick ONE of the following, depending on the streams in the file.
# Use this if the file contains a video stream only:
# media_file = clipsai.VideoFile("/abs/path/to/video_only_file.mp4")
# Use this if the file contains both audio and video streams:
media_file = clipsai.AudioVideoFile("/abs/path/to/video.mp4")

resized_video_file = media_editor.resize_video(
    original_video_file=media_file,
    resized_video_file_path="/abs/path/to/resized/video.mp4",  # doesn't exist yet
    width=crops.crop_width,
    height=crops.crop_height,
    segments=crops.to_dict()["segments"],
)

Resize Function

Name
resize
Type
-> Crops
Description
Computes the crops needed to dynamically resize a video to a specified aspect ratio (default 9:16) so that it focuses on the current speaker. The crops are returned as a Crops instance.

Required Parameters

Name
video_file_path
Type
string
Description
Absolute path to the video file to resize.
Name
pyannote_auth_token
Type
string
Description
Authentication token for Pyannote, obtained from HuggingFace.

Optional Parameters

Name
aspect_ratio
Type
tuple[int, int] = (9, 16)
Description
The target aspect ratio (width, height) for the resized video.
Name
min_segment_duration
Type
float = 1.5
Description
The minimum duration, in seconds, for a diarized speaker segment to be considered.
Name
samples_per_segment
Type
int = 13
Description
The number of samples to take per speaker segment for face detection. Reduce this for faster performance, at the cost of accuracy.
Name
face_detect_width
Type
int = 960
Description
The width, in pixels, to which the video is downscaled for face detection. Smaller widths are faster but may be less accurate.
Name
face_detect_margin
Type
int = 20
Description
Margin around detected faces, used in the MTCNN face detector.
Name
face_detect_post_process
Type
bool = False
Description
If set to True, post-processing is applied to the face detection output to make it appear more natural.
Name
n_face_detect_batches
Type
int = 8
Description
Number of batches used to process face detection on GPUs. Batching is important for proper memory allocation.
Name
min_scene_duration
Type
float = 0.25
Description
Minimum duration, in seconds, for a scene to be considered during scene detection.
Name
scene_merge_threshold
Type
float = 0.25
Description
Threshold, in seconds, for merging scene changes with speaker segments.
Name
time_precision
Type
int = 6
Description
Precision (number of decimal places) for the start and end times of the segments. Fewer than 4 decimal places may result in rounding errors when cropping the video with ffmpeg.
Name
device
Type
string ("cuda" | "cpu") = None
Description
PyTorch device to perform computations on. The default, None, auto-detects the correct device.
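
As a rough illustration of the time_precision note above, the sketch below estimates how large the worst-case rounding error is relative to a single frame. The helper names and the 30 fps frame rate are assumptions chosen for the example; nothing here is taken from the library or from ffmpeg.

```python
def max_rounding_error(precision: int) -> float:
    """Worst-case error, in seconds, from rounding a time to `precision` decimals."""
    return 0.5 * 10 ** -precision


def error_in_frames(precision: int, fps: float) -> float:
    """Express the worst-case rounding error as a fraction of one frame."""
    frame_duration = 1.0 / fps
    return max_rounding_error(precision) / frame_duration


fps = 30.0  # assumed frame rate, for illustration only
for precision in (2, 4, 6):
    # Print the precision, a sample rounded timestamp, and the error in frames.
    print(precision, round(10 / 3, precision), error_in_frames(precision, fps))
```

At two decimal places the error can reach a noticeable fraction of a frame, while at six it is negligible, which is consistent with the advice to stay at four or more.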

Crops Class

Represents the resizing information for an entire video: the video's original width and height, the resized width and height, and the segments used to focus on the current speaker. Each segment covers an interval of time and gives the x-y coordinate of the top-left corner of a rectangle, crop_width by crop_height pixels, that frames the current speaker.
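
The relationship between the original dimensions, the target aspect ratio, and the crop dimensions can be sketched as follows. This is an assumed calculation (the largest window of the target ratio that fits inside the original frame, rounded down); the library may round or constrain the result differently, and crop_dimensions is a hypothetical helper name.

```python
def crop_dimensions(original_width, original_height, aspect_ratio=(9, 16)):
    """Largest (width, height) with the given aspect ratio that fits in the frame."""
    ratio_w, ratio_h = aspect_ratio

    # Compare original_width / original_height with ratio_w / ratio_h
    # using integer cross-multiplication to avoid float error.
    if original_width * ratio_h > original_height * ratio_w:
        # Original is wider than the target: keep full height, narrow the width.
        crop_height = original_height
        crop_width = original_height * ratio_w // ratio_h
    else:
        # Original is as tall or taller than the target: keep full width.
        crop_width = original_width
        crop_height = original_width * ratio_h // ratio_w
    return crop_width, crop_height


# A 1920x1080 landscape video cropped to a 9:16 portrait window.
print(crop_dimensions(1920, 1080, (9, 16)))
```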

Properties

Name
crop_width
Type
int
Description
The width of the resized video in number of pixels.
Name
crop_height
Type
int
Description
The height of the resized video in number of pixels.
Name
original_width
Type
int
Description
The width of the original video in number of pixels.
Name
original_height
Type
int
Description
The height of the original video in number of pixels.
Name
segments
Type
List[Segment]
Description
The list of Segments providing the crop coordinates and times.

Methods

Name
copy
Type
-> Crops
Description
Returns a copy of the Crops instance.
Name
to_dict
Type
-> dict
Description
Returns a dictionary representation of the Crops instance.

Segment Class

A Segment is defined over an interval of time in the video and gives the x-y coordinate of the top-left corner of a rectangle, crop_width by crop_height pixels, that frames the current speaker.
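
Because a segment is just a top-left corner plus the shared crop size, it maps naturally onto ffmpeg's crop filter, whose positional form is crop=w:h:x:y. The sketch below builds that filter string and checks that the rectangle stays inside the original frame. It is an illustration only: ffmpeg_crop_filter is a hypothetical helper, and this is not necessarily how MediaEditor.resize_video builds its ffmpeg command.

```python
def ffmpeg_crop_filter(x, y, crop_width, crop_height, original_width, original_height):
    """Build an ffmpeg crop filter string for one segment's rectangle."""
    if x < 0 or y < 0:
        raise ValueError("crop corner must not be negative")
    if x + crop_width > original_width or y + crop_height > original_height:
        raise ValueError("crop rectangle extends past the original frame")
    # ffmpeg's crop filter takes width:height:x:y, in that order.
    return f"crop={crop_width}:{crop_height}:{x}:{y}"


# A 607x1080 crop starting at x=656 in a 1920x1080 video.
print(ffmpeg_crop_filter(656, 0, 607, 1080, 1920, 1080))
```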

Properties

Name
x
Type
int
Description
The x coordinate of the top left corner of the crop from the original video.
Name
y
Type
int
Description
The y coordinate of the top left corner of the crop from the original video.
Name
start_time
Type
float
Description
The start time of the segment in seconds.
Name
end_time
Type
float
Description
The end time of the segment in seconds.
Name
speakers
Type
List[int]
Description
The list of speaker identifiers in this segment. Each identifier uniquely represents a speaker across the entire video.

Methods

Name
copy
Type
-> Segment
Description
Returns a copy of the Segment instance.
Name
to_dict
Type
-> dict
Description
Returns a dictionary representation of the Segment properties.
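
To show how the documented properties and methods fit together, here is a sketch of stand-in classes built with dataclasses. SegmentSketch and CropsSketch are hypothetical names, not the library's classes; the exact keys returned by the real to_dict and the segment_at lookup helper are assumptions added for illustration.

```python
from copy import deepcopy
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SegmentSketch:
    """Stand-in mirroring the documented Segment properties."""
    x: int
    y: int
    start_time: float
    end_time: float
    speakers: List[int] = field(default_factory=list)

    def copy(self) -> "SegmentSketch":
        return deepcopy(self)

    def to_dict(self) -> dict:
        return asdict(self)


@dataclass
class CropsSketch:
    """Stand-in mirroring the documented Crops properties."""
    original_width: int
    original_height: int
    crop_width: int
    crop_height: int
    segments: List[SegmentSketch] = field(default_factory=list)

    def copy(self) -> "CropsSketch":
        return deepcopy(self)

    def to_dict(self) -> dict:
        # asdict recurses into the nested segment dataclasses.
        return asdict(self)

    def segment_at(self, time: float) -> Optional[SegmentSketch]:
        """Hypothetical helper: the segment covering `time`, or None."""
        for segment in self.segments:
            if segment.start_time <= time < segment.end_time:
                return segment
        return None


crops = CropsSketch(1920, 1080, 607, 1080, [
    SegmentSketch(656, 0, 0.0, 2.5, [0]),
    SegmentSketch(1313, 0, 2.5, 5.0, [1]),
])
print(crops.segment_at(3.0))
print(crops.to_dict()["segments"])
```

Using deepcopy in copy means edits to a copied instance's segments do not leak back into the original.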
