Resize

The resizing feature relies on three subcomponents:

  1. Speaker diarization with Pyannote
  2. Scene change detection with PySceneDetect
  3. Face detection with MTCNN and MediaPipe

Together, these libraries are used to dynamically resize a video so that it focuses on whoever is speaking at any given moment. For a detailed explanation of the algorithm, read here.


Usage

The following returns the information needed to resize the video.

from clipsai import resize

crops = resize(
    video_file_path="/abs/path/to/video.mp4",
    pyannote_auth_token="pyannote_token",
    aspect_ratio=(9, 16)
)

print("Crops: ", crops.segments)

To resize the video using the returned crops, run the following code.

import clipsai

media_editor = clipsai.MediaEditor()

# Use AudioVideoFile if the file contains both audio and video streams
media_file = clipsai.AudioVideoFile("/abs/path/to/video.mp4")
# Use VideoFile instead if the file contains a video stream only
# media_file = clipsai.VideoFile("/abs/path/to/video_only_file.mp4")

resized_video_file = media_editor.resize_video(
    original_video_file=media_file,
    resized_video_file_path="/abs/path/to/resized/video.mp4",  # doesn't exist yet
    width=crops.crop_width,
    height=crops.crop_height,
    segments=crops.to_dict()["segments"],
)
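
To illustrate what the returned segments contain, the sketch below turns a Crops-style dictionary into one ffmpeg crop filter string per segment. It is not how MediaEditor.resize_video is implemented. The dictionary keys (crop_width, crop_height, segments, x, y, start_time, end_time) are assumed to mirror the property names documented on this page, and the helper crop_filters and the sample values are hypothetical.

```python
def crop_filters(crops_dict):
    """Return (start_time, end_time, filter_string) tuples, one per segment."""
    width = crops_dict["crop_width"]
    height = crops_dict["crop_height"]
    filters = []
    for seg in crops_dict["segments"]:
        # ffmpeg's crop filter takes out_w:out_h:x:y, with (x, y) the top-left corner.
        vf = f"crop={width}:{height}:{seg['x']}:{seg['y']}"
        filters.append((seg["start_time"], seg["end_time"], vf))
    return filters


# Hypothetical data shaped like the assumed output of crops.to_dict()
sample = {
    "crop_width": 607,
    "crop_height": 1080,
    "segments": [
        {"x": 200, "y": 0, "start_time": 0.0, "end_time": 4.2, "speakers": [0]},
        {"x": 1100, "y": 0, "start_time": 4.2, "end_time": 9.0, "speakers": [1]},
    ],
}

for start, end, vf in crop_filters(sample):
    print(f"{start:.1f}-{end:.1f}s: -vf {vf}")
```

Each tuple pairs a time interval with the crop to apply during it, which is the information a video editor needs to cut and crop the source segment by segment.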

Resize Function

  • Name
    resize
    Type
    -> Crops
    Description

    Computes the crops needed to dynamically resize a video to a specified aspect ratio (default 9:16) so that it focuses on the current speaker.

Required Parameters

  • Name
    video_file_path: string
    Description

    Absolute path to the video file to resize.

  • Name
    pyannote_auth_token: string
    Description

    Authentication token for Pyannote, obtained from HuggingFace.

Optional Parameters

  • Name
    aspect_ratio: tuple[int, int] = (9, 16)
    Description

    The target aspect ratio for resizing the video (width, height). Default is (9, 16).

  • Name
    min_segment_duration: float = 1.5
    Description

    The minimum duration in seconds for a diarized speaker segment to be considered. Default is 1.5.

  • Name
    samples_per_segment: int = 13
    Description

    The number of samples to take per speaker segment for face detection. Default is 13. Reduce this for faster performance (at the cost of accuracy).

  • Name
    face_detect_width: int = 960
    Description

    The width in pixels to which the video will be downscaled for face detection. Smaller widths detect faster, but may be less accurate. Default is 960.

  • Name
    face_detect_margin: int = 20
    Description

    Margin around detected faces, used in the MTCNN face detector. Default is 20.

  • Name
    face_detect_post_process: bool = False
    Description

    If set to True, post-processing is applied to the face detection output to make it appear more natural. Default is False.

  • Name
    n_face_detect_batches: int = 8
    Description

    Number of batches for processing face detection when using GPUs. This is vital for proper memory allocation. Default is 8.

  • Name
    min_scene_duration: float = 0.25
    Description

    Minimum duration in seconds for a scene to be considered during scene detection. Default is 0.25.

  • Name
    scene_merge_threshold: float = 0.25
    Description

    Threshold in seconds for merging scene changes with speaker segments. Default is 0.25.

  • Name
    time_precision: int = 6
    Description

    Precision (number of decimal places) for the start and end times of the segments. Default is 6. Fewer than 4 decimal places may cause rounding errors when cropping the video with ffmpeg.

  • Name
    device: string (cuda | cpu) = None
    Description

    PyTorch device to perform computations on. Default is None, which auto detects the correct device.
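
As a rough illustration of what aspect_ratio implies for the output size, the following sketch computes the largest rectangle of a given ratio that fits inside the original frame. The helper largest_crop is hypothetical and the rounding rule (integer floor) is an assumption; clipsai's own calculation may differ.

```python
def largest_crop(original_width, original_height, aspect_ratio=(9, 16)):
    """Largest (width, height) with the target ratio that fits in the frame."""
    ratio_w, ratio_h = aspect_ratio
    # Try using the full height of the frame first.
    width = original_height * ratio_w // ratio_h
    if width <= original_width:
        return width, original_height
    # Otherwise the frame is too narrow, so use the full width instead.
    height = original_width * ratio_h // ratio_w
    return original_width, height


print(largest_crop(1920, 1080))          # landscape source, 9:16 target
print(largest_crop(1080, 1920, (1, 1)))  # portrait source, square target
```

For a 1920x1080 source and the default 9:16 target, the crop keeps the full height and narrows the width, which is why only the x coordinate needs to move to follow a speaker in that case.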


Crops Class

Represents the resizing information for an entire video including the video's original width and height dimensions, the video's resized width and height dimensions, and the segments of the video for focusing on the current speaker. Segments are defined over an interval of time, providing the x-y coordinate of the top left corner of a rectangle with pixel dimensions crop_width by crop_height to focus on the current speaker.

Properties

  • Name
    crop_width
    Type
    int
    Description

    The width of the resized video in number of pixels.

  • Name
    crop_height
    Type
    int
    Description

    The height of the resized video in number of pixels.

  • Name
    original_width
    Type
    int
    Description

    The width of the original video in number of pixels.

  • Name
    original_height
    Type
    int
    Description

    The height of the original video in number of pixels.

  • Name
    segments
    Type
    List[Segment]
    Description

    The list of Segments providing the crop coordinates and times.
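
The relationship between these properties can be expressed as a bounds check: every segment's rectangle, anchored at (x, y) with size crop_width by crop_height, should lie inside the original frame. The sketch below uses stand-in dataclasses named after the documented properties rather than the real clipsai classes, and out_of_bounds is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:  # stand-in for illustration, not clipsai's Segment
    x: int
    y: int
    start_time: float
    end_time: float
    speakers: List[int] = field(default_factory=list)


@dataclass
class Crops:  # stand-in for illustration, not clipsai's Crops
    original_width: int
    original_height: int
    crop_width: int
    crop_height: int
    segments: List[Segment]


def out_of_bounds(crops):
    """Return the segments whose crop rectangle leaves the original frame."""
    bad = []
    for seg in crops.segments:
        fits_x = 0 <= seg.x and seg.x + crops.crop_width <= crops.original_width
        fits_y = 0 <= seg.y and seg.y + crops.crop_height <= crops.original_height
        if not (fits_x and fits_y):
            bad.append(seg)
    return bad


crops = Crops(1920, 1080, 607, 1080, [
    Segment(200, 0, 0.0, 4.2, [0]),
    Segment(1400, 0, 4.2, 9.0, [1]),  # 1400 + 607 exceeds 1920
])
print(out_of_bounds(crops))
```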

Methods

  • Name
    copy
    Type
    -> Crops
    Description

    Returns a copy of the Crops instance.

  • Name
    to_dict
    Type
    -> dict
    Description

    Returns a dictionary representation of the Crops instance.
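
Since face detection and diarization can be slow, it may be useful to cache the dictionary form of the crops. A minimal sketch, assuming the dictionary holds only JSON-compatible values (numbers, lists, nested dictionaries); the sample dictionary is hypothetical and only shaped like the assumed output of to_dict.

```python
import json

# Hypothetical dictionary shaped like the assumed output of Crops.to_dict()
crops_dict = {
    "original_width": 1920,
    "original_height": 1080,
    "crop_width": 607,
    "crop_height": 1080,
    "segments": [
        {"x": 200, "y": 0, "start_time": 0.0, "end_time": 4.2, "speakers": [0]},
    ],
}

# Serialize to a string (write it to a file to cache it) and load it back.
encoded = json.dumps(crops_dict, indent=2)
restored = json.loads(encoded)
print(restored == crops_dict)
```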


Segment Class

Segments are defined over an interval of time in the video, providing the x-y coordinate of the top left corner of a rectangle with pixel dimensions crop_width by crop_height to focus on the current speaker.

Properties

  • Name
    x
    Type
    int
    Description

    The x coordinate of the top left corner of the crop from the original video.

  • Name
    y
    Type
    int
    Description

    The y coordinate of the top left corner of the crop from the original video.

  • Name
    start_time
    Type
    float
    Description

    The start time of the segment in seconds.

  • Name
    end_time
    Type
    float
    Description

    The end time of the segment in seconds.

  • Name
    speakers
    Type
    List[int]
    Description

    The list of speaker identifiers present in this segment. Each identifier uniquely represents a speaker across the entire video.
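
Because each segment covers a time interval, a common task is finding the crop that applies at a given timestamp. A minimal sketch, assuming the segments are sorted by start_time, do not overlap, and are treated as half-open intervals [start_time, end_time); the dictionaries are hypothetical stand-ins for segment data and segment_at is not part of clipsai.

```python
from bisect import bisect_right


def segment_at(segments, t):
    """Return the segment whose [start_time, end_time) contains t, else None."""
    starts = [seg["start_time"] for seg in segments]
    # Index of the last segment starting at or before t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < segments[i]["end_time"]:
        return segments[i]
    return None


segments = [
    {"x": 200, "y": 0, "start_time": 0.0, "end_time": 4.2, "speakers": [0]},
    {"x": 1100, "y": 0, "start_time": 4.2, "end_time": 9.0, "speakers": [1]},
]

print(segment_at(segments, 4.2)["speakers"])
print(segment_at(segments, 9.5))
```

With the half-open convention, a timestamp that falls exactly on a boundary belongs to the segment that starts there, so no instant is claimed by two crops.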

Methods

  • Name
    copy
    Type
    -> Segment
    Description

    Returns a copy of the Segment instance.

  • Name
    to_dict
    Type
    -> dict
    Description

    Returns a dictionary representation of the Segment properties.