
Maowen Tang Master’s Thesis Defense, Wednesday, April 22, 2026 @ 11:00 am Central Time
April 22 @ 11:00 am - 12:00 pm
COMMITTEE CHAIR: Dr. Yonghui Wang
TITLE: STRUCTURED REPRESENTATION LEARNING FOR GENERALIZABLE DEEPFAKE VIDEO DETECTION
ABSTRACT: Deepfake video detection has become an important problem in multimedia forensics as modern generative models produce increasingly realistic facial manipulations. Although many existing detectors achieve strong performance on the datasets on which they are trained, their performance often degrades substantially on unseen manipulation methods and evaluation conditions. A major reason for this limitation is the premature collapse of spatial structure: many vision-transformer-based detectors aggregate patch tokens into a single global representation, thereby suppressing the localized and temporally uneven forensic cues that characterize manipulated video. This thesis presents the Spatio-Temporal Slot Aggregation Network (ST-SAN), a video-level deepfake detection framework designed to preserve structured forensic evidence before final classification. ST-SAN extracts patch tokens from three intermediate layers of a frozen DINOv2 backbone and aligns them through a lightweight bottleneck projection. A K-slot soft aggregation module then forms multiple learned slot summaries for each frame, allowing the model to retain several localized views of manipulation evidence instead of collapsing all patch information into one vector. These slot features are further integrated through adaptive frame weighting and slot weighting so that frames and slot summaries with stronger forensic content contribute more to the final decision. Training is stabilized by structural regularization terms that encourage locality, orthogonality, diversity across slot summaries, and weak coverage. Experiments show that ST-SAN achieves 0.960 AUC on FaceForensics++ under in-domain evaluation. Under cross-domain evaluation, it achieves 0.917 AUC on Celeb-DF v2, 0.872 AUC on DeepFakeDetection, and 0.890 AUC on the DeepFake Detection Challenge Preview dataset, indicating competitive cross-domain performance on the reported benchmarks.
Ablation results show that parallel soft slot aggregation is an important architectural component, while adaptive weighting and structural regularization help stabilize the full model. These findings indicate that preserving multiple localized forensic summaries prior to video-level classification is a promising strategy for improving robustness under cross-domain evaluation in deepfake video detection.
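As a rough illustration of the idea of K-slot soft aggregation followed by adaptive weighting, the sketch below may help. It is not the thesis implementation: the abstract gives no formulas, so the dot-product scoring, the per-patch softmax over slots, and the names soft_slot_aggregate, weighted_pool, and slot_queries are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_slot_aggregate(tokens, slot_queries):
    """Hypothetical K-slot soft aggregation for one frame.

    tokens: N patch vectors of dimension D.
    slot_queries: K vectors of dimension D (assumed learnable in a real model).
    Returns K slot summaries, each a weighted mean of the patch tokens.
    """
    # Assumption: each patch spreads one unit of weight across the K slots.
    assign = [softmax([dot(t, q) for q in slot_queries]) for t in tokens]
    dim = len(tokens[0])
    slots = []
    for k in range(len(slot_queries)):
        mass = sum(a[k] for a in assign)  # always > 0 because softmax is positive
        slots.append([
            sum(a[k] * t[d] for a, t in zip(assign, tokens)) / mass
            for d in range(dim)
        ])
    return slots


def weighted_pool(vectors, scores):
    """Softmax-weighted average: higher-scoring vectors contribute more.

    Stands in for the adaptive frame or slot weighting named in the abstract;
    how the scores are actually produced is not specified there.
    """
    w = softmax(scores)
    dim = len(vectors[0])
    return [sum(wi * v[d] for wi, v in zip(w, vectors)) for d in range(dim)]


# Toy frame: two patches near the first axis, one on the second axis.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
slot_queries = [[4.0, 0.0], [0.0, 4.0]]
slots = soft_slot_aggregate(tokens, slot_queries)
frame_vector = weighted_pool(slots, scores=[0.0, 0.0])
```

In this toy case each slot summary stays close to the group of patches it attends to, rather than all three patches being averaged into one vector, which is the behavior the abstract contrasts with single-vector global pooling.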
Keywords: Deepfake detection, video forensics, representation learning, vision transformer, slot aggregation, adaptive weighting.
Room Location: S. R. Collins Building, Room 111L

