Text-Based Temporal Localization of Novel Events
"Recent works on text-based localization of moments have shown high accuracy on several benchmark datasets. However, these approaches are trained and evaluated relying on the assumption that the localization system, during testing, will only encounter events that are available in the training set (i.e., seen events). As a result, these models are optimized for a fixed set of seen events and they are unlikely to generalize to the practical requirement of localizing a wider range of events, some of which may be unseen. Moreover, acquiring videos and text comprising all possible scenarios for training is not practical. In this regard, this paper introduces and tackles the problem of text-based temporal localization of novel/unseen events. Our goal is to temporally localize video moments based on text queries, where both the video moments and text queries are not observed/available during training. Towards solving this problem, we formulate the inference task of text-based localization of moments as a relational prediction problem, hypothesizing a conceptual relation between semantically relevant moments, e.g., a temporally relevant moment corresponding to an unseen text query and a moment corresponding to a seen text query may contain shared concepts. The likelihood of a candidate moment to be the correct one based on an unseen text query will depend on its relevance to the moment corresponding to the semantically most relevant seen query. Empirical results on two text-based moment localization datasets show that our proposed approach can reach up to 15% absolute improvement in performance compared to existing localization approaches."