Sequential Vision to Language as Story: A Storytelling Dataset and Benchmarking
Sequential Vision to Language as Story: A Storytelling Dataset and Benchmarking
Blog Article
Storytelling is a remarkable human skill that plays a significant role in learning and experiencing everyday life.Developing narratives is central to human mental health development, simultaneously encapsulating broad details such as psychology, morality and common sense.Contemporary deep-learning algorithms require similar skills to be able to tell a story from a visual perspective.However, most algorithms function at a superficial or factual level, aligning descriptive text with images in a one-to-one manner without considering the temporal relation.Stories are more expressive in style, language and content, involving imaginary concepts not explicit dodge warlord for sale in the images.
An ideal deep learning system should learn and develop cohesive, meaningful, and causal stories.Unfortunately, most existing storytelling methods are trained and evaluated on a single dataset, i.e., the VIsual STorytelling (VIST) dataset.Multiple datasets are essential to test the generalization ability of algorithms.
We bridge the gap and present a new dataset for expressive and coherent story creation.We present the Sequential Storytelling Image Dataset (SSID, https://ieee-dataport.org/documents/sequential-storytelling-image-dataset-ssid) consisting of open-source video frames accompanied by story-like annotations.We provide four annotations (stories) for each set of five images.The image sets are collected manually from publicly available videos in three domains: documentaries, lifestyle, and movies, and then annotated manually using Amazon Mechanical Turk.
We perform a detailed analysis and benchmarking of the current VIST dataset and our new SSID dataset and show that both datasets exhibit high variance within their multiple ground truth stories corresponding to the same image set.Moreover, our dataset achieves lower mean average scores across all metrics, click here meaning that the ground truth stories of our dataset are more diverse.Finally, we train and evaluate existing state-of-the-art rhetorical storytelling methods on both datasets and show that our dataset is more challenging and requires sophisticated techniques to accurately detect a significant variety of events.