Developing scripts for crawling, monitoring, extracting and fetching the data and content needed for a target task takes considerable effort, and curating, cleaning and labelling (aka annotating) the collected datasets takes even more. Content labelling is particularly challenging because the task is subjective: different people may perceive the same content as belonging to different categories. Specific annotation tasks raise additional issues, e.g. when dealing with NSFW and disturbing content.
Putting the above issues aside, a major challenge that we, as scientists, face when working with data and content from online sources is its ephemeral nature. Online data and information may cease to be available at its source. For instance, YouTube users may opt to delete one or more of their previously uploaded videos for a variety of reasons. An even more common case is that media platforms take down content for violating their terms of service or infringing copyright. Such a case happened in April 2018, when YouTube removed 5 million videos on the basis of content violation.
To demonstrate the issue we face, we would like to share our experience and concerns regarding two datasets that we recently (2019) created to support research on the problem of online media verification:
- The Fake Video Corpus (FVC) (Papadopoulou et al., 2019), a dataset of verified and debunked user-generated videos from YouTube, Facebook and Twitter. The FVC dataset contained 200 fake video cases and 180 real video cases. Following the semi-automatic procedure described in Papadopoulou et al. (2019), we collected 3,262 near duplicates of the above fake videos and 1,933 of the above real videos. By the time it was publicly released in February 2019, a total of 5,575 videos from YouTube, Facebook and Twitter were available online.
- The FIVR-200K dataset (Kordopatis-Zilos et al., 2019) was collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). It offers a single means of evaluating several video retrieval tasks as special cases. FIVR-200K, which was collected in March 2018, consisted of 225,960 YouTube videos. The videos cover the period from January 1st 2013 to December 31st 2017 and were collected following the procedure described in Kordopatis-Zilos et al. (2019).
| Total videos | Available videos (January 2020) | Unavailable videos |
Because the unavailable videos can no longer be retrieved from their source, they had to be removed from the corresponding dataset releases. The ephemerality of online content concerns the research community because creating such a dataset takes considerable effort and losing part of it severely harms the reproducibility of the corresponding experiments. Potential solutions that we see adopted by the research community include, among others, the following:
- Release extracted features from the original media collection: This practice is common in the computer vision and multimedia communities. Given the prevalence of standard and widely used feature extractors, both hand-crafted (SIFT, SURF) and neural network based (VGG, ResNet, Xception, etc.), it is still possible to perform a variety of interesting experiments when one has access to the extracted features but not the original content. On the downside, maintaining and distributing a variety of features from massive media collections is expensive in terms of storage and bandwidth, while the rapid development of new feature extractors is expected to soon make such releases obsolete.
- Release reduced versions of the original media collection: This involves periodically updating the original collection to remove any media items that are no longer available online. In cases where a sizeable portion of the original collection remains available, this is an acceptable approach. However, as became clear from our above discussion, large reductions (on the order of 20%) are very likely soon (within one year) after a dataset is released. This is especially problematic when the original dataset contains low-frequency classes that are of interest to the target task.
- Retain and privately share the original collection: Even though this practice is in clear violation of most platforms’ Terms of Service and pertinent regulation (including copyright law and GDPR), we regularly see it happen among members of the research community, because it is the only approach that ensures full reproducibility of research based on the original collection. The fact that this practice is still common despite its lack of legal basis may be a telltale sign of researchers’ anxiety to ensure that their research remains reproducible and relevant in the long run.
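To make the trade-off of the reduced-release option more concrete, the following is a minimal Python sketch of how such a release could be produced and audited. It is only an illustration under stated assumptions, not the procedure we used: the function name `reduce_release`, the `(video_id, label)` item layout and the `is_available` callable are all hypothetical, and the availability check itself (e.g. a request to the hosting platform) is deliberately left to the caller.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Item = Tuple[str, str]  # (video_id, label); this layout is an assumption


def reduce_release(
    items: Iterable[Item],
    is_available: Callable[[str], bool],
) -> Tuple[List[Item], Dict[str, float]]:
    """Keep only the items that are still online and report per-class loss.

    `is_available` is a hypothetical callable supplied by the caller; it
    should return True when the video can still be retrieved at its source.
    """
    original = list(items)
    kept = [(vid, label) for vid, label in original if is_available(vid)]

    total = Counter(label for _, label in original)
    remaining = Counter(label for _, label in kept)

    # Fraction of each class that disappeared; low-frequency classes with
    # a high loss are the ones that make a reduced release problematic.
    loss = {label: 1 - remaining[label] / n for label, n in total.items()}
    return kept, loss


# Toy usage with made-up identifiers; `offline` stands in for a real check.
toy = [("a1", "fake"), ("a2", "fake"), ("a3", "real"), ("a4", "real")]
offline = {"a2"}
toy_kept, toy_loss = reduce_release(toy, lambda vid: vid not in offline)
```

Reporting the loss per class, rather than only the overall reduction, makes it visible when a class of interest has shrunk far more than the collection as a whole.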
Papadopoulou, O., Zampoglou, M., Papadopoulos, S. & Kompatsiaris, I. (2019), “A corpus of debunked and verified user-generated videos”, Online Information Review, Vol. 43 No. 1, pp. 72-88.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I. & Kompatsiaris, I. (2019), “FIVR: Fine-grained Incident Video Retrieval”, IEEE Transactions on Multimedia, Vol. 21 No. 10, pp. 2638-2652.