Video Verification
The growing use of the Internet in our everyday life and the uncontrolled dissemination of user-generated videos (UGVs) through social media and video platforms raise increasing concerns about the spread of disinformation. Given a set of debunked and verified UGVs, participants should develop a system that decides whether a video is accurate ("real") or misleading ("fake").
Challenge Description
Develop a system that will label user-generated videos as real or fake
The goal of an automatic video verification system is to provide a probability score that represents the level of credibility for a Web video.
Existing approaches to tweet verification [1] have shown that textual features (number of words, number of uppercase characters, sentence length, etc.) extracted from the tweet text can characterize a tweet as real or fake. We adapt this approach to the video title and description and train an SVM classifier.
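As a rough sketch of this feature-extraction idea (the function name and the tiny slang/sentiment lexicons below are illustrative placeholders, not the challenge's actual code or lexicons):

```python
# Illustrative extraction of tweet-style textual features from a video
# title; the slang and sentiment lexicons are toy placeholders.
SLANG = {"lol", "omg", "wtf"}            # assumption: toy slang lexicon
POSITIVE = {"great", "amazing", "love"}  # assumption: toy sentiment lexicons
NEGATIVE = {"fake", "hoax", "scam"}

def title_features(title):
    words = [w.lower().strip(".,!?:") for w in title.split()]
    return {
        "text_length": len(title),
        "num_words": len(title.split()),
        "has_question_mark": "?" in title,
        "has_exclamation_mark": "!" in title,
        "num_uppercase": sum(c.isupper() for c in title),
        "num_question_marks": title.count("?"),
        "num_exclamation_marks": title.count("!"),
        "has_colon": ":" in title,
        "num_positive": sum(w in POSITIVE for w in words),
        "num_negative": sum(w in NEGATIVE for w in words),
        "num_slang": sum(w in SLANG for w in words),
    }

feats = title_features("SHOCKING! Is this video fake?")
```

Each title is thus reduced to a fixed-length numeric vector that a standard classifier can consume.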
Fake video corpus
For this task, we created a subset of the Fake Video Corpus (FVC) [2] with videos that come from YouTube and are available online.
The videos are organized in cascades, where a cascade consists of the first instance of a video and near-duplicate instances that convey the same or almost the same content. The dataset is split into a training and a test set. During the creation of these sets, the videos were partitioned in such a way that all videos of a cascade belong to the same set.
Baseline approach
The features of Table 1 are extracted and used to train a two-class RBF SVM. These include features that describe the uploader channel, and also text-based features from the video title.
Table 1: Features used by the baseline classifier.

From video title | From channel metadata |
---|---|
Text length | Channel view count |
Number of words | Channel subscriber count |
Contains question mark (Boolean) | Channel video count |
Contains exclamation mark (Boolean) | Channel comment count |
Contains 1st person pronoun (Boolean) | |
Contains 2nd person pronoun (Boolean) | |
Contains 3rd person pronoun (Boolean) | |
Number of uppercase characters | |
Number of positive sentiment words | |
Number of negative sentiment words | |
Number of slang words | |
Contains ':' symbol (Boolean) | |
Number of question marks | |
Number of exclamation marks | |
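The training step can be illustrated with a minimal sketch, assuming scikit-learn is available; the feature values below are made up for illustration and are not the actual yt_vf.csv data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: one row per video, columns in the spirit of Table 1
# (title length, word count, uppercase count, channel view count).
# All values and labels are made up for illustration.
X = np.array([
    [60, 12, 25, 100],
    [55, 11, 30, 200],
    [58, 10, 28, 150],
    [20,  4,  1, 90000],
    [25,  5,  2, 80000],
    [22,  4,  0, 85000],
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = fake, 0 = real

# The RBF kernel is scale-sensitive, so features are standardized first.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

preds = clf.predict(X)
```

Standardization matters here because raw channel counts dwarf the text-based features in scale and would otherwise dominate the RBF kernel.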
Provided code:
- LoadData.py: loads both the training and test data. You can choose between loading the pre-extracted features of Table 1 or the metadata responses, which contain the video metadata (video title, description, comments, etc.)
- Training.py: executes the baseline method by loading the pre-extracted features.
- FeatureExtraction.py: takes the video title and channel ID as input and extracts the features of Table 1. A YouTube API key is required.
- Youtube_api_connection.py: calls the YouTube API when an unseen video is submitted. A YouTube API key is required.
- Evaluation.py: takes a txt file with the results of your approach (video_id prediction actual) and returns the F-score.
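For reference, the F-score computation over such a results file can be sketched as follows, treating class 1 ("fake") as the positive class — an assumption to verify against Evaluation.py:

```python
# Compute the F-score from result lines in the
# "video_id prediction actual" format expected by Evaluation.py.
def f_score_from_lines(lines):
    tp = fp = fn = 0
    for line in lines:
        _video_id, pred, actual = line.split()
        pred, actual = int(pred), int(actual)
        if pred == 1 and actual == 1:
            tp += 1
        elif pred == 1 and actual == 0:
            fp += 1
        elif pred == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy results: one true positive, one false positive, one false negative.
score = f_score_from_lines(["v1 1 1", "v2 1 0", "v3 0 1", "v4 0 0"])
```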
Input
The dataset contains 330 cascades in total (181 fake / 149 real), split into 230 cascades (126 fake / 104 real) for training and 100 (55 fake / 45 real) for testing.
Number of videos:
- Training set : 1530 (1006 fake / 524 real)
- Test set: 675 (395 fake / 280 real)
Provided files:
- train_idx.txt and test_idx.txt contain the cascade IDs for the training and test set respectively.
- cascade_ids_all.txt contains the IDs of the videos and the cascades they belong to.
- yt_vf.csv contains the features of Table 1.
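Assembling the video-level train/test split from these files can be sketched as below; the exact file formats (one cascade ID per line, "video_id cascade_id" pairs) are assumptions to check against the downloaded data:

```python
# Sketch of a cascade-respecting split: every video is assigned to the
# same set as its cascade, mirroring the provided partition.
def split_videos(video_cascade_pairs, train_cascades, test_cascades):
    train, test = [], []
    for video_id, cascade_id in video_cascade_pairs:
        if cascade_id in train_cascades:
            train.append(video_id)
        elif cascade_id in test_cascades:
            test.append(video_id)
    return train, test

# Toy IDs for illustration: cascade c1 is in training, c2 in testing.
pairs = [("v1", "c1"), ("v2", "c1"), ("v3", "c2")]
train_videos, test_videos = split_videos(pairs, {"c1"}, {"c2"})
```

Keeping whole cascades in one set prevents near-duplicate videos from leaking between training and testing.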
Output
- The output should contain a binary label (0 or 1) for every video of the test set. Result file format: 'video_id' 'prediction' (tab separated, one line per video).
- Predictions are evaluated with the F-score, so that they can be compared to the current best results.
- Create a .zip file which contains your code and a text file with the predictions.
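Writing the result file in the required tab-separated format can be sketched as follows (the video IDs and labels below are made up):

```python
# Write predictions as one "video_id<TAB>prediction" line per test video.
predictions = {"video_a": 1, "video_b": 0}  # illustrative ids and labels

with open("predictions.txt", "w") as out:
    for video_id, label in predictions.items():
        out.write(f"{video_id}\t{label}\n")
```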
Upload
- On the right side of the webpage there is the Hackathonist Details field. Fill in your first name, last name, and email.
- Upload your .zip file.
- Click the Submit button.
Leaderboard
Accept the challenge and achieve better results!
Team name | Run | Precision | Recall | F-score |
---|---|---|---|---|
MKLab | Baseline | 0.63 | 0.93 | 0.75 |
Further reading:
Word embeddings are used in order to capture as much of the semantic, morphological, contextual, hierarchical, etc. information as possible. Depending on the task, some methods perform better than others.
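A toy sketch of turning a title into an embedding-based feature vector by averaging its word vectors; the 3-dimensional vectors below are made up, whereas a real system would load pretrained embeddings such as word2vec or GloVe:

```python
import numpy as np

# Made-up 3-dimensional word vectors for illustration only.
embeddings = {
    "breaking": np.array([0.2, 0.1, 0.7]),
    "news":     np.array([0.3, 0.2, 0.5]),
    "video":    np.array([0.1, 0.6, 0.3]),
}

def title_vector(title, embeddings, dim=3):
    # Average the vectors of in-vocabulary words in the title.
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # fallback for fully out-of-vocabulary titles
    return np.mean(vectors, axis=0)

vec = title_vector("Breaking news video", embeddings)
```

The averaged vector can then replace or complement the hand-crafted features of Table 1 as classifier input.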
References:
Contact:
Download Code Files
Download ContVer.zip to get the provided Input Data and code.