
SoBigData Articles

Researching existing indexes and data sources of problematic or factual content: TNA visit in Sheffield

Sheffield is a city of tailor-made tools. Renowned for its steel craftsmanship, it was apparently once hailed as the steel-making capital of the world, and its cutlery even earned a mention in the Canterbury Tales. No pair of scissors will top the ones you can still get in Sheffield to this day, although customers are warned that the waiting time for delivery can be up to ten weeks.

 

While cutlery-making and blacksmithing still survive in Sheffield, they now mostly serve as reminders of the city’s rich steelmaking past. No doubt, a page in Sheffield’s craftsmanship history has turned. My short visit made me realise it has perhaps turned towards other kinds of tools, such as those produced by the GATE team at the University of Sheffield.


I stayed in the "Steel City", as it is sometimes called, between December 2023 and March 2024. During this time, I had the pleasure of participating in the SoBigData++ project through a TNA visit organised by the Computer Science department at the University of Sheffield, under the supervision of Prof. Kalina Bontcheva. It proved to be a fantastic opportunity, organised around productive working and show-and-tell sessions and punctuated by coffee breaks in which I could get to know fellow researchers, the UoS staff and the city itself.


The goal of this project was to collaborate with GATE team members to expand the labelling capacity of GATE’s domain credibility analysis tool. This service, simply put, gathers information from various sources to assist users in determining the credibility of a domain or social media account. 

 

The judgement, however, is not made by the tool itself but is provided by multiple organisations, such as fact-checking or media-monitoring institutions, which work to assess the credibility of online content. At the moment, the URL domain service collects data from the following sources:

 

  • OpenSources
  • Duke Reporters’ Lab
  • The Database of Known Fakes
  • Global Disinformation Index
  • EuVsDisinfo


These are categorised as "positive", "caution", or "mixed" sources. Note that this classification does not imply that the sources themselves are positive or negative. Rather, it indicates whether each source assigns positive labels, negative labels, or both to an outlet or a publisher. For example, an AFP fact-checking URL will be labelled "positive" because it is associated with a high-profile publisher accredited by the IFCN and produces thoroughly scrutinised content. Conversely, a website previously fact-checked and flagged for lacking transparency and having low editorial standards will receive a negative label.
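
To make the lookup concrete, here is a minimal sketch in Python of how a domain might be checked against several such source lists, with each list reporting the label it assigns. This is not the actual GATE implementation; the list names, entries and labels are invented for illustration.

```python
# Illustrative sketch only: these source names, entries and labels are made up,
# not the real data behind the GATE domain credibility service.
SOURCE_LISTS = {
    "example-positive-list": {"afp.com": "positive"},
    "example-caution-list": {"dubious-news.example": "negative"},
}

def lookup(domain: str) -> dict:
    """Return {source_name: label} for every source list that knows this domain."""
    return {
        source: entries[domain]
        for source, entries in SOURCE_LISTS.items()
        if domain in entries
    }

print(lookup("afp.com"))  # -> {'example-positive-list': 'positive'}
```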

 

GATE's URL annotation tools already draw on a robust collection of available sources, some of which are continuously updated. However, the researchers have admitted that while the tool holds significant research value, it currently labels less than 10% of their data on average, which often makes it impractical as a stand-alone annotation technique. My task during the SoBigData++ TNA visit was to research curated data from similar, credible institutions that could be used to improve the service. This data needed to be either publicly available or obtainable through partnerships with these organisations.

 

Throughout the visit I worked closely with PhD candidate Iknoor Singh and Research Software Engineer Ian Roberts. Though part of our work involved eliminating bugs and setting up a new GitHub repository that more technical users could deploy, the main focus was on expanding the utility of the tool. One of the pivotal moments came when our research team agreed to expand the databases not only with the 'negative' sources, so to speak, but also with the 'positive' ones. We started to think creatively about end-user goals. If a user wants to know which sources are problematic, it also helps to know which aren't. Assuming that the Domain Analysis service is being used to support research in communication, media or political studies, integrating such an elimination process gives researchers useful cues.
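
As a rough illustration of that elimination idea, the hypothetical Python sketch below sorts a batch of domains into known-positive, known-negative and unknown buckets, so that a researcher can concentrate on whatever remains unlabelled. The two sets are placeholders, not real source lists.

```python
# Illustrative sketch only: the two sets are placeholders, not real source lists.
POSITIVE = {"bbc.co.uk", "afp.com"}   # e.g. regulated or accredited outlets
NEGATIVE = {"dubious-news.example"}   # e.g. outlets previously flagged by fact-checkers

def triage(domains):
    """Split domains into known-positive, known-negative and unknown buckets."""
    buckets = {"positive": [], "negative": [], "unknown": []}
    for domain in domains:
        if domain in NEGATIVE:
            buckets["negative"].append(domain)
        elif domain in POSITIVE:
            buckets["positive"].append(domain)
        else:
            buckets["unknown"].append(domain)
    return buckets

print(triage(["bbc.co.uk", "unknown-site.example"]))
# -> {'positive': ['bbc.co.uk'], 'negative': [], 'unknown': ['unknown-site.example']}
```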

 

As a result, we decided to widen our search and explore opportunities to integrate existing databases of regulated media within the EU and beyond. Such lists are typically curated by dedicated media watchdogs in various countries, as well as by independent institutions. For instance, in the UK this responsibility falls to the Independent Press Standards Organisation, which we have reached out to for a potential collaboration. In Germany, access to such a database is facilitated through the BDZV (Bundesverband Digitalpublisher und Zeitungsverleger). There are also hobbyist projects and charities, such as the Public Interest News Foundation in the UK, which has shared its list of local media institutions with us. Simultaneously, I have been actively seeking out new 'negative' sources to enhance the tool's annotation capabilities, as per the initial plan. One significant discovery in this regard was the Iffy Index of Unreliable Sources, which aggregates credibility ratings provided by Media Bias/Fact Check. Iknoor, who oversees the technical side of the project, is currently integrating these databases into the service.
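
Integrating such an external list usually means normalising its entries so they match the plain domains the service already keys on. The sketch below is a hypothetical example that assumes a CSV with "domain" and "rating" columns and a caller-supplied mapping from ratings to the service's labels; it does not reflect the actual format of the Iffy Index or of Media Bias/Fact Check data.

```python
import csv
from urllib.parse import urlparse

def normalise_domain(url_or_domain: str) -> str:
    """Strip scheme, path and a leading 'www.' so entries match plain domain keys."""
    host = urlparse(url_or_domain).netloc or url_or_domain
    return host.lower().removeprefix("www.")

def load_external_list(path: str, rating_to_label: dict) -> dict:
    """Read a CSV with 'domain' and 'rating' columns (an assumed format) and
    return {domain: label}, skipping ratings that have no mapping."""
    labels = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            label = rating_to_label.get(row["rating"])
            if label:
                labels[normalise_domain(row["domain"])] = label
    return labels

# Example use with a hypothetical mapping:
# load_external_list("external_list.csv", {"unreliable": "negative", "credible": "positive"})
```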

 

Sheffield in winter is a frosty business, but the warm welcome I received from the entire team of brilliant and supportive academics at the University of Sheffield more than made up for the cold weather. The TNA visit held the promise of opening doors to new perspectives and potential future research partnerships, and having completed the visit, I can say that this promise was fulfilled. The TNA provided me with an invaluable opportunity to meet new people who share similar interests and expertise. They were always eager to share their knowledge and collaborate, making the experience truly enriching. I hope we will continue to work together on other projects in the future.

 

I also believe the research gave us an opportunity to refresh work on something that had not quite been finished. Such incomplete and partial creations are not only a developer's nightmare but also a researcher's nuisance. As much as I will always champion new tools and ideas for disinformation research, I also feel that, as researchers, we need to do better at achieving synergy when working on new projects, by making use of or repurposing existing IP.