“Unveiling the Transformation of Online News Headlines: A Comparative Analysis”
This analysis draws on multiple datasets that together form a collection of news headlines from various outlets. It examines features such as headline length, sentiment, and syntactic structure to understand trends in journalism.
In our investigation, we combined several datasets to analyze how headlines changed over time. The BIG4 dataset contains headlines from well-known news outlets such as The New York Times and The Guardian, starting in the early 2000s and capturing the transition to online journalism. The News on the Web (NOW) corpus covers a broader range of news websites from 2010 onwards. For comparison, we also included a dataset of clickbait headlines and a dataset of scientific preprint titles.
Parts of the BIG4 corpus reach much further back: its New York Times headlines date to 1851, long before the shift to online platforms. The ABC Australia dataset posed some challenges because of discrepancies in headline matching. Despite these issues, we kept the dataset in our analyses for completeness.
The NOW corpus is drawn from a variety of English-language news websites and is continuously updated with new articles, so it reflects current media discourse. It offers a dynamic view of news over time, with many different outlets contributing to the dataset. The clickbait-style corpus and the corpus of scientific preprint titles served as benchmarks for the news headlines.
Using natural language processing techniques, we cleaned the headlines and extracted features from them, including sentiment and syntactic structure. To identify trends over time, we fitted linear regressions for continuous features and logistic regressions for binary features.
We categorized outlets by political leaning and journalistic quality using established media bias charts. The AllSides Media Bias Chart sorts outlets into left-leaning, center, and right-leaning, while the Ad Fontes Media Bias Chart rates journalistic quality on a green-yellow-red scale.
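One way to apply such a categorization is a simple lookup table that attaches both labels to each outlet, sketched below. The outlet names and ratings are placeholders, not values from either chart, and `OutletRating`, `RATINGS`, and `group_by` are hypothetical names introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OutletRating:
    leaning: str  # "left", "center", or "right" (AllSides-style bucket)
    quality: str  # "green", "yellow", or "red" (Ad Fontes-style bucket)


# Placeholder outlets and ratings: NOT real chart values.
RATINGS = {
    "outlet-a.example": OutletRating("left", "green"),
    "outlet-b.example": OutletRating("center", "green"),
    "outlet-c.example": OutletRating("right", "yellow"),
}


def group_by(records, attribute):
    """Group (outlet, headline) records by a rating attribute.

    Outlets without a rating are skipped rather than guessed.
    """
    groups = defaultdict(list)
    for outlet, headline in records:
        rating = RATINGS.get(outlet)
        if rating is not None:
            groups[getattr(rating, attribute)].append(headline)
    return dict(groups)


records = [
    ("outlet-a.example", "Council approves budget"),
    ("outlet-b.example", "Markets fall after rate decision"),
    ("outlet-c.example", "Storm hits coast"),
    ("unrated.example", "Local team wins final"),
]

by_leaning = group_by(records, "leaning")
by_quality = group_by(records, "quality")
print(by_leaning)
print(by_quality)
```

Skipping unrated outlets keeps every group restricted to outlets with an explicit label, so trends can be compared per group without mixing in unclassified sources.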
Our findings revealed a significant increase in headline length over time, which prompted us to explore how length correlates with other linguistic features. Because the causal relationships in the data-generating process are complex, we restrict ourselves to describing trends and make no causal claims.
We provide open access to our code and datasets for transparency and reproducibility in our analyses.
Published on: 2025-03-13