A master's thesis from Aalborg University


Using media sentiment to forecast daily stock returns

Author

Term

4th term

Education

Publication year

2020

Submitted on

Pages

30

Abstract


This thesis investigates whether sentiment extracted from text-based financial sources can improve forecasts of daily changes in the Dow Jones Industrial Average. It constructs daily sentiment indices from freely available news, forums, and social media (including the New York Times, Reddit, and Twitter) using a lexicon-based text-analytics pipeline in R (tokenization, stop-word removal, and counting words by lexicon category). Emphasis is placed on finance-specific dictionaries (Loughran & McDonald), complemented by general-purpose lexicons (bing and AFINN), and on aggregating large volumes of sparse text across sources and text fields to capture as much information as possible. The forecasting framework centers on vector autoregression (VAR), extended with dimensionality reduction and regularization via principal component analysis (PCA), partial least squares (PLS), and penalized regressions to address multicollinearity and overfitting. Predictive content is evaluated both in-sample and out-of-sample: Granger-causality tests assess information content, while the Clark–West and Pesaran–Timmermann tests benchmark forecast-error reduction and directional accuracy, respectively. The thesis also discusses the limitations of lexicon methods (e.g., negation and irony) and the differences between general-purpose and finance-specific lexicons. Concrete empirical findings are not included in this excerpt and are reported later in the thesis.
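To illustrate the lexicon-based scoring step the abstract describes (tokenize, drop stop words, count words per lexicon category), here is a minimal Python sketch. Note the thesis implements this in R with the Loughran–McDonald, bing, and AFINN lexicons; the tiny word list and function name below are hypothetical stand-ins, not the thesis code.

```python
import re

# Hypothetical mini-lexicon standing in for the Loughran-McDonald
# positive/negative categories used in the thesis.
LEXICON = {
    "gain": "positive", "growth": "positive", "profit": "positive",
    "loss": "negative", "decline": "negative", "risk": "negative",
}
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to"}

def sentiment_score(text: str) -> float:
    """Tokenize, remove stop words, count lexicon hits, and return
    (positive - negative) / total tokens as a daily sentiment value."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOPWORDS]
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if LEXICON.get(t) == "positive")
    neg = sum(1 for t in tokens if LEXICON.get(t) == "negative")
    return (pos - neg) / len(tokens)

print(sentiment_score("strong profit and growth offset the risk"))  # → 0.2
```

In the thesis, scores like this are computed per document and aggregated into one index per source per day, which is what handles the sparsity of individual texts.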

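The Pesaran–Timmermann test mentioned in the abstract checks whether a forecast's sign (up/down) predictions beat chance. The statistic has a closed form, sketched below in Python; this is an illustrative implementation of the published formula, not code from the thesis, and the sample inputs are made up.

```python
import math

def pesaran_timmermann(actual, forecast):
    """Pesaran-Timmermann (1992) directional-accuracy statistic.

    Asymptotically N(0,1) under the null that forecast signs are
    independent of actual signs (non-positive values count as 'down')."""
    n = len(actual)
    # Observed hit rate: fraction of days with matching signs.
    p_hat = sum(1 for a, f in zip(actual, forecast) if (a > 0) == (f > 0)) / n
    py = sum(1 for a in actual if a > 0) / n      # share of 'up' actuals
    px = sum(1 for f in forecast if f > 0) / n    # share of 'up' forecasts
    # Expected hit rate if forecasts were independent of outcomes.
    p_star = py * px + (1 - py) * (1 - px)
    var_p = p_star * (1 - p_star) / n
    var_star = ((2 * py - 1) ** 2 * px * (1 - px)
                + (2 * px - 1) ** 2 * py * (1 - py)
                + 4 * py * px * (1 - py) * (1 - px) / n) / n
    return (p_hat - p_star) / math.sqrt(var_p - var_star)

# Toy example: 7 of 8 sign predictions correct.
stat = pesaran_timmermann([1, -1, 1, 1, -1, 1, -1, 1],
                          [1, -1, 1, -1, -1, 1, -1, 1])
print(round(stat, 2))
```

A value above roughly 1.64 rejects the null of no directional predictability at the 5% level (one-sided), which is the sense in which the thesis uses the test to benchmark directional accuracy.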
[This summary has been generated with the help of AI directly from the project (PDF)]