An executive master's programme thesis from Aalborg University

Finding patterns in car reviews, using text mining techniques

Author

Term

4th term

Publication year

2018

Abstract

This thesis examines whether the words used in car reviews reveal consistent patterns linked to a car’s rating or its manufacturer. A dataset of 529 reviews was collected from whatcar.com via web scraping, each carrying a 1–5 rating. The text was cleaned to remove repeated headlines, advertisements, hyperlinks, and embedded script elements, then processed to extract nouns and adjectives, with stop words removed and stemming applied. Using R, the study built document–term representations and similarity graphs, and applied clustering methods (hierarchical, k-means, DBSCAN, OPTICS) and association mining to explore patterns across manufacturers and rating groups. Minor similarities were observed among reviews of cars from the same manufacturer and, to a limited extent, within the same rating, but these were weak and inconsistent. Clustering generally suggested few coherent groups, and association analyses did not support robust rules. The main finding is that the nouns and adjectives in a car review cannot be reliably correlated with the car’s rating or manufacturer in this dataset; the study also reflects on the suitability of different clustering algorithms and outlines directions for future work.
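
The document-term representation and similarity step described above can be illustrated with a minimal sketch. It is written in Python with only the standard library, not in R as the thesis was, and it is not the author's implementation. The three toy reviews, the short stop-word list, and the function names `tokenize`, `document_term_matrix`, and `cosine_similarity` are all assumptions made for illustration. Part-of-speech filtering (nouns and adjectives) and stemming are omitted to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical toy reviews; the 529 whatcar.com reviews are not reproduced here.
REVIEWS = {
    "review_a": "The cabin is quiet and the ride is comfortable.",
    "review_b": "A comfortable ride and a quiet, spacious cabin.",
    "review_c": "The engine is noisy and fuel economy is poor.",
}

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (sorted vocabulary, {doc_id: Counter of term counts})."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    return vocab, counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    vocab, counts = document_term_matrix(REVIEWS)
    print("vocabulary:", vocab)
    ids = sorted(counts)
    # Print the pairwise similarities that would weight the edges of a similarity graph.
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            sim = cosine_similarity(counts[left], counts[right])
            print(left, right, round(sim, 3))
```

Pairwise similarities of this kind are one plausible input to the clustering methods named in the abstract. Which similarity measure and weighting the thesis actually used is not stated here.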

[This abstract has been generated with the help of AI directly from the project full text]