AAU Student Projects - visit Aalborg University's student projects portal
A master's thesis from Aalborg University
Book cover


KALCHAS - a Dynamic Full-Text XML Search Engine

Authors

; ;

Term

4. term

Publication year

2005

Abstract

Denne afhandling præsenterer et dynamisk fuldtekstsindeks for XML-dokumenter, som understøtter enkel forespørgselssyntaks og hyppige opdateringer. XML er et træstruktureret format, så vi indekserer ikke kun ord, men også hvor hvert element befinder sig i træet, så vi kan returnere relevante dele af et dokument. Vi designer et kaskaderende indeks med tre lag af inverterede indeks (lister, der kortlægger termer til de dokumenter eller elementer, der indeholder dem): et hovedhukommelsesindeks for det aktive arbejdssæt, et diskbaseret B-træ, der opdateres inkrementelt for dokumenter, som netop er flyttet ud af hukommelsen, og et statisk diskbaseret B-træ for dokumenter, der har været uændrede i længere tid. Vi beskriver vedligeholdelsesstrategier, der effektivt migrerer data fra hukommelse til disk. For at holde indekset lille undersøger vi komprimeringsmetoder og sammenligner komprimering af hele inverterede lister med komprimering af individuelle poster. Vi foreslår også en specialtilpasset variabel bytelængde-kodning til at lagre Dewey-stier—hierarkiske identifikatorer, der beskriver hvert elements placering i XML-træet—mere effektivt. For at forbedre resultaternes kvalitet anvender vi Meet-operatoren til at filtrere og returnere de mest relevante XML-elementer, og vi forfiner vores tidligere Meet-algoritme for yderligere at øge relevansen. Empiriske tests viser, at det implementerede system yder rimeligt i det tilsigtede miljø.

This thesis presents a dynamic full-text index for XML documents that supports a simple query syntax and frequent updates. XML is a tree-structured format, so we index not only words but also where each element sits in the tree to return relevant parts of a document. We design a cascading index made of three layers of inverted indexes (lists that map terms to the documents or elements that contain them): a main-memory index for the active working set, a disk-based B-tree that is updated incrementally for documents recently moved out of memory, and a static disk-based B-tree for documents that have stayed unchanged for a long time. We describe maintenance strategies that efficiently migrate data from memory to disk. To keep the index small, we study compression methods and compare compressing entire inverted lists with compressing individual postings. We also propose a custom variable-byte encoding to store Dewey paths—hierarchical identifiers that describe each element’s position in the XML tree—more efficiently. To improve result quality, we use the Meet operator to filter search results and return the most relevant XML elements, and we refine our earlier Meet algorithm to further increase relevance. Empirical tests indicate that the implemented system performs reasonably in its intended environment.

[This abstract was generated with the help of AI]