Storing and Indexing Text

<div class="MsoNormal">Document indexing</div>

Running text

<div class="MsoNormal">Conventional inverted indexes for text</div>

<div class="MsoNormal">Suffix trees and suffix arrays</div>

 
     For example, indexing every word in this sentence yields the 12 strings:
     example, indexing every word in this sentence yields the 12 strings:
     indexing every word in this sentence yields the 12 strings:
     every word in this sentence yields the 12 strings:
     word in this sentence yields the 12 strings:
     in this sentence yields the 12 strings:
     this sentence yields the 12 strings:
     sentence yields the 12 strings:
     yields the 12 strings:
     the 12 strings:
     12 strings:
     strings:
 

Text structure

<div class="MsoNormal">Pointer-based encoding</div>

<div class="MsoNormal">Interval encoding for structured text</div>

Storing and indexing semistructured data

Recall OEM structure

o        symbol table mapping entry point names to OIDs

o        for each sink node, store the data type and the atomic value

o        for each compound object, store set of <label, targetOID> pairs representing outgoing edges (e.g., keep a table mapping edge label to set of target nodes)

o        finds atomic values rapidly

o        one index per data type

o        complication: type conversions

(e.g., ... spy where number > 5)

o        too much space and update cost to duplicate the graph

o        for each node, store set of OIDs from which at least one edge leads to this node

o        need to verify that edges have correct labels

o        mapping each path starting with an entry point name to the set of OIDs for target nodes (N.B. this is the same information as DataGuide)

o        could keep just a subset (fixed length paths? commonly used paths?)

o        index of subpaths (which ones? storing what info?)

o        possible application of suffix arrays?

References and related reading

Gonnet92

Li96

Zhou98

Barbay06

Abiteboul97

McHugh97

Goldman97

McHugh98