Third International Workshop :
Grid and
Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed
Database Systems
(GLOBE'2006)
Invited talk : Professor
Witold Litwin, CERIA,
Université Paris Dauphine, France
Pattern Matching Using n-gram Sampling
Of Cumulative Algebraic Signatures : Preliminary Results
Authors: Witold Litwin , Riad
Mokadem, Philippe Rigaux & Thomas Schwarz
Abstract:
We propose a novel string (pattern)
matching algorithm called n-gram
search. We intend it for the records stored once and searched many
times in a database or a file, especially organized into a Scalable
Distributed Data Structure, (SDDS), over a grid or a structured P2P
net. We presume the records encoded into their cumulative algebraic
signatures, e.g., for the incidental confidentiality of stored data.
The search starts with the pre-processing of the pattern, calculating
the logarithmic algebraic signature (LAS) of the pattern and the LASs
of every n-gram in it. The value of n 1 is a
parameter that one may tune. The search attempts to match the
LASs of n-grams in the pattern towards dynamically calculated LASs,
sampled over n-grams in the records. A mismatch generates a shift of up
to K-n symbols towards next sample, where K is the pattern length. The
whole process is parallel over the SDDS servers and does not require
any local decoding. For an M-symbol long record, the unsuccessful
search, measured as number of match attempts, costs O ((M-K) /
(K-n+1)). The 2-grams should typically suffice, leading to O
((M-K) / (K-1)). We show the algorithm particularly efficient for
larger strings and records, i.e., with e-documents or DNA data.
Preliminary results show then the n-gram search about (K - n + 1)
faster than our previous algorithms and among the fastest known,
e.g., probably often faster than Boyer-Moore.
Keywords: SDDS, Grid,
Structured P2P, Scalable Distributed Pattern Matching, Algebraic
Signatures.
hameur@irit.fr - Workshop Home Page
Last modified : 23/06/2006 AH