Third International Workshop :

Grid and Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed Database Systems (GLOBE'2006)


Invited talk : Professor Witold Litwin, CERIA, Université Paris Dauphine, France

Pattern Matching Using n-gram Sampling Of Cumulative Algebraic Signatures : Preliminary Results


Authors: Witold Litwin , Riad Mokadem, Philippe Rigaux & Thomas Schwarz

Abstract:

We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for the records stored once and searched many times in a database or a file, especially organized into a Scalable Distributed Data Structure, (SDDS), over a grid or a structured P2P net. We presume the records encoded into their cumulative algebraic signatures, e.g., for the incidental confidentiality of stored data. The search starts with the pre-processing of the pattern, calculating the logarithmic algebraic signature (LAS) of the pattern and the LASs of every n-gram in it. The value of  n  1 is a  parameter that one may tune. The search  attempts to match the LASs of n-grams in the pattern towards dynamically calculated LASs, sampled over n-grams in the records. A mismatch generates a shift of up to K-n symbols towards next sample, where K is the pattern length. The whole process is parallel over the SDDS servers and does not require any local decoding. For an M-symbol long record, the unsuccessful search, measured as number of match attempts, costs O ((M-K)  / (K-n+1)).  The 2-grams should typically suffice, leading to O ((M-K) / (K-1)). We show the algorithm particularly efficient for larger strings and records, i.e., with e-documents or DNA data. Preliminary results show then the n-gram search about (K - n + 1) faster than our previous algorithms and among the fastest known, e.g.,  probably often faster than Boyer-Moore.


Keywords: SDDS, Grid, Structured P2P, Scalable Distributed Pattern Matching, Algebraic Signatures.


hameur@irit.fr - Workshop Home Page
Last modified : 23/06/2006 AH