May 20-22, 2010

Optimising Locality-Sensitive Hashing on Sequences in the Context of Motif Finding

Author: Warren Cheung

Download PDF

Abstract:
Locality-sensitive hashing has been applied in several problems in
bioinformatics to quickly search large sequences by examining the
n-grams of the sequences.  We observe in these application a bias in
the random generation of hashes and define an effective equivalence
for hashes that generate nearly identical results.  Methods that
address these two points demonstrate an improvement to the rate at
which unique hashes are generated, especially when many hashes are
generated.  These methods can be easily applied integrated with
existing applications of locality-sensitive hashing, resulting in
improved performance by reducing the time algorithms spend revisiting
duplicate hashes as well as eliminating biases potentially hindering
the solving of certain problem instances.