- Similarity Search [Project Page]
- Large-scale Computation Platform
- Performance Improvement for Parallel and Distributed Systems
Large-scale All-pairs Similarity Search
- Scalable all-to-all similarity comparison with dissimilarity partitioning, load balancing, and runtime optimizations.
- Application benchmarks include web document search, tweet clustering, news grouping, and music recommendation.
- Techniques include a hybrid data structure with cache-conscious data layout and traversal, two-stage load balancing, and fast dissimilarity detection to reduce I/O and communication.
- Implemented in Java on the Apache Hadoop platform [Source Code], and in C++ with Hadoop Pipes.
- 10x to 25x faster than previous work. Memory hierarchy optimization adds a further 2.7x speedup.
Cache-Conscious Fast Ranking for Tree-based Ensembles
- A fast execution system that uses memory hierarchy optimization to rank documents with learning ensembles such as Gradient Boosting Regression Trees (GBRT).
- 2x to 6x faster than previous work, with no loss of ranking accuracy as measured by normalized discounted cumulative gain (NDCG).
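One common way to make ensemble scoring cache-conscious is to process a block of documents against a block of trees, so both stay in cache while they are reused. The sketch below shows only that loop structure and is an assumption about the general approach, not the project's actual code. Trees are stored as flat parallel lists, a leaf is marked by a feature index of -1, and the names `stump`, `score_naive`, and `score_blocked` as well as the block sizes are hypothetical. Python will not show the speedup itself; the point is that the blocked order produces the same scores as the straightforward one.

```python
def stump(feature, threshold, left_value, right_value):
    """Build a depth-1 regression tree stored as flat parallel lists."""
    return {
        "feature": [feature, -1, -1],    # -1 marks a leaf node
        "threshold": [threshold, 0.0, 0.0],
        "left": [1, -1, -1],             # child index when value <= threshold
        "right": [2, -1, -1],            # child index when value > threshold
        "value": [0.0, left_value, right_value],
    }


def score_tree(tree, doc):
    """Walk one tree from the root to a leaf and return the leaf value."""
    node = 0
    while tree["feature"][node] >= 0:
        if doc[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]


def score_naive(trees, docs):
    """Document-at-a-time scoring: every tree is revisited for each document."""
    return [sum(score_tree(t, d) for t in trees) for d in docs]


def score_blocked(trees, docs, doc_block=2, tree_block=2):
    """Score a block of documents against a block of trees at a time."""
    scores = [0.0] * len(docs)
    for d0 in range(0, len(docs), doc_block):
        d1 = min(d0 + doc_block, len(docs))
        for t0 in range(0, len(trees), tree_block):
            for tree in trees[t0:t0 + tree_block]:
                # The same tree is reused for every document in the block.
                for d in range(d0, d1):
                    scores[d] += score_tree(tree, docs[d])
    return scores
```

Because each document still accumulates the tree outputs in the same order, the blocked version returns the same scores, and therefore the same ranking, as the naive version. In a compiled implementation the block sizes would be tuned to the cache sizes of the target machine.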