Apache Arrow project uses regular expression matching to implement the LIKE
predicate by translating SQL pattern into a regular expression pattern:
let re_pattern = pat.replace("%", ".*").replace("_", ".");
While this approach works it has issues:
replace
function always allocates a new string, even ifpat
contains neither%
nor_
leading to wasted memory and allocation overhead- each
replace
invocation results in a full scan of thepat
resulting in unnecessary repeated work
It's a fairly common problem, so unsurprisingly it has a generic solution - function call fusion. Thanks to referential transparency this optimization has been available for Haskell developers. For example, map
fusion transforms a composition of map
s
map g . map f
is transformed into a map
of compositions:
map (g . f)
which is more efficient since we iterate over input list and produce the output list only once.
So what does it mean for the original problem with SQL pattern translation? To compare the efficiency, we'll fuse repeated replace
invocations in the original implementation
pub fn naive_sql_to_re(input: &str) -> String {
input.replace('%', ".*").replace('_', ".")
}
into a single function
pub fn sql_to_re(input: &str) -> String {
let mut output = String::with_capacity(input.len());
for c in input.chars() {
match c {
'%' => output.push_str(".*"),
'_' => output.push_str("."),
_ => output.push(c),
}
}
output
}
It's not the most efficient implementation but is sufficient to deliver significantly better performance for cases when there are and there are not any matches. Benchmarks
#[bench]
fn bench_naive_sql_to_re_no_match(b: &mut Bencher) {
b.iter(|| naive_sql_to_re("foobarbazz"))
}
#[bench]
fn bench_naive_sql_to_re_with_match(b: &mut Bencher) {
b.iter(|| naive_sql_to_re("foo%bar_bazz"))
}
#[bench]
fn bench_sql_to_re_no_match(b: &mut Bencher) {
b.iter(|| sql_to_re("foobarbazz"))
}
#[bench]
fn bench_sql_to_re_with_match(b: &mut Bencher) {
b.iter(|| sql_to_re("foo%bar_bazz"))
}
show almost 3X improvement for the case without matches and ~25% speed up for the case with matches even for a fairly short input strings, foobarbazz
and foo%bar_bazz
:
test tests::bench_naive_sql_to_re_no_match ... bench: 193 ns/iter (+/- 35)
test tests::bench_naive_sql_to_re_with_match ... bench: 277 ns/iter (+/- 40)
test tests::bench_sql_to_re_no_match ... bench: 71 ns/iter (+/- 11)
test tests::bench_sql_to_re_with_match ... bench: 205 ns/iter (+/- 30)
Until your compiler supports fuse optimization, it's an important and useful technique for improving performance and efficiency.