Searching

by Malcolm Phillips

A fundamental operation that comes up in computing, time and time again, is the problem of searching. More specifically, I mean searching a dataset for an arbitrary record, as opposed to searching for the one with the minimum or maximum key. This article is aimed at items which are stored in array, as opposed to a list, tree, trie, or graph etc.

Some techniques for searching are as follows, and I shall discuss them in this order:

bullet Linear sequential search
bullet Sentinel search
bullet Random searching
bullet Probability search
bullet Sorted sequential search
bullet Binary search
bullet Interpolation search
bullet Hopeful search
bullet Hashed search
bullet Reluctant Search

Linear sequential search

This is the most naive and simple of algorithms. All that is involved is examining each item in turn to see if it matches the search criteria. You either find a match at some point, or reach the end without finding a match after examining every item. On average, half of the items need to be examined, assuming the item is present. If it is not present, then it will always require N comparisons. The few advantages to such a technique are that:

No setup steps are required.

All items need not be accessible at once. The data can be 'streamed', e.g. from disk.

For most purposes, this method is too slow, except in cases where the number of items is less than about 10. Every additional item extends the average search time by the same amount. Lastly, it works just as well on linked-lists too!

Oh by the way, NOT_FOUND can simply be defined as -1.

[hide source code]

template <class T, class TKey>
int SequentialSearch(const T a[], int n, const TKey &find) {
    for (int i=0; i < n; ++i)
        if (a[i] == find)
            return i;
    return NOT_FOUND;
}

http://en.wikipedia.org/wiki/Linear_search

Sentinel search

One way of further optimising a simple sequential search is to place a copy of the item being searched for (a sentinel), at the end of the array so that it is always present. Depending on how the search is implemented, this can shave off a few CPU instructions. E.g. the for-loop test i < n as in sequential-search would no longer be required.
However it is not always feasible to modify the target array for this purpose, because you have to plan in advance to always have room for an extra item at the end. Therefore I've implemented it with a slight variation. I first check against the last item, and then remove it and replace it with the sentinel, then put it back afterwards. This will only pay off if the array has a large number elements, and they aren't expensive to copy. It also means that the array parameter cannot be const, even though the array ends up reverted back to how it was originally (assuming nothing throws an exception). Nevertheless, the code presented demonstrates the idea.

[hide source code]

template <class T, class TKey>
int SentinalSearch(T a[], int n, const TKey &find) {
    if (n == 0) return NOT_FOUND;
    --n;
    if (a[n] == find)
        return n;
    const T temp = a[n];
    a[n] = find;
    int i=0;
    while (!(a[i] == find)) ++i;
    a[n] = temp;
    return (i != n) ? i : NOT_FOUND;
}

http://www.delorie.com/gnu/docs/avl/libavl_26.html

Random searching

Random searching is not very feasible in most cases. It involves repeatedly picking a random item and examining it using the search criteria. This occasionally means faster search times, though it has some very serious potential problems if used incorrectly. For example, if no items match the search criteria, then the examination process may never finish. This can be alleviated in several ways.

1. Ensure that the item you are searching for is definitely present.
2. Relax the search criteria such that a high proportion of items match.
3. Keep track of which items you have already examined. This is the approach I've take below. Items that have been examined are moved to the end, and will not be included in future random picks.

There really aren't many places where this approach has any advantages at all, except perhaps in the field of AI, where you just want a quick match, not necessarily the best match. Computer opponents in games can be made easier/harder to beat, or at least less predictable, by using such an approach.

[hide source code]

template <class T, class TKey>
int RandomSearch(T a[], int n, const TKey &find, long seed=DEFAULT_SEED) {
    long holdrand = seed;
    while (n > 0) {
        unsigned int randNo = (((holdrand = holdrand * 214013L + 2531011L) >> 16) & 0x7fff);
        int pos = randNo % n;
        if (a[pos] == find)
            return pos;
        Swap(a[pos], a[n]);
        --n;
    }
    return NOT_FOUND;
}

Probability search

Probability searching, also known as a Self-Organising searching, seeks to make use of the fact that some items are searched for more frequently than others. The basic concept is that of caching, and this can be achieved in several ways.

When an item matches the search criteria, it is moved one step closer to the start of the array/list, or another way to do it is to move the item to the start of the array/list. The second of which is the principle used in splay trees. See http://en.wikipedia.org/wiki/Splay_tree Move-to-front is not really practical for an array, given the data moving required.

This rearranging idea improves upon the basic sequential search in that more frequent searches become faster. But it does not help in cases where the items are all looked up with equal probability. Some logic can be used to choose the best place to insert an item, such that items are not simply always appended or pre-pended, when they may be known to not have the lowest or highest probabilities respectively. Also, such a technique where items are moved around cannot be used where the order of the items is somehow significant and must not be altered

This also works about as well on linked-lists!

[hide source code]

template <class T, class TKey>
int ProbabilitySearch(T a[], int n, const TKey &find) {
    for (int i=0; i < n; ++i)
        if (a[i] == find) {
            if (i > 0) {
                Swap(a[i-1], a[i]);
                return i-1;
            }
            return i;
        }
    return NOT_FOUND;
}

http://www.cprogramming.com/discussionarticles/sorting_and_searching.html

Sorted sequential search

All of the methods discussed thus far have one major drawback. If no items match the search criteria, then all items will be examined. This can be alleviated, by sorting the records first. The algorithm then proceeds to sequentially search through the array, until an item is found whose key is not less than the one being searched for. If it is equal, you've found the item, and you can stop. Best of all though, if it is greater than the key you are searching for, then the search can be abandoned as you are assured that the rest of the array can not contain the item either.

This has the benefits of early termination, but at the cost of requiring a fairly expensive setup time to initially sort the array/list. This setup time can be rationalised by the cumulative payoffs that follow, with each search you do afterwards. The sorting time becomes less and less significant. For now we'll just assume that the records have already been pre-sorted in ascending order.

This method still requires examining roughly half of the items when a match is certain, and inserting additional items is no longer as simple as appending them.

[hide source code]

template <class T, class TKey>
int SortedSequentialSearch(const T a[], int n, const TKey &find) {
    for (int i=0; i < n; ++i)
        if (a[i] < find)
            continue;
        else if (find < a[i])
            break;
        else
            return i;
    return NOT_FOUND;
}

http://www.cprogramming.com/discussionarticles/sorting_and_searching.html

Binary search

Alright, now it's time we did something about having to search roughly half of the items don't you think? If we sorted the records and checked the middle value first, we could immediately eliminate half of the items to search through! In fact since that worked so well the first time, lets repeat that step on the half remaining, picking the item 1/4 or 3/4 of the way through the array and examining it (Another win). In fact the process can be repeated until the search space cannot be halved any longer. This is known as a binary search, and takes log (base 2) time with respect to the number of items N.

Let's look at a common implementation:

The midpoint m is, on each iteration, calculated half-way between the upper and lower bounds, and is placed into l or r accordingly (plus or minus one to skip the item m). Clearly this will halve the search space each time and give logarithmic running time.

This is the most common implementation of a binary search I've seen, but ironically it's not very efficient. Here's why:

First of all, a binary search acts just like searching a binary tree. Think of a perfectly balanced binary tree. The first level has 1 node (the root). The next level has 2 nodes (the roots direct children). The next level has 4 nodes (grandchildren), and the next has 8 nodes, and the next 16. Okay now say that's all we have in our tree. 31 nodes all up. If I pick a value at random, which level is it most likely to be on? - The bottommost level, of course! About half of the nodes (16 / 31 in fact) are on the bottommost level! 75% are on the last two levels etc. If we imagine larger trees, the same still holds. So our first test in the loop " if (a[m] == find)" is going to fail far more often than it succeeds. Sure it might be faster for those few times where the item you're looking for is examined fairly early on, but the chances of that are small, and get smaller and smaller as the number of items grows. Here's an idea, lets put that test last. It should now only be tested once each time the function is called:

That's much better. But hang on, we're just checked if it was less than or greater than, so why bother to check it it's equal. It has to be! The test can be removed altogether! Let's just make it an else.

Not bad. The same technique of eliminating half of the search space each time is the idea behind the binary tree structure and associated algorithms, but this also usually entails the use of tree-balancing algorithms. The optimisations above can just as easily be applied to a binary-tree search. Simply check whether it would be in the left sub-tree or not first, and then check if it would be in the right sub-tree, and if not, the item has been found.

However we're still often performing two comparisons each time through the loop. Were still handing the case where we find a match part way through the search, which as we know isn't very likely. It's far more likely that we'll go left or right. It would be nice if we could only do 1 comparison each time, because we don't have a language construct, or hardware instruction which can do 3-way branching, but we do have one which can perform two-way branching.

To do this requires merging the equals' case with the greater-than case (or less-than if you prefer). Then the test to see if we really found the item we were looking for can actually be shifted out of the loop.

I now provide what I believe is a near optimal binary search:

[hide source code]

template <class T, class TKey>
int BinarySearch(const T a[], int n, const TKey &find) {
    if (n > 0) {
        int l = 0, r = n-1;
        while (l < r) {
            int m = (l + r) / 2;
            if (a[m] < find)
                l = m + 1;
            else
                r = m;
        }
        if (a[l] == find)
            return l;
    }
    return NOT_FOUND;
}

We no longer exit the loop early, thus the algorithm now runs in Θ(logN) time, as opposed to having a lower bound of Ω(1) and an upper bound of O(logN) with a higher constant, so search times are more consistent. They are also faster on average because every time through the loop we eliminate half of the search space by performing only a single comparison instead of often multiple comparisons! The first version mentioned will perform 2 or 3 comparisons (average approx 2.5) every time through the loop, and a later version performed 1 or 2 (average approx 1.5) comparisons each time. Now we're down to 1. We've also removed the need to subtract 1 from m when assigning to r as an added bonus. Also, when there are duplicates in the array, we now guarantee that we'll end up finding the first occurrence of the value. So if you wanted to subsequently iterate over all the equivalent items, then you need only iterate forwards.

One thing to remember though is that the items have to be in sorted order first to use a binary search. So you may want to pick a good quick sorting algorithm to pre-sort the array. Pick something really slow and youll make a nice dent in the time savings of this technique.

Something else worth mentioning is that the calculation (l + r) / 2 will overflow when n reaches about half of the range of an int. That's about 1,073,741,823 items on a machine with 32-bit ints. This can be solved in several ways, such as changing the calculation to l + (r - l) / 2. I haven't done this above because the slightly more complex calculation is practically never necessary. Afterall, the only way such a large array would fit in memory on a 32-bit system is if it were an array of bytes, and such a case one would be better off storing the data in a more compact form (value and count, pairs). Consider making this change on a 64-bit system though, if you'll ever have more than 1,073,741,823 items in your array. See the second link below about this.

http://en.wikipedia.org/wiki/Binary_search
http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

Interpolation search

Another alternative to Binary search is to pick the items to examine more intelligently, by means of interpolation. For example if we have a high degree of certainty that the item is less than a third of the way through the array then picking the item 1/3 of the way through may allow us to eliminate two thirds of the array in 1 hit! Be weary though, as if we're wrong then we only end up eliminating one third which isn't so good. Nevertheless, interpolation allows us to 'home-in' on the approximate position of an item more quickly, provided there is an even distribution of keys. We have to be careful not to interpolate off the ends of the array however.

With the extra complexity and risk of eliminating less than a standard binary search, this method often fares about as well as a binary search.

[hide source code]

template <class T, class TKey>
int InterpSearch(const T a[], int n, const TKey &find) {
    if (n > 0) {
        int l = 0, r = n-1, m;
        while (l < r) {
            int range = a[r]-a[l];
            if (range > 0) {
                m = l + ((find-a[l])*(r-l)) / range;
                if (m < l)
                    m = l;
                else if (m >= r)
                    m = r - 1;
            } else
                m = l;
            if (a[m] < find)
                l = m + 1;
            else
                r = m;
        }
        if (a[l] == find)
            return l;
    }
    return NOT_FOUND;
}

http://en.wikipedia.org/wiki/Interpolation_search

Hopeful search

As we know, binary search is obviously faster than a linear search, but it does require a sorted array, and sorting that array does take quite a while. If the array is unsorted and we're only doing one search then a linear search is faster. But if we're doing lots of searches, then sorting the array first pays for itself in no time. Not long ago I had an interesting thought. Is there a way we take advantage of any presortedness in an array to get O(log n) search time and yet also manage to complete in O(n) time if the data is not sorted, and without actually sorting it? The answer it turns out is, yes...
The idea is based on a binary search, but done recursively, and if a recursive call fails to find the item, then you recurse on the other branch anyway, in case the reason the item was not found is that the array isn't actually sorted. The catch is that if the array was sorted but the item was not found then you still end up taking O(n) search time.
There are however coding scenarios where you know that the item will always be present, and in such a situation, this algorithm allows you to add new items to the end of the array at any time, and only sort them into the rest of the array when you want to. In the mean time searches will still find these new items even though it will take longer for those items, by using this algorithm. Okay it's a bit of a stretch to imagine the case where it would be useful, but you never know.

[hide source code]

template <class T, class TKey>
int HopefulSearchAux(const T a[], int n, const TKey &find) {
    if (n > 0) {
        const int m = n>>1; //midpoint
        if (a[m] < find) {
            const Size pos1 = HopefulSearchAux(&a[m+1], n-m-1, find);
            if (pos1 == NOT_FOUND) {
                const Size pos2 = HopefulSearchAux(a, m, find);
                if (pos2 == NOT_FOUND)
                    return NOT_FOUND;
                return pos2;
            }
            return m+1 + pos1;
        } else {
            const int pos1 = HopefulSearchAux(a, m, find);
            if (pos1 == NOT_FOUND) {
                const Size pos2 = HopefulSearchAux(&a[m+1], n-m-1, find);
                if (pos2 == NOT_FOUND)
                    return NOT_FOUND;
                return m+1 + pos2;
            }
            return pos1;
        }
    } else if (a[0] == find)
        return 0;
    else
        return NOT_FOUND;
}

template
<class T, class TKey>
int HopefulSearch(const T a[], int n, const TKey &find) {
    return HopefulSearchAux(a, n-1, find);
}

Hashed search

O(log(n)) searching time is not to be sneezed at, but can we do better? Well, yes I suppose we can, but not by performing comparisons.

You can't continually cut the amount of work by more than half each time by simply performing comparisons. To do any better we need to resort to hashing.

Hashing is a vast subject in itself. You can even do perfect hashing if you have plenty of setup time to spare. I won't go into any details here, but I'll just mention that my preferred hash table implementation uses a CRC algorithm for the hash, a power of two for the table size, and the separate-chaining method to further reduce collisions, and so that items can be easily removed. FNV-1a is also pretty fast and becoming rather popular.

Hashing basically is a means of taking you straight to (or pretty close to) the item you are looking for, by calculating exactly where it should be located. Typically only a couple of comparisons are required for each search. However a hash table can suffer from collisions, which in great enough numbers can degrade performance significantly. This would normally be made extremely unlikely though.

In rare cases a binary tree can still beat a good hash-table. For example, long text strings for keys. A hash table would probably calculate a hash over the whole string, whereas a binary tree would most often be able to branch on the first character or few thereafter. It depends on how different the string prefixes are. If the strings usually differ by the first letter, then a strcmp would be very fast.

http://en.wikipedia.org/wiki/Hash_table
http://burtleburtle.net/bob/hash/examhash.html

Reluctant search

This may seem a bit odd, but some people out there actually sought to find the most innocently looking but horrendously inefficient searching algorithm possible. Theyve done pretty well in their invention of Reluctant-Search, even though it apparently only examines each item once. Like a binary search, it requires that the contents first be sorted. Heres my implementation of this abomination:

[hide source code]

template <class T, class TKey>
int ReluctantSearchAux(const T a[], int i, int j, const TKey &find) {
    if (i > j) return NOT_FOUND;
    if (i == j) return (a[i] == find) ? i : NOT_FOUND;
    int m = (i + j) >> 1;
    if (a[m] < find) {
        int k = ReluctantSearchAux(a, i, m, find);
        return (k == NOT_FOUND) ? ReluctantSearchAux(a, m+1, j, find) : k;
    } else {
        int k = ReluctantSearchAux(a, m+1, j, find);
        return (k == NOT_FOUND) ? ReluctantSearchAux(a, i, m, find) : k;
    }
}

template
<class T, class TKey>
int ReluctantSearch(const T a[], int n, const TKey &find) {
    return ReluctantSearchAux(a, 0, n-1, find);
}

That doesn't look too bad now does it? Looks can be (very) deceiving - It is very very slow!
So hey, if youve ever written a hideously inefficient algorithm in the past, it would be a surely outstanding achievement for it to run as slow as Reluctant Search.

http://mipmip.org/tidbits/pasa.pdf
http://www.cs.uiuc.edu/class/fa05/cs473ug/resources/pessimal-algorithms.pdf

Recommendations

Well that about wraps it up. If you feel Ive left anything important and relevant out, please let me know. In conclusion, for data that is already kept sorted, perhaps for other reasons, a binary search is probably the way to go. On the other hand, if you don't care about items being in order, and are doing lots of searches, or you have a bit of spare memory anyway, a hash-table can give the best performance. However, if you only ever have a small number of items (e.g. 10), just pick the simplest approach of a sequential search.