Daily Archives: May 26, 2015

Weighted Reservoir Sampling

Based on last Reservoir Sampling problem, I will develop it a little bit more. What if each element has a weight. Let’s say we have below elements. The struture is: [id, weight]:
e0=[0, 2], e1=[1, 1], e2=[2, 1], e3=[3, 1], e4=[4, 1].

And we want to sample 2 of them. Let’s compute the possibilite that e0 will be sampled. Out of 5 elements, the possibility that first time we choose e0 is 2/(2+1+1+1+1) = 2/6. And we have chance that the first time we didn’t choose e0, it is 4/6. And the possibility that we choose e0 at 2nd time is (4/6)*(2/5). So the total possibility we choose e0 is 2/6 + (4/6)*(2/5).

Above should be the result of the sampling. But how do we handle this problem with weight?  Actually, this is a weighted reservoir sampling problem which is described in this paper http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf.

The idea is that we compute a value ki of element i, by combining weight wi and a random value ui. The ki will be between (0,…,1):

And we read from stream, keep top m ki value of element by using heap.

package com.pli.project.algorithm.probability;

import java.util.*;

/**
 * Created by lipeng on 2015/5/25.
 */
public class WeightedReservoirSampling {

    public static int[] weightedReservoirSampling(int[][] items, int m){
        if(items==null || m > items.length){
            System.out.println("incorrect input.");
            return null;
        }
        PriorityQueue<WrsElement> heap = new PriorityQueue<WrsElement>(10);
        /** Transform weight into a double between (0,1). Put it in min heap. **/
        for(int i=0; i < items.length ; i++){
            double key = Math.pow((Math.random()), 1/(double)items[i][1]);
            WrsElement element = new WrsElement(items[i][0], key);
            heap.add(element);
            if(heap.size() > m){
                heap.poll();
            }
        }
        /** Output result. **/
        int[] result = new int[m];
        for(int i=0;i<m; i++){
            result[i] = heap.poll().value;
        }
        return result;
    }

    static class WrsElement implements Comparable<WrsElement>{

        int value;
        double key;

        public WrsElement(int value, double key){
            this.value = value;
            this.key = key;
        }

        public int compareTo(WrsElement wrsElement) {
            return Double.compare(this.key, wrsElement.key);
        }

        @Override
        public String toString() {
            return "WrsElement{" +
                    "value=" + value +
                    ", key=" + key +
                    '}';
        }
    }

    public static void main(String[] args) {
        /** each item = {id, weight} **/
        int[][] items = {
                {1, 1}, {2, 5}, {3, 20}, {4, 5}, {5, 10}, {6, 3}, {7, 3}, {8, 3}
        };
        int[] result = weightedReservoirSampling(items, 3);
        System.out.println(Arrays.toString(result));
    }

    public static void test(){
        int[][] items = {{0, 2}, {1, 1}, {2, 1}, {3, 1}, {4, 1} };
        int m = 2;
        int[] probabilities = new int[items.length];
        for(int i=0; i< 1000; i++) {
            int[] result = weightedReservoirSampling(items, m);
            for(int j = 0; j < m; j++){
                probabilities[result[j]]++;
            }
        }
        System.out.println(Arrays.toString(probabilities));
    }
}

Here are some other referene to learn:
http://gregable.com/2007/10/reservoir-sampling.html
http://stackoverflow.com/questions/17117872/is-there-an-algorithm-for-weighted-reservoir-sampling

Reservoir Sampling

I have a post about how to return a value by it’s weight by inputing a stream. http://allenlipeng47.com/PersonalPage/index/view/129/nkey

Based on that, I will develop this to another problem: Reservoir Sampling.

The probelm of Reservoir Sampling is like this. There are elements inputted by stream, which lenght is n. How can we uniformly choose m out of n elements. We assume n is too large to fit in memory.

The solution is similar to the previous post. We define an ResultArray of length m. At beginning, we push [0,…,m-1] elements to ResultArray. Then loop element from [m,…,n-1]. For element i, we generate a random number with range [0,…,i-1]. If random number is less than m, then we put element i to ResultArray[i]. Otherwise, we continue the loop.

package com.pli.project.algorithm.probability;

import java.util.Arrays;

/**
 * Created by lipeng on 2015/5/25.
 */
public class ReservoirSampling {

    /**
     * Suppose arr is a very large array, can't be stored in memory.
     * Randomly select m element among arr.
     * @param arr, the input array. Very large, can't be stored in memory.
     * @param m, the size of reservoir
     * @return, the sampling result.
     */
    public static int[] reservoirSampling(int[] arr, int m){
        if(arr==null || m > arr.length){
            return null;
        }
        int[] result = Arrays.copyOf(arr, m);
        for(int i = m; i < arr.length; i++){
            int randPos = (int)(Math.random() * (i + 1));
            if(randPos < m){
                result[randPos] = arr[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] arr = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] result = reservoirSampling(arr, 3);
        System.out.println(Arrays.toString(result));
    }

}