2014-11-19

## 如何用MapReduce的思路实现Kmeans

Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce[M]//Cloud Computing. Springer Berlin Heidelberg, 2009: 674-679.

1、首先从这n个对象中随机选择k个对象作为初始的k个簇的中心（就叫做“簇心”吧）；

2、然后将其余的对象分到最近的簇心，如此k个簇就出来了；

3、之后对每个簇，求新的簇心（基本方法是将属于该簇的所有点放在一起求平均值）；

4、重复2、3步（一轮2、3步，可以叫做一次迭代），直到所有的簇心基本不再变化，或者达到指定的重复次数；

5、和第2步相同。

### Map

**Input:** Global variable centers, the offset key, the sample value
**Output:** <key’, value’> pair, where the key’ is the index of the closest  center point and value’ is a string comprise of sample information

1. Construct the sample instance from value;
2. minDis = Double.MAX VALUE ;
3. index = -1;
4. For i=0 to centers.length do
dis= ComputeDist(instance, centers[i]);
If dis < minDis {
minDis = dis;
index = i;
}
5. End For
6. Take index as key’;
7. Construct value’ as a string comprise of the values of different dimensions;
8. output < key , value > pair;
9. End


### Combine

**Input:** key is the index of the cluster, V is the list of the samples assigned to the same cluster
**Output:** < key , value > pair, where the key’ is the index of the cluster, value’ is a string comprised of sum of the samples in the same cluster and the sample number

1. Initialize one array to record the sum of value of each dimensions of the samples contained in the same cluster, i.e. the samples in the list V ;
2. Initialize a counter num as 0 to record the sum of sample number in the same cluster;
3. while(V.hasNext()){
Construct the sample instance from V.next();
Add the values of different dimensions of instance to the array
num++;
4. }
5. Take key as key’;
6. Construct value’ as a string comprised of the sum values of different dimensions and num;
7. output < key , value > pair;
8. End


### Reduce

**Input: **key is the index of the cluster, V is the list of the partial sums from different host
**Output:** < key , value > pair, where the key’ is the index of the cluster, value’ is a string repre-
senting the new center
1. Initialize one array record the sum of value of each dimensions of the samples contained in the
same cluster, e.g. the samples in the list V ;
2. Initialize a counter NUM as 0 to record the sum of sample number in the same cluster;
3. while(V.hasNext()){
Construct the sample instance from V.next();
Add the values of different dimensions of instance to the array
NUM += num;
4. }
5. Divide the entries of the array by NUM to get the new center’s coordinates; // 求簇心
6. Take key as key’;
7. Construct value’ as a string comprise of the center ’s coordinates;
8. output < key , value > pair;
9. End


## 编程实现Kmeans

Kmeans包含了很多了顺序执行的MapReduce job，每次MapReduce job会产生新的簇心的信息并保存到HDFS中，这些新的簇心通过org.apache.hadoop.filecache.DistributedCache共享到下一次MapReduce job的Map Task中。