The K-means-Algorithm is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. In this article I will show how we can implement this in F#.
First of all we define an interface for “clusterable” objects:
type IClusterable =
abstract DimValues: float array with get
abstract Dimensions: int
The next step is to define a distance function. We will use n-dimensional Euclidean distance here.
let calcDist (p1:IClusterable) (p2:IClusterable) =
let sq x = x * x
if p1.Dimensions <> p2.Dimensions then
failwith "Cluster dimensions aren’t equal."
p2.DimValues
|> Array.fold2
(fun acc x y -> x – y |> sq |> (+) acc)
0. p1.DimValues
|> sqrt
Now we define a n-dimensional Centroid type. Our kMeans-Algorithm will minimize the squared distances to k centroids (or “means”).
type Centroid =
{Values: float array;
dimensions: int}
member x.Print = printfn "%A" x.Values
interface IClusterable with
member x.DimValues = x.Values
member x.Dimensions = x.dimensions
The centroid of a finite set of n-dimensional points x1, x2, …, xn is calculated as
let calcCentroid (items:’a list when ‘a :> IClusterable)
(oldCentroid:Centroid) =
let count = items.Length
if count = 0 then oldCentroid else
let mValues =
[|for d in 0..oldCentroid.dimensions-1 do
let sum =
items
|> List.sumBy (fun item -> item.DimValues.[d])
yield sum / (count |> float)|]
{ Values = mValues;
dimensions = oldCentroid.dimensions}
We made a small modification – if we don’t have any items assigned to a cluster the centroid won’t change. This is important for the robustness of the algorithm.
Now we can calculate the squared errors to a centroid:
let calcError centroid (items:’a list when ‘a :> IClusterable) =
let calcDiffToCentroid (item:’a) =
centroid.Values
|> Array.fold2
(fun acc c i -> acc + (c-i)*(c-i))
0. item.DimValues
items
|> List.sumBy calcDiffToCentroid
For storing cluster information we create a cluster type:
type Cluster<‘a> when ‘a :> IClusterable =
{Items: ‘a list;
Count: int;
Centroid: Centroid;
Error: float}
member x.Print =
printfn "(Items: %d, Centroid: %A, Error: %.2f)"
x.Count x.Centroid x.Error
static member CreateCluster dimensions (item:’a) =
let items = [item]
let empty =
{ Values = [||];
dimensions = dimensions}
let centroid = calcCentroid items empty
{ Items = items;
Count = 1;
Centroid = centroid;
Error = calcError centroid items}
member x.EvolveCluster (items:’a list) =
let l = items.Length
let centroid = calcCentroid items x.Centroid
{x with
Items = items;
Count = l;
Centroid = centroid;
Error = calcError centroid items}
Now we need a function to initialize the clusters and a function to assign a items to the nearest cluster:
open System
let rand = new Random()
let InitClusters (k:int) dimensions
(items:’a array when ‘a :> IClusterable) =
let length = items.Length
let getInitItem() = items.[rand.Next length]
Array.init k (fun _ ->
getInitItem()
|> Cluster<‘a>.CreateCluster dimensions)
let AssignItemsToClusters k dimensions (clusters:Cluster<‘a> array)
(items:’a seq when ‘a :> IClusterable) =
if k <= 0 then failwith "KMeans needs k > 0"
let findNearestCluster item =
let minDist,nearest,lastPos =
clusters
|> Array.fold
(fun (minDist,nearest,pos) cluster ->
let distance = calcDist item (cluster.Centroid)
if distance < minDist then
(distance,pos,pos+1)
else
(minDist,nearest,pos+1))
(Double.PositiveInfinity,0,0)
nearest
let assigned =
items
|> Seq.map (fun item -> item,findNearestCluster item)
let newClusters = Array.create k []
for item,nearest in assigned do
newClusters.[nearest] <- item::(newClusters.[nearest])
clusters
|> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])
The last step is a function which calculates the squared error over all clusters:
let calcClusterError (clusters:Cluster<‘a> array) =
clusters
|> Array.sumBy (fun cluster -> cluster.Error)
Now it is an easy task to write the kMeans algorithm:
let kMeansClustering K dimensions epsilon
(items:’a array when ‘a :> IClusterable) =
let k = if K <= 0 then 1 else K
let rec clustering lastError (clusters:Cluster<‘a> array) =
let newClusters =
AssignItemsToClusters k dimensions clusters items
let newError = calcClusterError newClusters
if abs(lastError – newError) > epsilon then
clustering newError newClusters
else
newClusters,newError
InitClusters k dimensions items
|> clustering Double.PositiveInfinity
We can test this algorithm with random 2-dimensional points:
type Point =
{X:float; Y: float}
member x.Print = printfn "(%.2f,%.2f)" x.X x.Y
interface IClusterable with
member x.DimValues = [| x.X; x.Y |]
member x.Dimensions = 2
let point x y = {X = x; Y = y}
let getRandomCoordinate() = rand.NextDouble() – 0.5 |> (*) 10.
let randPoint _ = point (getRandomCoordinate()) (getRandomCoordinate())
let points n = Array.init n randPoint
let items = points 1000
printfn "Items:"
items |> Seq.iter (fun i -> i.Print)
let clusters,error = kMeansClustering 3 2 0.0001 items
printfn "\n"
clusters |> Seq.iter (fun c -> c.Print)
printfn "Error: %A" error
Next time I will show how we can simplify the code by using F#’s built-in type vector instead of array.
Tags: Centroid, Clustering, F#, k-means, KMeans
Wow, really cool post. I will put the code in an editor and play around with it. Have you thought about other machine learning constructs in F#? I think there is a lot of interesting and untouched territory.
Comment by Chance Coble — Friday, 30. January 2009 um 4:11 Uhr
Actually, as I look around your blog it looks like you have done quite a bit of this kind of work in F#. Very interesting stuff! Thanks for posting it.
Comment by Chance Coble — Friday, 30. January 2009 um 4:19 Uhr
Hi,
I implemented a hierarical BottomUpClustering in F#. Maybe I will show this in one of my next posts.
Regards,
Steffen
Comment by Steffen Forkmann — Friday, 30. January 2009 um 8:41 Uhr
Steffen,
I discovered your Cluster Analysis code in a web search. My biology work involves data analysis and I’ve coded Cluster Analysis routines in Fortran and another 4GL language. I wanted to run a timing comparison with your code, however I was receiving syntax errors.
I changed:
– invalid_arg to a printf
– fold_left2 to fold2
I still have a compile error on:
{new Centroid with
Values = mValues and
dimensions = oldCentroid.dimensions}
I’m using Visual Studio 2010 Beta and that might be the reason for the syntax change. If I figure it out, I’ll let you know. Although I proficient in Visual Studios, I’m new to F# and don’t full understand all the syntax.
Good work,
Don
Comment by Donald — Monday, 15. February 2010 um 18:55 Uhr
Hi Don,
thanks for the feedback. I updated the code and it should now work with the February CTP.
Regards,
Steffen
Comment by Steffen Forkmann — Monday, 15. February 2010 um 22:22 Uhr
Very fine post. I believe F# can have a nice role in datamining etc..
Comment by nicolas — Wednesday, 2. May 2012 um 9:18 Uhr