Rash thoughts about .NET, C#, F# and Dynamics NAV.


"Every solution will only lead to new problems."

Thursday, 29. January 2009


n-dimensional k-means clustering with F#

Filed under: BioInformatik,English posts,F#,Informatik — Steffen Forkmann at 11:38 Uhr

The K-means-Algorithm is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. In this article I will show how we can implement this in F#.

First of all we define an interface for “clusterable” objects:

type IClusterable =

  abstract DimValues: float array with get

  abstract Dimensions: int

The next step is to define a distance function. We will use n-dimensional Euclidean distance here.

Euclidean distance

let calcDist (p1:IClusterable) (p2:IClusterable) =

  let sq x = x * x

  if p1.Dimensions <> p2.Dimensions then

    failwith "Cluster dimensions aren’t equal."

 

  p2.DimValues

    |> Array.fold2

        (fun acc x y -> x – y |> sq |> (+) acc)

        0. p1.DimValues

   |> sqrt 

Now we define a n-dimensional Centroid type. Our kMeans-Algorithm will minimize the squared distances to k centroids (or “means”).

type Centroid =

  {Values: float array;

   dimensions: int}

  member x.Print = printfn "%A" x.Values

  interface IClusterable with

    member x.DimValues = x.Values   

    member x.Dimensions = x.dimensions   

The centroid of a finite set of n-dimensional points x1, x2, …, xn is calculated as

image

let calcCentroid (items:’a list when ‘a :> IClusterable)

     (oldCentroid:Centroid) =

  let count = items.Length

  if count = 0 then oldCentroid else

  let mValues =    

    [|for d in 0..oldCentroid.dimensions-1 do

        let sum =

          items

            |> List.sumBy (fun item -> item.DimValues.[d])

 

        yield sum / (count |> float)|]

 

  { Values = mValues;

    dimensions = oldCentroid.dimensions}

We made a small modification – if we don’t have any items assigned to a cluster the centroid won’t change. This is important for the robustness of the algorithm.

Now we can calculate the squared errors to a centroid:

let calcError centroid (items:’a list when ‘a :> IClusterable) =

  let calcDiffToCentroid (item:’a) =

    centroid.Values

      |> Array.fold2

          (fun acc c i -> acc + (c-i)*(c-i))

          0. item.DimValues

 

  items

    |> List.sumBy calcDiffToCentroid 

For storing cluster information we create a cluster type:

type Cluster<‘a> when ‘a :> IClusterable =

  {Items: ‘a list;

   Count: int;

   Centroid: Centroid;

   Error: float}

 

  member x.Print =

     printfn "(Items: %d, Centroid: %A, Error: %.2f)"

       x.Count x.Centroid x.Error

 

  static member CreateCluster dimensions (item:’a) =

    let items = [item]

    let empty =

      { Values = [||];

        dimensions = dimensions}

 

    let centroid = calcCentroid items empty

    { Items = items;

      Count = 1;

      Centroid = centroid;

      Error = calcError centroid items}

 

  member x.EvolveCluster (items:’a list) =

    let l = items.Length

    let centroid = calcCentroid items x.Centroid

    {x with

      Items = items;

      Count = l;

      Centroid = centroid;

      Error = calcError centroid items} 

Now we need a function to initialize the clusters and a function to assign a items to the nearest cluster:

open System

 

let rand = new Random() 

 

let InitClusters (k:int) dimensions

      (items:’a array when ‘a :> IClusterable)  =

  let length = items.Length

  let getInitItem() = items.[rand.Next length]

  Array.init k (fun _ ->

                  getInitItem()

                    |> Cluster<‘a>.CreateCluster dimensions) 

 

let AssignItemsToClusters k dimensions (clusters:Cluster<‘a> array)

      (items:’a seq when ‘a :> IClusterable) =

  if k <= 0 then failwith "KMeans needs k > 0"  

  let findNearestCluster item =

    let minDist,nearest,lastPos =

      clusters

        |> Array.fold

            (fun (minDist,nearest,pos) cluster ->

               let distance = calcDist item (cluster.Centroid)

               if distance < minDist then

                 (distance,pos,pos+1)

               else

                 (minDist,nearest,pos+1))

            (Double.PositiveInfinity,0,0) 

    nearest

 

  let assigned =

    items

      |> Seq.map (fun item -> item,findNearestCluster item)

 

  let newClusters = Array.create k []  

  for item,nearest in assigned do        

    newClusters.[nearest] <- item::(newClusters.[nearest])

 

  clusters

    |> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])   

The last step is a function which calculates the squared error over all clusters:

let calcClusterError (clusters:Cluster<‘a> array) =

  clusters

    |> Array.sumBy (fun cluster -> cluster.Error)

Now it is an easy task to write the kMeans algorithm:

let kMeansClustering K dimensions epsilon

       (items:’a array when ‘a :> IClusterable) =

  let k = if K <= 0 then 1 else K

  let rec clustering lastError (clusters:Cluster<‘a> array) =

    let newClusters =

      AssignItemsToClusters k dimensions clusters items

    let newError = calcClusterError newClusters

 

    if abs(lastError – newError) > epsilon then       

      clustering newError newClusters

    else

      newClusters,newError   

 

  InitClusters k dimensions items

    |> clustering Double.PositiveInfinity 

We can test this algorithm with random 2-dimensional points:

type Point =

  {X:float; Y: float} 

  member x.Print = printfn "(%.2f,%.2f)" x.X x.Y

  interface IClusterable with

    member x.DimValues = [| x.X; x.Y |]

    member x.Dimensions = 2  

 

let point x y = {X = x; Y = y}

let getRandomCoordinate() = rand.NextDouble() – 0.5 |> (*) 10. 

let randPoint _ = point (getRandomCoordinate()) (getRandomCoordinate())    

let points n = Array.init n randPoint

let items = points 1000

 

printfn "Items:"

items |> Seq.iter (fun i -> i.Print)

 

let clusters,error = kMeansClustering 3 2 0.0001 items

printfn "\n"

clusters |> Seq.iter (fun c -> c.Print)   

printfn "Error: %A" error   

Next time I will show how we can simplify the code by using F#’s built-in type vector instead of array.

Tags: , , , ,

6 Comments »

  1. Wow, really cool post. I will put the code in an editor and play around with it. Have you thought about other machine learning constructs in F#? I think there is a lot of interesting and untouched territory.

    Comment by Chance Coble — Friday, 30. January 2009 um 4:11 Uhr

  2. Actually, as I look around your blog it looks like you have done quite a bit of this kind of work in F#. Very interesting stuff! Thanks for posting it.

    Comment by Chance Coble — Friday, 30. January 2009 um 4:19 Uhr

  3. Hi,

    I implemented a hierarical BottomUpClustering in F#. Maybe I will show this in one of my next posts.

    Regards,
    Steffen

    Comment by Steffen Forkmann — Friday, 30. January 2009 um 8:41 Uhr

  4. Steffen,

    I discovered your Cluster Analysis code in a web search. My biology work involves data analysis and I’ve coded Cluster Analysis routines in Fortran and another 4GL language. I wanted to run a timing comparison with your code, however I was receiving syntax errors.

    I changed:
    – invalid_arg to a printf
    – fold_left2 to fold2

    I still have a compile error on:

    {new Centroid with
    Values = mValues and
    dimensions = oldCentroid.dimensions}


    I’m using Visual Studio 2010 Beta and that might be the reason for the syntax change. If I figure it out, I’ll let you know. Although I proficient in Visual Studios, I’m new to F# and don’t full understand all the syntax.

    Good work,
    Don

    Comment by Donald — Monday, 15. February 2010 um 18:55 Uhr

  5. Hi Don,

    thanks for the feedback. I updated the code and it should now work with the February CTP.

    Regards,
    Steffen

    Comment by Steffen Forkmann — Monday, 15. February 2010 um 22:22 Uhr

  6. Very fine post. I believe F# can have a nice role in datamining etc..

    Comment by nicolas — Wednesday, 2. May 2012 um 9:18 Uhr

RSS feed for comments on this post. | TrackBack URI

Leave a comment

XHTML ( You can use these tags): <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> .