BioInformatik » Rash thoughts about .NET, C#, F# and Dynamics NAV.

Saturday, 31. January 2009

Jan 31

n-dimensional k-means clustering with F# – part II

Filed under: BioInformatik,English posts,F#,Informatik — Steffen Forkmann at 11:28 Uhr

In the last post I show how we can calculate clusters of n-dimensional Objects with the K-means-Algorithm in F#. This time I will simplify the code by using the built-in type vector.

First of all we redefine the interface for “clusterable” objects:

type IClusterable =
  abstract DimValues: vector with get

This time we do not need any dimension information – the vector type will care about this. Now we can reduce our distance function to:

let calcDist (p1:vector) (p2:vector) = (p1 - p2).Norm

The function Norm calculates the 2-Norm of a vector.

2-Norm ==> Eulidean distance

So calcDist will still compute the n-dimensional Euclidean distance.

The next step is the definition of a n-dimensional Centroid type. This time we will only use a type alias of vector:

type Centroid = vector

The calculation of the centroid and the centroid error can be written as:

let calcCentroid (items:'a list when 'a :> IClusterable)
     (oldCentroid:Centroid) =
  let count = items.Length
  if count = 0 then oldCentroid else  
  let sum =
    items
      |> List.fold_left 
        (fun vec item -> vec + item.DimValues) // vector addition
        (nullVector oldCentroid)
   
  sum * (1./(count |> float))  // multiply with scalar
     

let calcError centroid (items:'a list when 'a :> IClusterable) =
  let sq x = x * x

  items
    |> List.sum_by (fun i -> calcDist centroid i.DimValues |> sq)

The calcError function is straight forward but not perfect in terms of performance. As we use the calcDist function and hence the 2-Norm the function will first calculate a square root and then squares the result.

The rest of the code is pretty much same as in the last post:

/// creates n-dimensional 0 - vector
let nullVector (v:vector) = v * 0.
    
type Cluster<'a> when 'a :> IClusterable =
  {Items: 'a list;
   Count: int;
   Centroid: Centroid;
   Error: float}

  member x.Print =
     printfn "(Items: %d, Centroid: %A, Error: %.2f)"
       x.Count x.Centroid x.Error

  static member CreateCluster (item:'a) =
    let items = [item]
    let empty = nullVector item.DimValues

    let centroid = calcCentroid items empty
    {new Cluster<'a> with
      Items = items and
      Count = 1 and
      Centroid = centroid and
      Error = calcError centroid items}

  member x.EvolveCluster (items:'a list) =
    let l = items.Length
    let centroid = calcCentroid items x.Centroid
    {x with
      Items = items;
      Count = l;
      Centroid = centroid;
      Error = calcError centroid items}

let rand = new Random()  

let InitClusters (k:int)
      (items:'a array when 'a :> IClusterable)  =
  let length = items.Length
  let getInitItem() = items.[rand.Next length]
  Array.init k (fun x ->
                  getInitItem()
                    |> Cluster<'a>.CreateCluster)
                    
                    
let AssignItemsToClusters k (clusters:Cluster<'a> array)
      (items:'a seq when 'a :> IClusterable) =
  if k <= 0 then failwith "KMeans needs k > 0"
  let findNearestCluster (item:'a) =
    let minDist,nearest,lastPos =
      clusters
        |> Array.fold_left
            (fun (minDist,nearest,pos) cluster ->
               let distance = calcDist item.DimValues (cluster.Centroid)
               if distance < minDist then
                 (distance,pos,pos+1)
               else
                 (minDist,nearest,pos+1))
            (Double.PositiveInfinity,0,0)
    nearest

  let assigned =
    items
      |> Seq.map (fun item -> item,findNearestCluster item)

  let newClusters = Array.create k []
  for item,nearest in assigned do
    newClusters.[nearest] <- item::(newClusters.[nearest])

  clusters
    |> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])  

    
let calcClusterError (clusters:Cluster<'a> array) =
  clusters
    |> Array.fold_left (fun acc cluster -> acc + cluster.Error) 0.  
                      
let kMeansClustering K epsilon
       (items:'a array when 'a :> IClusterable) =
  let k = if K <= 0 then 1 else K
  let rec clustering lastError (clusters:Cluster<'a> array) =
    let newClusters =
      AssignItemsToClusters k clusters items
    let newError = calcClusterError newClusters

    if abs(lastError - newError) > epsilon then
      clustering newError newClusters
    else
      newClusters,newError    

  InitClusters k items
    |> clustering Double.PositiveInfinity

This vector approach shortens the code for the kMeans algorithm, but has a little longer runtime. I am interested in any suggestions.

Tags: Centroid, Clustering, k-means, KMeans, vector

Feedback:

Comments (0)

Thursday, 29. January 2009

Jan 29

n-dimensional k-means clustering with F#

Filed under: BioInformatik,English posts,F#,Informatik — Steffen Forkmann at 11:38 Uhr

The K-means-Algorithm is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. In this article I will show how we can implement this in F#.

First of all we define an interface for “clusterable” objects:

type IClusterable =
  abstract DimValues: float array with get 
  abstract Dimensions: int

The next step is to define a distance function. We will use n-dimensional Euclidean distance here.

let calcDist (p1:IClusterable) (p2:IClusterable) =
  let sq x = x * x
  if p1.Dimensions <> p2.Dimensions then
    failwith "Cluster dimensions aren’t equal."
 
  p2.DimValues
    |> Array.fold2
        (fun acc x y -> x – y |> sq |> (+) acc) 
        0. p1.DimValues 
   |> sqrt  

Now we define a n-dimensional Centroid type. Our kMeans-Algorithm will minimize the squared distances to k centroids (or “means”).

type Centroid =
  {Values: float array;
   dimensions: int}
  member x.Print = printfn "%A" x.Values 
  interface IClusterable with
    member x.DimValues = x.Values    
    member x.Dimensions = x.dimensions    

The centroid of a finite set of n-dimensional points x₁, x₂, …, x_n is calculated as

let calcCentroid (items:’a list when ‘a :> IClusterable)
     (oldCentroid:Centroid) =
  let count = items.Length
  if count = 0 then oldCentroid else
  let mValues =     
    [|for d in 0..oldCentroid.dimensions-1 do
        let sum =
          items 
            |> List.sumBy (fun item -> item.DimValues.[d])
 
        yield sum / (count |> float)|]
 
  { Values = mValues;
    dimensions = oldCentroid.dimensions} 

We made a small modification – if we don’t have any items assigned to a cluster the centroid won’t change. This is important for the robustness of the algorithm.

Now we can calculate the squared errors to a centroid:

let calcError centroid (items:’a list when ‘a :> IClusterable) = 
  let calcDiffToCentroid (item:’a) =
    centroid.Values
      |> Array.fold2 
          (fun acc c i -> acc + (c-i)*(c-i)) 
          0. item.DimValues
 
  items
    |> List.sumBy calcDiffToCentroid  

For storing cluster information we create a cluster type:

type Cluster<‘a> when ‘a :> IClusterable =
  {Items: ‘a list;
   Count: int;
   Centroid: Centroid;
   Error: float}
 
  member x.Print = 
     printfn "(Items: %d, Centroid: %A, Error: %.2f)" 
       x.Count x.Centroid x.Error
 
  static member CreateCluster dimensions (item:’a) =
    let items = [item]
    let empty = 
      { Values = [||];
        dimensions = dimensions}
 
    let centroid = calcCentroid items empty
    { Items = items;
      Count = 1;
      Centroid = centroid;
      Error = calcError centroid items}
 
  member x.EvolveCluster (items:’a list) = 
    let l = items.Length
    let centroid = calcCentroid items x.Centroid
    {x with 
      Items = items;
      Count = l;
      Centroid = centroid;
      Error = calcError centroid items}  

Now we need a function to initialize the clusters and a function to assign a items to the nearest cluster:

open System
 
let rand = new Random()  
 
let InitClusters (k:int) dimensions 
      (items:’a array when ‘a :> IClusterable)  =
  let length = items.Length
  let getInitItem() = items.[rand.Next length]
  Array.init k (fun _ -> 
                  getInitItem() 
                    |> Cluster<‘a>.CreateCluster dimensions)  
 
let AssignItemsToClusters k dimensions (clusters:Cluster<‘a> array)
      (items:’a seq when ‘a :> IClusterable) = 
  if k <= 0 then failwith "KMeans needs k > 0"   
  let findNearestCluster item =
    let minDist,nearest,lastPos =
      clusters
        |> Array.fold
            (fun (minDist,nearest,pos) cluster ->
               let distance = calcDist item (cluster.Centroid)
               if distance < minDist then 
                 (distance,pos,pos+1) 
               else 
                 (minDist,nearest,pos+1))
            (Double.PositiveInfinity,0,0)  
    nearest
 
  let assigned =
    items 
      |> Seq.map (fun item -> item,findNearestCluster item) 
 
  let newClusters = Array.create k []   
  for item,nearest in assigned do         
    newClusters.[nearest] <- item::(newClusters.[nearest])
 
  clusters
    |> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])    

The last step is a function which calculates the squared error over all clusters:

let calcClusterError (clusters:Cluster<‘a> array) =
  clusters
    |> Array.sumBy (fun cluster -> cluster.Error)

Now it is an easy task to write the kMeans algorithm:

let kMeansClustering K dimensions epsilon 
       (items:’a array when ‘a :> IClusterable) =
  let k = if K <= 0 then 1 else K
  let rec clustering lastError (clusters:Cluster<‘a> array) =
    let newClusters = 
      AssignItemsToClusters k dimensions clusters items
    let newError = calcClusterError newClusters
 
    if abs(lastError – newError) > epsilon then        
      clustering newError newClusters
    else
      newClusters,newError    
 
  InitClusters k dimensions items
    |> clustering Double.PositiveInfinity  

We can test this algorithm with random 2-dimensional points:

type Point =
  {X:float; Y: float}  
  member x.Print = printfn "(%.2f,%.2f)" x.X x.Y
  interface IClusterable with
    member x.DimValues = [| x.X; x.Y |]
    member x.Dimensions = 2   
 
let point x y = {X = x; Y = y}
let getRandomCoordinate() = rand.NextDouble() – 0.5 |> (*) 10.  
let randPoint _ = point (getRandomCoordinate()) (getRandomCoordinate())     
let points n = Array.init n randPoint
let items = points 1000
 
printfn "Items:"
items |> Seq.iter (fun i -> i.Print)
 
let clusters,error = kMeansClustering 3 2 0.0001 items
printfn "\n"
clusters |> Seq.iter (fun c -> c.Print)    
printfn "Error: %A" error    

Next time I will show how we can simplify the code by using F#’s built-in type vector instead of array.

https://edpillsbelgium.com

Tags: Centroid, Clustering, F#, k-means, KMeans

Feedback:

Comments (6)

Saturday, 1. November 2008

Nov 01

Damerau-Levenshtein-Distance in F# – part III – O(m+n) space and functional style

Filed under: BioInformatik,F#,Informatik,Mathematik — Steffen Forkmann at 16:31 Uhr

In the first part of this series I showed a naïve algorithm for the Damerau-Levenshtein distance which needs O(m*n) space. In the last post I improved the algorithm to use only O(m+n) space. This time I will show a more functional implementation which uses only immutable F#-Lists and works still in O(m+n) space. This version doesn’t need any mutable data.

/// Calcs the damerau levenshtein distance.    
let calcDL (a:'a array) (b: 'a array) =       
  let n = a.Length + 1
  let m = b.Length + 1
  
  let processCell i j act l1 l2 ll1 =
    let cost = 
      if a.[i-1] = b.[j-1] then 0 else 1
    let deletion = l2 + 1
    let insertion = act + 1
    let substitution = l1 + cost
    let min1 =  
      deletion 
      |> min insertion 
      |> min substitution

    if i > 1 && j > 1 &&
      a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
        min min1 <| ll1 + cost
    else
      min1
  
  let processLine i lastL lastLastL =
    let processNext (actL,lastL,lastLastL) j =
      match actL with 
        | act::actRest -> 
          match lastL with
            | l1::l2::lastRest -> 
              if i > 1 && j > 1 then
                match lastLastL with
                  | ll1::lastLastRest -> 
                    (processCell i j act l1 l2 ll1 :: actL,
                     l2::lastRest,
                     lastLastRest)
                  | _ -> failwith "can't be"
              else
                (processCell i j act l1 l2 0 :: actL,
                 l2::lastRest,
                 lastLastL)                 
            | _ -> failwith "can't be"
        | [] -> failwith "can't be"
      
    let (act,last,lastLast) =
      [1..b.Length]
        |> List.fold_left processNext ([i],lastL,lastLastL)
    act |> List.rev
    
  let (lastLine,lastLastLine) =               
    [1..a.Length]
      |> List.fold_left
          (fun (lastL,lastLastL) i -> 
             (processLine i lastL lastLastL,lastL))
          ([0..m-1],[])
              
  lastLine.[b.Length]    
 
let damerauLevenshtein(a:'a array) (b:'a array) =
  if a.Length > b.Length then
    calcDL a b
  else
    calcDL b a

I admit the code is still a little messy but it works fine. Maybe I will find the time to cleanup a bit and post a final version.

Tags: alignment, Damerau, damerau-levenshtein, dna, dynamic programming, edit distance, F#, Levenshtein algorithm, Levenshtein distance

Feedback:

Comments (1)

Nov 01

Damerau-Levenshtein-Distance in F# – part II – O(m+n) space

Filed under: BioInformatik,F#,Informatik,Mathematik — Steffen Forkmann at 14:40 Uhr

Last time I showed a naïve implementation of the Damerau-Levenshtein-Distance in F# that needs O(m*n) space. This is really bad if we want to compute the edit distance of large sequences (e.g. DNA sequences). If we look at the algorithm we can easily see that only the last two lines of the (n*m)-matrix are used. This observation leads to a improvement where we compute the distance with only 3 additional arrays of size min(n,m).

/// Calcs the damerau levenshtein distance.    
let calcDL (a:'a array) (b: 'a array) =       
  let n = a.Length + 1
  let m = b.Length + 1
  let lastLine = ref (Array.init m (fun i -> i))
  let lastLastLine = ref (Array.create m 0)
  let actLine = ref (Array.create m 0)
    
  for i in [1..a.Length] do
    (!actLine).[0] <- i      
    for j in [1..b.Length] do          
      let cost = 
        if a.[i-1] = b.[j-1] then 0 else 1
      let deletion = (!lastLine).[j] + 1
      let insertion = (!actLine).[j-1] + 1
      let substitution = (!lastLine).[j-1] + cost
      (!actLine).[j] <- 
        deletion 
        |> min insertion 
        |> min substitution

      if i > 1 && j > 1 then
        if a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
          let transposition = (!lastLastLine).[j-2] + cost  
          (!actLine).[j] <- min (!actLine).[j] transposition
    
    // swap lines
    let temp = !lastLastLine
    lastLastLine := !lastLine
    lastLine := !actLine
    actLine := temp
            
  (!lastLine).[b.Length]

 
let damerauLevenshtein(a:'a array) (b:'a array) =
  if a.Length > b.Length then
    calcDL a b
  else
    calcDL b a

This version of the algorithm needs only O(n+m) space but is not really "functional" style. I will show a more "F#-stylish" version in part III.

Tags: alignment, Damerau, damerau-levenshtein, dna, dynamic programming, edit distance, F#, Levenshtein algorithm, Levenshtein distance

Feedback:

Comments (0)

Friday, 31. October 2008

Oct 31

Damerau-Levenshtein-Distance in F# – part I

Filed under: BioInformatik,F#,Informatik — Steffen Forkmann at 17:12 Uhr

Today I am publishing an algorithm for calculating the Damerau-Levenshtein distance in F#. The Levenshtein distance is a metric that allows to measure the amount of difference between two sequences and shows how many edit operations (insert, delete, substitution) are needed to transform one sequence into the other. The Damerau-Levenshtein distance allows the transposition of two characters as an operation. It is often used for spelling corrections or to measure the variation (“edit distance”) between DNA sequences.

let damerauLevenshtein(a:'a array) (b:'a array) =

  let init i j =

    if j = 0 then i

    elif i = 0 then j else 0

  let n = a.Length + 1

  let m = b.Length + 1

  let d = Array2.init n m init

  for i in [1..a.Length] do

    for j in [1..b.Length] do

      let cost =

        if a.[i-1] = b.[j-1] then 0 else 1

      let deletion = d.[i-1, j] + 1

      let insertion = d.[i,j-1] + 1

      let substitution = d.[i-1,j-1] + cost

      d.[i, j] <-

        deletion

        |> min insertion

        |> min substitution

      if i > 1 && j > 1 && a.[i-1] = b.[j-2] &&

           a.[i-2] = b.[j-1] then

        let transposition = d.[i-2,j-2] + cost

        d.[i, j] <- min d.[i,j] transposition

  d.[a.Length, b.Length]

This naïve implementation needs quadratic space (O(m*n)). Since the algorithm is used to calculate the edit distance of large DNA sequences this is extremly bad. Next time I will show how we can get linear space (O(m+n)) for the algorithm.

Tags: alignment, Damerau, dna, dynamic programming, edit distance, F#, Levenshtein algorithm, Levenshtein distance

Feedback:

Comments (0)

Sunday, 18. May 2008

May 18

Vortragsfolien von der STC 2008 verfügbar

Filed under: BioInformatik,Dynamics NAV 2009,Informatik,Steffen,Veranstaltungen — Steffen Forkmann at 18:43 Uhr

Am Donnerstag war ich auf der Student Technology Conference 2008 in der Berliner Kalkscheune. Nachdem mich mein Navigationsgerät in die Oranienburger Straße in Reinickendorf statt Mitte geschickt hat und ich 10km weiter durch den Berufsverkehr fahren musste, konnte ich in der Kalkscheune gleich meinen ersten Kaffee zum Frühstück trinken. Wie es sich für eine vom Umweltministerium geförderte Konferenz gehört, wurde der in Pappbechern gereicht. 🙂 Im Laufe der Konferenz wurde dieses Missgeschick jedoch bemerkt und seitdem gab es nur noch Keramik.

Insgesamt hat mir die Konferenz sehr gut gefallen und nachdem ich zweimal bereits als Zuhörer auf der STC war, durfte ich dieses Jahr auch mal einen eigenen Vortrag halten. Die Folien dazu können nun hier herunter geladen werden.

Hier noch ein paar meiner Bilder aus Berlin:

PS: Bilder von der Konferenz werden demnächst sicher unter www.studentconference.de zu finden sein.

Nachtrag: Die Konferenzbilder sind jetzt unter http://www.schmolzeundkuehn.de/stc_2008 zu finden.

Tags: Berlin, optimierung, STC 2008, Student Technology Conference, Tourenplanung

Feedback:

Comments (0)

Friday, 9. May 2008

May 09

Vortrag auf der Student Technology Conference 2008

Filed under: .NET,BioInformatik,Dynamics NAV 2009,Informatik,Theoretische,Veranstaltungen — Steffen Forkmann at 12:52 Uhr

Aufgrund einer Sprecherabsage, habe ich kurzfristig einen Vortrag auf der STC 2008 bekommen. Die Veranstaltung steht dieses Jahr unter dem Motto “GreenIT”. Mein Vortrag wird deshalb auch etwas “grüner” als ein “normaler” Navision-Vortrag:

Die Umwelt schonen und gleichzeitig Kosten sparen

Tourenoptimierung in Dynamics NAV

Die strategische Tourenplanung für große Flotten ist ein so komplexes Problem, dass man keine optimale Lösung in vertretbarer Zeit berechnen kann. Der einzige Ausweg führt über intelligente Heuristiken, die in kurzer Zeit Lösungen liefern, die möglichst nah an der optimalen Lösung liegen und damit helfen die Fahrtkosten und den Benzinverbrauch zu minimieren. Der Vortrag stellt einige dieser Verfahren vor und zeigt wie eine Implementation im ERP-System „Microsoft Dynamics NAV“ aussehen könnte.

Gleichzeitig bekomme ich auch die Gelegenheit als einer der ersten die neue Navision-Version “Dynamics NAV 2009” öffentlich zu zeigen.

Weitere Informationen gibt es in der Agenda.

Tags: Dynamics NAV 2009, optimierung, STC2008, Student Technology Conference, Tourplanung mit Zeitfenstern

Feedback:

Comments (3)

Monday, 5. February 2007

Feb 05

Projektarbeit am IPB Halle

Filed under: BioInformatik — Sebastian Wolf at 10:23 Uhr

Am Leibniz Institut für Pflanzenbiochemie mache ich über die vorlesungsfreie Zeit eine Projektarbeit zusammen mit Michael Gerlich. Diese hat folgendes Thema: Analysepipeline für FTICR Daten

Am IPB wird ein FTICR Massenspektrometer u.a. zur Metabolitenanalyse betrieben.

Ziel des Projektes ist es, die für LC-MS bestehende Analysepipeline der Arbeitsgruppe für die Verarbeitung der FTICR Daten anzupassen. Dazu gehören folgende Aufgaben:

Konvertierung der Bruker Rohdaten in das mzData Format. Entweder über die Software CompassXport oder die Tools des ProteomeCommons Projektes (Evaluation & Auswahl).
Import des mzData Files in die IPB-eigene Datenbank (Dieser Teil ist bereits implementiert).
Signalverarbeitung der FTICR Daten mit Bioconductor/PROcess oder OpenMS tools.
Zuordnung der Massen aus verschiedenen Messungen
Clustering der Daten mit den Algorithmen aus unserer Toolbox

Das System soll als Pipeline in R und/oder Java implementiert werden, und über JavaServerFaces bedient werden.

Diese Projektarbeit umfasst auch eine Hausarbeit.

https://edpillsbelgium.nl

Tags: BioInformatik

Feedback:

Comments (1)

Thursday, 16. June 2005

Jun 16

Bioinformatik: Mustererkennung zur AIDS-Heilung

Filed under: BioInformatik — Steffen Forkmann at 12:19 Uhr

Unter http://www.research.microsoft.com/videos/hiv.wmv gibt es ein interessantes Video zur Mustererkennung und der Anwendung in der AIDS-Forschung.

In der Informatik gibt es aber jedoch noch viele weitere Problemstellungen, die fehlertolerante Mustervergleiche erfordern. Hierunter fallen die Spracherkennung, der Vergleich von Fingerabdrücken sowie die Sequenzanalyse in der Bioinformatik. Wie man effizient Sequenzen vergleichen kann, wird in dieser Arbeit beschrieben.

Tags: aids, alignment, BioInformatik, forschung, hiv, Informatik, microsoft, research, wmv

Feedback:

Comments (0)

Rash thoughts about .NET, C#, F# and Dynamics NAV.

Search:

Main:

Pages:

Archives:

Meta:

"Every solution will only lead to new problems."

Category BioInformatik

Saturday, 31. January 2009

n-dimensional k-means clustering with F# – part II

Thursday, 29. January 2009

n-dimensional k-means clustering with F#

Saturday, 1. November 2008

Damerau-Levenshtein-Distance in F# – part III – O(m+n) space and functional style

Damerau-Levenshtein-Distance in F# – part II – O(m+n) space

Friday, 31. October 2008

Damerau-Levenshtein-Distance in F# – part I

Sunday, 18. May 2008

Vortragsfolien von der STC 2008 verfügbar

Friday, 9. May 2008

Vortrag auf der Student Technology Conference 2008

Die Umwelt schonen und gleichzeitig Kosten sparen

Tourenoptimierung in Dynamics NAV

Monday, 5. February 2007

Projektarbeit am IPB Halle

Thursday, 16. June 2005

Bioinformatik: Mustererkennung zur AIDS-Heilung