In the last post I show how we can calculate clusters of n-dimensional Objects with the K-means-Algorithm in F#. This time I will simplify the code by using the built-in type vector.
First of all we redefine the interface for “clusterable” objects:
type IClusterable =
abstract DimValues: vector with get
This time we do not need any dimension information – the vector type will care about this. Now we can reduce our distance function to:
let calcDist (p1:vector) (p2:vector) = (p1 - p2).Norm
The function Norm calculates the 2-Norm of a vector.
So calcDist will still compute the n-dimensional Euclidean distance.
The next step is the definition of a n-dimensional Centroid type. This time we will only use a type alias of vector:
type Centroid = vector
The calculation of the centroid and the centroid error can be written as:
let calcCentroid (items:'a list when 'a :> IClusterable)
(oldCentroid:Centroid) =
let count = items.Length
if count = 0 then oldCentroid else
let sum =
items
|> List.fold_left
(fun vec item -> vec + item.DimValues) // vector addition
(nullVector oldCentroid)
sum * (1./(count |> float)) // multiply with scalar
let calcError centroid (items:'a list when 'a :> IClusterable) =
let sq x = x * x
items
|> List.sum_by (fun i -> calcDist centroid i.DimValues |> sq)
The calcError function is straight forward but not perfect in terms of performance. As we use the calcDist function and hence the 2-Norm the function will first calculate a square root and then squares the result.
The rest of the code is pretty much same as in the last post:
/// creates n-dimensional 0 - vector
let nullVector (v:vector) = v * 0.
type Cluster<'a> when 'a :> IClusterable =
{Items: 'a list;
Count: int;
Centroid: Centroid;
Error: float}
member x.Print =
printfn "(Items: %d, Centroid: %A, Error: %.2f)"
x.Count x.Centroid x.Error
static member CreateCluster (item:'a) =
let items = [item]
let empty = nullVector item.DimValues
let centroid = calcCentroid items empty
{new Cluster<'a> with
Items = items and
Count = 1 and
Centroid = centroid and
Error = calcError centroid items}
member x.EvolveCluster (items:'a list) =
let l = items.Length
let centroid = calcCentroid items x.Centroid
{x with
Items = items;
Count = l;
Centroid = centroid;
Error = calcError centroid items}
let rand = new Random()
let InitClusters (k:int)
(items:'a array when 'a :> IClusterable) =
let length = items.Length
let getInitItem() = items.[rand.Next length]
Array.init k (fun x ->
getInitItem()
|> Cluster<'a>.CreateCluster)
let AssignItemsToClusters k (clusters:Cluster<'a> array)
(items:'a seq when 'a :> IClusterable) =
if k <= 0 then failwith "KMeans needs k > 0"
let findNearestCluster (item:'a) =
let minDist,nearest,lastPos =
clusters
|> Array.fold_left
(fun (minDist,nearest,pos) cluster ->
let distance = calcDist item.DimValues (cluster.Centroid)
if distance < minDist then
(distance,pos,pos+1)
else
(minDist,nearest,pos+1))
(Double.PositiveInfinity,0,0)
nearest
let assigned =
items
|> Seq.map (fun item -> item,findNearestCluster item)
let newClusters = Array.create k []
for item,nearest in assigned do
newClusters.[nearest] <- item::(newClusters.[nearest])
clusters
|> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])
let calcClusterError (clusters:Cluster<'a> array) =
clusters
|> Array.fold_left (fun acc cluster -> acc + cluster.Error) 0.
let kMeansClustering K epsilon
(items:'a array when 'a :> IClusterable) =
let k = if K <= 0 then 1 else K
let rec clustering lastError (clusters:Cluster<'a> array) =
let newClusters =
AssignItemsToClusters k clusters items
let newError = calcClusterError newClusters
if abs(lastError - newError) > epsilon then
clustering newError newClusters
else
newClusters,newError
InitClusters k items
|> clustering Double.PositiveInfinity
This vector approach shortens the code for the kMeans algorithm, but has a little longer runtime. I am interested in any suggestions.
Tags:
Centroid,
Clustering,
k-means,
KMeans,
vector
The K-means-Algorithm is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. In this article I will show how we can implement this in F#.
First of all we define an interface for “clusterable” objects:
type IClusterable =
abstract DimValues: float array with get
abstract Dimensions: int
The next step is to define a distance function. We will use n-dimensional Euclidean distance here.
let calcDist (p1:IClusterable) (p2:IClusterable) =
let sq x = x * x
if p1.Dimensions <> p2.Dimensions then
failwith "Cluster dimensions aren’t equal."
p2.DimValues
|> Array.fold2
(fun acc x y -> x – y |> sq |> (+) acc)
0. p1.DimValues
|> sqrt
Now we define a n-dimensional Centroid type. Our kMeans-Algorithm will minimize the squared distances to k centroids (or “means”).
type Centroid =
{Values: float array;
dimensions: int}
member x.Print = printfn "%A" x.Values
interface IClusterable with
member x.DimValues = x.Values
member x.Dimensions = x.dimensions
The centroid of a finite set of n-dimensional points x1, x2, …, xn is calculated as
let calcCentroid (items:’a list when ‘a :> IClusterable)
(oldCentroid:Centroid) =
let count = items.Length
if count = 0 then oldCentroid else
let mValues =
[|for d in 0..oldCentroid.dimensions-1 do
let sum =
items
|> List.sumBy (fun item -> item.DimValues.[d])
yield sum / (count |> float)|]
{ Values = mValues;
dimensions = oldCentroid.dimensions}
We made a small modification – if we don’t have any items assigned to a cluster the centroid won’t change. This is important for the robustness of the algorithm.
Now we can calculate the squared errors to a centroid:
let calcError centroid (items:’a list when ‘a :> IClusterable) =
let calcDiffToCentroid (item:’a) =
centroid.Values
|> Array.fold2
(fun acc c i -> acc + (c-i)*(c-i))
0. item.DimValues
items
|> List.sumBy calcDiffToCentroid
For storing cluster information we create a cluster type:
type Cluster<‘a> when ‘a :> IClusterable =
{Items: ‘a list;
Count: int;
Centroid: Centroid;
Error: float}
member x.Print =
printfn "(Items: %d, Centroid: %A, Error: %.2f)"
x.Count x.Centroid x.Error
static member CreateCluster dimensions (item:’a) =
let items = [item]
let empty =
{ Values = [||];
dimensions = dimensions}
let centroid = calcCentroid items empty
{ Items = items;
Count = 1;
Centroid = centroid;
Error = calcError centroid items}
member x.EvolveCluster (items:’a list) =
let l = items.Length
let centroid = calcCentroid items x.Centroid
{x with
Items = items;
Count = l;
Centroid = centroid;
Error = calcError centroid items}
Now we need a function to initialize the clusters and a function to assign a items to the nearest cluster:
open System
let rand = new Random()
let InitClusters (k:int) dimensions
(items:’a array when ‘a :> IClusterable) =
let length = items.Length
let getInitItem() = items.[rand.Next length]
Array.init k (fun _ ->
getInitItem()
|> Cluster<‘a>.CreateCluster dimensions)
let AssignItemsToClusters k dimensions (clusters:Cluster<‘a> array)
(items:’a seq when ‘a :> IClusterable) =
if k <= 0 then failwith "KMeans needs k > 0"
let findNearestCluster item =
let minDist,nearest,lastPos =
clusters
|> Array.fold
(fun (minDist,nearest,pos) cluster ->
let distance = calcDist item (cluster.Centroid)
if distance < minDist then
(distance,pos,pos+1)
else
(minDist,nearest,pos+1))
(Double.PositiveInfinity,0,0)
nearest
let assigned =
items
|> Seq.map (fun item -> item,findNearestCluster item)
let newClusters = Array.create k []
for item,nearest in assigned do
newClusters.[nearest] <- item::(newClusters.[nearest])
clusters
|> Array.mapi (fun i c -> c.EvolveCluster newClusters.[i])
The last step is a function which calculates the squared error over all clusters:
let calcClusterError (clusters:Cluster<‘a> array) =
clusters
|> Array.sumBy (fun cluster -> cluster.Error)
Now it is an easy task to write the kMeans algorithm:
let kMeansClustering K dimensions epsilon
(items:’a array when ‘a :> IClusterable) =
let k = if K <= 0 then 1 else K
let rec clustering lastError (clusters:Cluster<‘a> array) =
let newClusters =
AssignItemsToClusters k dimensions clusters items
let newError = calcClusterError newClusters
if abs(lastError – newError) > epsilon then
clustering newError newClusters
else
newClusters,newError
InitClusters k dimensions items
|> clustering Double.PositiveInfinity
We can test this algorithm with random 2-dimensional points:
type Point =
{X:float; Y: float}
member x.Print = printfn "(%.2f,%.2f)" x.X x.Y
interface IClusterable with
member x.DimValues = [| x.X; x.Y |]
member x.Dimensions = 2
let point x y = {X = x; Y = y}
let getRandomCoordinate() = rand.NextDouble() – 0.5 |> (*) 10.
let randPoint _ = point (getRandomCoordinate()) (getRandomCoordinate())
let points n = Array.init n randPoint
let items = points 1000
printfn "Items:"
items |> Seq.iter (fun i -> i.Print)
let clusters,error = kMeansClustering 3 2 0.0001 items
printfn "\n"
clusters |> Seq.iter (fun c -> c.Print)
printfn "Error: %A" error
Next time I will show how we can simplify the code by using F#’s built-in type vector instead of array.
Tags:
Centroid,
Clustering,
F#,
k-means,
KMeans
In the first part of this series I showed a naïve algorithm for the Damerau-Levenshtein distance which needs O(m*n) space. In the last post I improved the algorithm to use only O(m+n) space. This time I will show a more functional implementation which uses only immutable F#-Lists and works still in O(m+n) space. This version doesn’t need any mutable data.
/// Calcs the damerau levenshtein distance.
let calcDL (a:'a array) (b: 'a array) =
let n = a.Length + 1
let m = b.Length + 1
let processCell i j act l1 l2 ll1 =
let cost =
if a.[i-1] = b.[j-1] then 0 else 1
let deletion = l2 + 1
let insertion = act + 1
let substitution = l1 + cost
let min1 =
deletion
|> min insertion
|> min substitution
if i > 1 && j > 1 &&
a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
min min1 <| ll1 + cost
else
min1
let processLine i lastL lastLastL =
let processNext (actL,lastL,lastLastL) j =
match actL with
| act::actRest ->
match lastL with
| l1::l2::lastRest ->
if i > 1 && j > 1 then
match lastLastL with
| ll1::lastLastRest ->
(processCell i j act l1 l2 ll1 :: actL,
l2::lastRest,
lastLastRest)
| _ -> failwith "can't be"
else
(processCell i j act l1 l2 0 :: actL,
l2::lastRest,
lastLastL)
| _ -> failwith "can't be"
| [] -> failwith "can't be"
let (act,last,lastLast) =
[1..b.Length]
|> List.fold_left processNext ([i],lastL,lastLastL)
act |> List.rev
let (lastLine,lastLastLine) =
[1..a.Length]
|> List.fold_left
(fun (lastL,lastLastL) i ->
(processLine i lastL lastLastL,lastL))
([0..m-1],[])
lastLine.[b.Length]
let damerauLevenshtein(a:'a array) (b:'a array) =
if a.Length > b.Length then
calcDL a b
else
calcDL b a
I admit the code is still a little messy but it works fine. Maybe I will find the time to cleanup a bit and post a final version.
Tags:
alignment,
Damerau,
damerau-levenshtein,
dna,
dynamic programming,
edit distance,
F#,
Levenshtein algorithm,
Levenshtein distance
Last time I showed a naïve implementation of the Damerau-Levenshtein-Distance in F# that needs O(m*n) space. This is really bad if we want to compute the edit distance of large sequences (e.g. DNA sequences). If we look at the algorithm we can easily see that only the last two lines of the (n*m)-matrix are used. This observation leads to a improvement where we compute the distance with only 3 additional arrays of size min(n,m).
/// Calcs the damerau levenshtein distance.
let calcDL (a:'a array) (b: 'a array) =
let n = a.Length + 1
let m = b.Length + 1
let lastLine = ref (Array.init m (fun i -> i))
let lastLastLine = ref (Array.create m 0)
let actLine = ref (Array.create m 0)
for i in [1..a.Length] do
(!actLine).[0] <- i
for j in [1..b.Length] do
let cost =
if a.[i-1] = b.[j-1] then 0 else 1
let deletion = (!lastLine).[j] + 1
let insertion = (!actLine).[j-1] + 1
let substitution = (!lastLine).[j-1] + cost
(!actLine).[j] <-
deletion
|> min insertion
|> min substitution
if i > 1 && j > 1 then
if a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
let transposition = (!lastLastLine).[j-2] + cost
(!actLine).[j] <- min (!actLine).[j] transposition
// swap lines
let temp = !lastLastLine
lastLastLine := !lastLine
lastLine := !actLine
actLine := temp
(!lastLine).[b.Length]
let damerauLevenshtein(a:'a array) (b:'a array) =
if a.Length > b.Length then
calcDL a b
else
calcDL b a
This version of the algorithm needs only O(n+m) space but is not really "functional" style. I will show a more "F#-stylish" version in part III.
Tags:
alignment,
Damerau,
damerau-levenshtein,
dna,
dynamic programming,
edit distance,
F#,
Levenshtein algorithm,
Levenshtein distance
Today I am publishing an algorithm for calculating the Damerau-Levenshtein distance in F#. The Levenshtein distance is a metric that allows to measure the amount of difference between two sequences and shows how many edit operations (insert, delete, substitution) are needed to transform one sequence into the other. The Damerau-Levenshtein distance allows the transposition of two characters as an operation. It is often used for spelling corrections or to measure the variation (“edit distance”) between DNA sequences.
let damerauLevenshtein(a:'a array) (b:'a array) =
let init i j =
if j = 0 then i
elif i = 0 then j else 0
let n = a.Length + 1
let m = b.Length + 1
let d = Array2.init n m init
for i in [1..a.Length] do
for j in [1..b.Length] do
let cost =
if a.[i-1] = b.[j-1] then 0 else 1
let deletion = d.[i-1, j] + 1
let insertion = d.[i,j-1] + 1
let substitution = d.[i-1,j-1] + cost
d.[i, j] <-
deletion
|> min insertion
|> min substitution
if i > 1 && j > 1 && a.[i-1] = b.[j-2] &&
a.[i-2] = b.[j-1] then
let transposition = d.[i-2,j-2] + cost
d.[i, j] <- min d.[i,j] transposition
d.[a.Length, b.Length]
This naïve implementation needs quadratic space (O(m*n)). Since the algorithm is used to calculate the edit distance of large DNA sequences this is extremly bad. Next time I will show how we can get linear space (O(m+n)) for the algorithm.
Tags:
alignment,
Damerau,
dna,
dynamic programming,
edit distance,
F#,
Levenshtein algorithm,
Levenshtein distance
Am Donnerstag war ich auf der Student Technology Conference 2008 in der Berliner Kalkscheune. Nachdem mich mein Navigationsgerät in die Oranienburger Straße in Reinickendorf statt Mitte geschickt hat und ich 10km weiter durch den Berufsverkehr fahren musste, konnte ich in der Kalkscheune gleich meinen ersten Kaffee zum Frühstück trinken. Wie es sich für eine vom Umweltministerium geförderte Konferenz gehört, wurde der in Pappbechern gereicht. 🙂 Im Laufe der Konferenz wurde dieses Missgeschick jedoch bemerkt und seitdem gab es nur noch Keramik.
Insgesamt hat mir die Konferenz sehr gut gefallen und nachdem ich zweimal bereits als Zuhörer auf der STC war, durfte ich dieses Jahr auch mal einen eigenen Vortrag halten. Die Folien dazu können nun hier herunter geladen werden.
Hier noch ein paar meiner Bilder aus Berlin:
PS: Bilder von der Konferenz werden demnächst sicher unter www.studentconference.de zu finden sein.
Nachtrag: Die Konferenzbilder sind jetzt unter http://www.schmolzeundkuehn.de/stc_2008 zu finden.
Tags:
Berlin,
optimierung,
STC 2008,
Student Technology Conference,
Tourenplanung
Aufgrund einer Sprecherabsage, habe ich kurzfristig einen Vortrag auf der STC 2008 bekommen. Die Veranstaltung steht dieses Jahr unter dem Motto “GreenIT”. Mein Vortrag wird deshalb auch etwas “grüner” als ein “normaler” Navision-Vortrag:
Die strategische Tourenplanung für große Flotten ist ein so komplexes Problem, dass man keine optimale Lösung in vertretbarer Zeit berechnen kann. Der einzige Ausweg führt über intelligente Heuristiken, die in kurzer Zeit Lösungen liefern, die möglichst nah an der optimalen Lösung liegen und damit helfen die Fahrtkosten und den Benzinverbrauch zu minimieren. Der Vortrag stellt einige dieser Verfahren vor und zeigt wie eine Implementation im ERP-System „Microsoft Dynamics NAV“ aussehen könnte.
Gleichzeitig bekomme ich auch die Gelegenheit als einer der ersten die neue Navision-Version “Dynamics NAV 2009” öffentlich zu zeigen.
Weitere Informationen gibt es in der Agenda.
Tags:
Dynamics NAV 2009,
optimierung,
STC2008,
Student Technology Conference,
Tourplanung mit Zeitfenstern
Am Leibniz Institut für Pflanzenbiochemie mache ich über die vorlesungsfreie Zeit eine Projektarbeit zusammen mit Michael Gerlich. Diese hat folgendes Thema: Analysepipeline für FTICR Daten
Am IPB wird ein FTICR Massenspektrometer u.a. zur Metabolitenanalyse betrieben.
Ziel des Projektes ist es, die für LC-MS bestehende Analysepipeline der Arbeitsgruppe für die Verarbeitung der FTICR Daten anzupassen. Dazu gehören folgende Aufgaben:
- Konvertierung der Bruker Rohdaten in das mzData Format. Entweder über die Software CompassXport oder die Tools des ProteomeCommons Projektes (Evaluation & Auswahl).
- Import des mzData Files in die IPB-eigene Datenbank (Dieser Teil ist bereits implementiert).
- Signalverarbeitung der FTICR Daten mit Bioconductor/PROcess oder OpenMS tools.
- Zuordnung der Massen aus verschiedenen Messungen
- Clustering der Daten mit den Algorithmen aus unserer Toolbox
Das System soll als Pipeline in R und/oder Java implementiert werden, und über JavaServerFaces bedient werden.
Diese Projektarbeit umfasst auch eine Hausarbeit.
Tags:
BioInformatik
Unter http://www.research.microsoft.com/videos/hiv.wmv gibt es ein interessantes Video zur Mustererkennung und der Anwendung in der AIDS-Forschung.
In der Informatik gibt es aber jedoch noch viele weitere Problemstellungen, die fehlertolerante Mustervergleiche erfordern. Hierunter fallen die Spracherkennung, der Vergleich von Fingerabdrücken sowie die Sequenzanalyse in der Bioinformatik. Wie man effizient Sequenzen vergleichen kann, wird in dieser Arbeit beschrieben.
Tags:
aids,
alignment,
BioInformatik,
forschung,
hiv,
Informatik,
microsoft,
research,
wmv