Rash thoughts about .NET, C#, F# and Dynamics NAV.


"Every solution will only lead to new problems."

Saturday, 1. November 2008


Damerau-Levenshtein-Distance in F# – part III – O(m+n) space and functional style

Filed under: Informatik,BioInformatik,Mathematik,F# — Steffen Forkmann at 16:31 Uhr

In the first part of this series I showed a naïve algorithm for the Damerau-Levenshtein distance which needs O(m*n) space. In the last post I improved the algorithm to use only O(m+n) space. This time I will show a more functional implementation which uses only immutable F#-Lists and works still in O(m+n) space. This version doesn’t need any mutable data.

/// Calcs the damerau levenshtein distance.    
let calcDL (a:'a array) (b: 'a array) =       
  let n = a.Length + 1
  let m = b.Length + 1
  
  let processCell i j act l1 l2 ll1 =
    let cost = 
      if a.[i-1] = b.[j-1] then 0 else 1
    let deletion = l2 + 1
    let insertion = act + 1
    let substitution = l1 + cost
    let min1 =  
      deletion 
      |> min insertion 
      |> min substitution

    if i > 1 && j > 1 &&
      a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
        min min1 <| ll1 + cost
    else
      min1
  
  let processLine i lastL lastLastL =
    let processNext (actL,lastL,lastLastL) j =
      match actL with 
        | act::actRest -> 
          match lastL with
            | l1::l2::lastRest -> 
              if i > 1 && j > 1 then
                match lastLastL with
                  | ll1::lastLastRest -> 
                    (processCell i j act l1 l2 ll1 :: actL,
                     l2::lastRest,
                     lastLastRest)
                  | _ -> failwith "can't be"
              else
                (processCell i j act l1 l2 0 :: actL,
                 l2::lastRest,
                 lastLastL)                 
            | _ -> failwith "can't be"
        | [] -> failwith "can't be"
      
    let (act,last,lastLast) =
      [1..b.Length]
        |> List.fold_left processNext ([i],lastL,lastLastL)
    act |> List.rev
    
  let (lastLine,lastLastLine) =               
    [1..a.Length]
      |> List.fold_left
          (fun (lastL,lastLastL) i -> 
             (processLine i lastL lastLastL,lastL))
          ([0..m-1],[])
              
  lastLine.[b.Length]    
 
let damerauLevenshtein(a:'a array) (b:'a array) =
  if a.Length > b.Length then
    calcDL a b
  else
    calcDL b a

I admit the code is still a little messy but it works fine. Maybe I will find the time to cleanup a bit and post a final version.

Tags: , , , , , , , ,

Damerau-Levenshtein-Distance in F# – part II – O(m+n) space

Filed under: Informatik,BioInformatik,Mathematik,F# — Steffen Forkmann at 14:40 Uhr

Last time I showed a naïve implementation of the Damerau-Levenshtein-Distance in F# that needs O(m*n) space. This is really bad if we want to compute the edit distance of large sequences (e.g. DNA sequences). If we look at the algorithm we can easily see that only the last two lines of the (n*m)-matrix are used. This observation leads to a improvement where we compute the distance with only 3 additional arrays of size min(n,m).

/// Calcs the damerau levenshtein distance.    
let calcDL (a:'a array) (b: 'a array) =       
  let n = a.Length + 1
  let m = b.Length + 1
  let lastLine = ref (Array.init m (fun i -> i))
  let lastLastLine = ref (Array.create m 0)
  let actLine = ref (Array.create m 0)
    
  for i in [1..a.Length] do
    (!actLine).[0] <- i      
    for j in [1..b.Length] do          
      let cost = 
        if a.[i-1] = b.[j-1] then 0 else 1
      let deletion = (!lastLine).[j] + 1
      let insertion = (!actLine).[j-1] + 1
      let substitution = (!lastLine).[j-1] + cost
      (!actLine).[j] <- 
        deletion 
        |> min insertion 
        |> min substitution

      if i > 1 && j > 1 then
        if a.[i-1] = b.[j-2] && a.[i-2] = b.[j-1] then
          let transposition = (!lastLastLine).[j-2] + cost  
          (!actLine).[j] <- min (!actLine).[j] transposition
    
    // swap lines
    let temp = !lastLastLine
    lastLastLine := !lastLine
    lastLine := !actLine
    actLine := temp
            
  (!lastLine).[b.Length]

 
let damerauLevenshtein(a:'a array) (b:'a array) =
  if a.Length > b.Length then
    calcDL a b
  else
    calcDL b a

This version of the algorithm needs only O(n+m) space but is not really "functional" style. I will show a more "F#-stylish" version in part III.

Tags: , , , , , , , ,

Friday, 31. October 2008


Damerau-Levenshtein-Distance in F# – part I

Filed under: Informatik,BioInformatik,F# — Steffen Forkmann at 17:12 Uhr

Today I am publishing an algorithm for calculating the Damerau-Levenshtein distance in F#. The Levenshtein distance is a metric that allows to measure the amount of difference between two sequences and shows how many edit operations (insert, delete, substitution) are needed to transform one sequence into the other. The Damerau-Levenshtein distance allows the transposition of two characters as an operation. It is often used for spelling corrections or to measure the variation (“edit distance”) between DNA sequences.

let damerauLevenshtein(a:'a array) (b:'a array) =       
  let init i j =
    if j = 0 then i
    elif i = 0 then j else 0
  let n = a.Length + 1
  let m = b.Length + 1
 
  let d = Array2.init n m init
 
  for i in [1..a.Length] do
    for j in [1..b.Length] do          
      let cost = 
        if a.[i-1] = b.[j-1] then 0 else 1
      let deletion = d.[i-1, j] + 1
      let insertion = d.[i,j-1] + 1
      let substitution = d.[i-1,j-1] + cost
      d.[i, j] <- 
        deletion 
        |> min insertion 
        |> min substitution
 
      if i > 1 && j > 1 && a.[i-1] = b.[j-2] && 
           a.[i-2] = b.[j-1] then
        let transposition = d.[i-2,j-2] + cost  
        d.[i, j] <- min d.[i,j] transposition  
 
  d.[a.Length, b.Length]  

This naïve implementation needs quadratic space (O(m*n)). Since the algorithm is used to calculate the edit distance of large DNA sequences this is extremly bad. Next time I will show how we can get linear space (O(m+n)) for the algorithm.

Tags: , , , , , , ,

Thursday, 16. June 2005


Bioinformatik: Mustererkennung zur AIDS-Heilung

Filed under: BioInformatik — Steffen Forkmann at 12:19 Uhr

Unter http://www.research.microsoft.com/videos/hiv.wmv gibt es ein interessantes Video zur Mustererkennung und der Anwendung in der AIDS-Forschung.

In der Informatik gibt es aber jedoch noch viele weitere Problemstellungen, die fehlertolerante Mustervergleiche erfordern. Hierunter fallen die Spracherkennung, der Vergleich von Fingerabdrücken sowie die Sequenzanalyse in der Bioinformatik. Wie man effizient Sequenzen vergleichen kann, wird in dieser Arbeit beschrieben.

Tags: , , , , , , , ,