Diff Simplified

This should’ve been the first of the series “Fantastic Failures”, b/c it feels this way. But the only failure is me unable to Google and filter for the Hunt-McIlroy algorithm, before I got frustrated and tried to free-lance write the algorithm from a short description I read about years ago. Only after I was done, did I discover the Hunt-McIlroy algorithm, in a dusty corner of Internet. And it is so much simpler, than my concoction.

This was 3 days later. I thought it would take 3 hours. I did not use the above algorithms (b/c I’m a man and my mansplain is I refuse to read instructions that I can’t cut and paste). I did find a cut and paste solution but until days later. However, my loss in time, I will pass on, of my understanding of the diff algorithm.

TLDR

If you ever want to know how Unix GNU diff, or Windows Windiff works, I think I have a demo, based on a article I read a few years ago. I free-lanced wrote the demo, based on my understanding of the idea. Because I wanted a Javascript version.

But the version I found on internet is simpler and better. You can download my copy of the Hunt-McIlroy Algorithm here. Then use this code snippet, to use it.

<script src="hunt-mcilroy.js"></script>
<script>
function rediff(){
  const a = document.all['a'].value.split("\n");
  const b = document.all['b'].value.split("\n");
  const out = diff(a, b);
  console.log(a);
  console.log(b);
  console.log(out);
  var texta=null;
  var textb=null;
  var textd=null;
  var len=null;
  const re = document.all['result'];
  re.innerHTML="";
  for(var i=0; i<out.length; i++){
          const data=out[i];
          if(data.common) {
              len=data.common.length;
              texta=data.common;
              textb=data.common;
              textd="Same";
          } else if(data.file1) {
              len=data.file1.length;
              texta=data.file1;
              textb=[];
              textd="Del";
          }
          for(var j=0; j<len; j++){
              const row = re.insertRow(-1);
              const lineA = row.insertCell(0);
              const lineD = row.insertCell(1);
              const lineB = row.insertCell(2);
              lineA.innerHTML = texta[j] ? texta[j] : "";
              lineB.innerHTML = textb[j] ? textb[j] : "";
              lineD.innerHTML = textd;
              lineD.noWrap = true;
              lineD.style.paddingRight = "10px";
          }
          if(data.file2) {
              len=data.file2.length;
              texta=[];
              textb=data.file2;
              textd="Add";
          } else
              len=0;
          for(var j=0; j<len; j++){
              const row = re.insertRow(-1);
              const lineA = row.insertCell(0);
              const lineD = row.insertCell(1);
              const lineB = row.insertCell(2);
              lineA.innerHTML = texta[j] ? texta[j] : "";
              lineB.innerHTML = textb[j] ? textb[j] : "";
              lineD.innerHTML = textd;
              lineD.noWrap = true;
              lineD.style.paddingRight = "10px";
          }
  }
}
</script>
<table width="100%" cellpadding=0 style="font-size:8pt">
<tr><td><textarea id="a" 
            placeholder="Enter text for A" 
            rows="15" style="width:100%" 
            onchange="rediff()"></textarea></td>
<td><textarea id="b" 
            placeholder="Enter text for B" 
            rows="15" style="width:100%" 
            onchange="rediff()"></textarea>
</table>
<table id="result">
</table>


Live Demo of Hunt-McIlroy Diff


Yes, this is a live demo of the code snippet. Enter text for A and B and press [tab] or Alt-[tab] see differences, live below.

 
 

Failure, just made me understand what Hunt-McIlroy does

Hunt-McIlroy's simplicity, just hides a lot of meaning, that you don't get the taste of, by just copying the recipe. And not varying it, or trying to recreate it.

1. You just need to keep track of matches between the two files. This limits data we have to hold.
2. Matches between the two files, must be grouped into pairs, as they occur, and matches are not re-used. This further prunes the above data we have to search for.
3. Only the path to next match pair matters, from the current match location. You do not need to worry about other paths to other matches being more optimal. There is no such thing. So only one path is considered.

Those 3 assumptions, make the algorithm much faster than searching algorithm, like mine all are.

 
 

Basic Idea, of how to display differences between 2 files

First, if 2 files are different, you can show the files side by side and say "They are different". A bad algorithm, will do exactly what I said, and put the 2 files next to each other and say "Different". But we want an algorithm, that picks out the specific details that are different.

I will dub my algorithm the Tictawf Diff Algorithm. It was inspired by description of diff, I read a long time ago, which was - (summarized) diff is to find the shortest path, in a MxN grid whose dimensions represent the lines in files M and N, and whose cells are marked which lines 2 files is the same.

Visually, we need to find the shortest path from beginning of both files, to end of both files, gobbling up matching lines. This depicted metaphorically in the demo grid below, where you need to find a path from upper-left corner(beginning of both files), to lower-right corner(end of both files).

Live demo, if you a human, were to generate a diff report

We the People of the United States, in Order to (to make a united country), establish (fairness for all), (keep the peace at home), (protect the nation from threats to it), (make good lies for people, providing happiness), and secure the Blessings of Liberty to ourselves and (keep freedom a part of the lives of future generations) , do ordain and establish this Constitution for the United States of America in Order to Amen
0 1 2 3 4 5 6 7 8 9 10 11 12
We the People of the United States,

0 T                        
in Order to

1   T                   T  
form a more perfect Union,

2                          
establish

3       T                  
Justice,

4                          
insure domestic Tranquility,

5                          
provide for the common defence,

6                          
promote the general Welfare,

7                          
and secure the Blessings of Liberty to ourselves

8                 T        
and our Posterity,

9                          
do ordain and establish this Constitution for the United States of America.

10                          
Be an algorithm, or pretend you want to make a report to your manager, about the changes made from left file to top(right) file.
Click on either the lines, or the path thru the grid. There is no way to mess this up. But some organization makes more sense, than others.

The clearest result, clusters the same matching text together, and clusters the changed texts together, and doesn't reflect matching text as changed text.

Once you understand how to produce a report of differences between 2 text files, using the metaphor in live demo above, you will start to realize that organizing certain paths, are clearer and has fewer lines. This is what diff algorithms try to accomplish. Below, I will explain my path to understanding how a heuristic group of rules, tries to recreate above. And then you realize that Hut-McIlroy algorithm has the simplest set of rules, and the fewest to produce the best result.

 
 

My failures at recreating diff from scratch

There are 3 revisions Tictawf Diff Algorithm, based on what I thought was a pathing problem. They are below.

Revision 1 of TictAwf.Diff
http://100.24.142.145/teach-diff.html

I went with 2 passes, but you can do it, in 1 pass. But two passes makes it easy to understand.

Step 1. Compare both files, and index where the lines match and don't. I stored this in 2d array. One dimension was number of lines in 1 file. And 2nd dimension was the number of lines in the 2nd file. Build the entire grid, based on whether the line in the file, indicated by the index of the dimension, matches, and store true/false.

You can subdivide using any criteria you wish, such as by byte, or word, or by token. But most diff programs divide the files by line. It is also easier to visualize.

File A

Line 1. We the People of the United States, 
Line 2. in Order to 
Line 3. form a more perfect Union, 
Line 4. establish 
Line 5. Justice, 
Line 6. insure domestic Tranquility, 
Line 7. provide for the common defence, 
Line 8. promote the general Welfare, 
Line 9. and secure the Blessings of Liberty to ourselves 
Line 10. and our Posterity, 
Line 11. do ordain and establish this Constitution for the United States of America.
  
File B

Line 1. We the People of the United States, 
Line 2. in Order to 
Line 3. (to make a united country), 
Line 4. establish 
Line 5. (fairness for all), 
Line 6. (keep the peace at home), 
Line 7. (protect the nation from threats to it), 
Line 8. (make good lies for people, providing happiness),
Line 9. and secure the Blessings of Liberty to ourselves 
Line 10. and (keep freedom a part of the lives of future generations) , 
Line 11. do ordain and establish this Constitution for the United States of America
Line 12. in Order to 
Line 13. Amen
  
T                         0
  T                   T   1
                          2
      T                   3
                          4
                          5
                          6
                          7
                T         8
                          9
                          10
0 1 2 3 4 5 6 7 8 9 10 11 12 11
Step 1 is to make the compare between the files into a grid (or graph, as will be shown later).

The examples are made easier, bc English sentences rarely repeat. Repeating sequences tend to wreak havoc on any version of diff I imagine in my head

I will ruin the suspense. The green path was the shortest path, found at end of step 2. This first pass, just finds matching lines, which are marked as T

Step 2. Find the path of least cost from (0,0) to (end1,end2). Rules are simple. Moving diagonally (+1,+1) from true to true, is zero cost. Moving (+1,0) or (0,+1) is cost 1. Those are the only 3 movement allowed. Find the path of least cost.

Hint: Once you have 1 solution, you can abort future solutions early as soon as they have more cost than existing solution.

Purpose of Step 2 is to arrange similarities between the two files, and establish relationships between them. What the program is doing, is thinking like a computer. It assumes you want to see the differences, but a computer program finds matches easier. So it is a game of mutual exclusion for the computer program, whatever isn't a match, is a difference. The question then becomes, how to represent this to a human, that he can understand. The metaphor the computer tries to recreate, is - what is the least changes need to be made to File A, to make it File B. This computer program does this by finding path using only the moves described at the begining of step2 section, from the top of the envelope, to the bottom of the envelope, including the most matches. The rules are there to prevent repeats, or representing the data in both files out of order.

We the People of the United States, in Order to
We the People of the United States, T    
in Order to   T

This is the part that confuses most people, when moving from the graph, back to text interpretation. The above movement T to T, produces cost of zero, but advances both file line by 2. Line 1 and 2 are "We the People...","in order" and exist in both files. So the next lines in the comparisons, for both files have to start at line 3. But the graph cost is zero, b/c the ideal comparison result is zero for a completely identical file. And we want the program to find the least cost.

in Order to

(to make a united country),

establish

in Order to

T    
form a more perfect Union,

     
establish

    T

Non-contiguous matches, incur a cost. Specifically, they cost how many single steps it takes to get to the next match. This example has +2 vertically, then +2 horizontally. This is translated as File A has 2 lines that don't match, and File B has 2 lines that don't match. Since we left off at line3 for both files, since lines 1 and 2 are the same, this is interpretated as 2 lines exist in File A that don't exist in B. And vice versa.
When I used to stock shelves, I did hunt and pecking, trying to find the right shelf to put the cans. This computer program looks for matching lines or T's. And is looking for the path with most T. Not remembering where they were.

Once, the rules are clear, this is the step, that leaves the most to interpretation - How to find the path. I will call this revision 1, the hunt and peck method of path searching. But since I am so clever, I noticed that most text that changed only a few lines, and decided to have the program hunt and peck diagonally, not knowing it makes no difference once you enhance the algorithm later.

T                           0
  T                   T     1
                            2
      T                     3
                            4
                            5
                            6
                            7
                T           8
                            9
                            10
                          T 11
0 1 2 3 4 5 6 7 8 9 10 11 12 13 12
So the TictAwf way of searching for a path, from T to T, involves looking in adjacent cells, then recursively searching in next adjacent cell, until it finds a T, and ends at the lower right envelope. It will unfortunately search every cell, as a result of it's naivete. This illustration is just right in the middle of a search. This pass-step results in the red cells, and eventually at end, when all the cost is tallied, the green path foreshadowed in step1's illustration. There are a number of optimizations that can be added to this naive implementation, such as if the first path found is zero, or very small, all the future search can be cut short, as soon as it seaches too many red cells (ie. if cost of first path was 1, then all future branches can only have maximum of 1 red cell, and the rest have to be T, and can quit as soon as it counts more than 1)

Step 3: Each movement in the path of least cost has an interpretation: (+1,+1) means next 1 lines from both files are same, (+1,0) means next 1 line in first file is extra in first file that doesn't exist in 2nd file (or interpreted as they were deleted from first file, to result in the second file), and (0,+1) means next 1 line in second file is extra in second file that doesn't exist in the first file (or interpreted as they were added to first file, to result in the second file). There is no such thing as "changing" a line. If it was changed, the interpretation is that the line was removed, and a different version added.

path instruction FileA pos

FileB pos

Interpretation

0: {dx: 1, dy: 1, origin: {…}, dest: {…}}

0

0

Line 1 matches in both File A and B. Advance both files.

1: {dx: 1, dy: 1, origin: null, dest: null}

1

1

Line 2 matches in both File A and B. Advance both files.

2: {dx: 1, dy: 0, origin: {…}, dest: null}

2->3

2(ignore)

Line 3 exists in only FileA. Only advance File A. Ignore FileB line (is 3 still)

3: {dx: 0, dy: 1, origin: null, dest: {…}}

3(ignore)

2->3

Line 3 exists in only FileB. Only advance File B. Ignore FileA line (is 4 still)

4: {dx: 1, dy: 1, origin: null, dest: null}

3->4

3->4

Line 4 matches in both File A and B. Advance both files.

5: {dx: 4, dy: 0, origin: {…}, dest: null}

4->8

4(ignore)

Line 5,6,7,8 exists in only FileA. Only advance File A x4. Ignore FileB line 4

6: {dx: 0, dy: 4, origin: null, dest: {…}}

8(ignore)

4->8

Line 5,6,7,8 exists in only FileB. Only advance File B x4. Ignore FileA line 8

7: {dx: 1, dy: 1, origin: null, dest: null}

8->9

8->9

Line 9 matches in both File A and B. Advance both files.

8: {dx: 2, dy: 0, origin: {…}, dest: null}

9->11

9(ignore)

Line 10,11 exists in only FileA. Only advance File A. Ignore FileA line 10

9: {dx: 0, dy: 4, origin: null, dest: null}

11(ignore)

10->14

Line 10,11,12,14 exists in only FileB. Only advance File B. Ignore FileA line 11
pos 0 is line1, pos 1 is line2, etc. At beginning before, adding on the dx or dy number. You add dx to File1 position. Add dy to File2 pos. When dx==0, then only lines in File B exist, in the aggregated diff listing. When dy==0, then only lines in File A, exist.

Step 4. Show each file, and interpretation assigned to that line.

FileA pos

FileB pos

Interpretation

0->1

0->1

Show File A line 1, File B line 1

1->2

1->2

Show File A line 2, File B line 2

2->3

2(ignore)

Show File A line 3

3(ignore)

2->3

Show File B line 3

3->4

3->4

Show File A line 4, File B line 4

4->8

4(ignore)

Show File A line 5,6,7,8

8(ignore)

4->8

Show File B line 5,6,7,8

8->9

8->9

Show File A line 9, File B line 9

9->11

9(ignore)

Show File A line 10,11

11(ignore)

10->14

Show File B line 10,11,12,13
Only include the start line numbers from prev table, and include which lines to ignore line.

Now replace in the table above, the not-ignored lines, with the actual lines from the file, and you should get the below.

File A                File B
We the People of the United States, Same We the People of the United States,
in Order to Same in Order to
form a more perfect Union, Deleted from A
Added to A (to make a united country),
establish Same establish
Justice, Deleted from A
insure domestic Tranquility, Deleted from A
provide for the common defence, Deleted from A
promote the general Welfare, Deleted from A
Added to A (fairness for all),
Added to A (keep the peace at home),
Added to A (protect the nation from threats to it),
Added to A (make good lies for people, providing happiness),
and secure the Blessings of Liberty to ourselves Same and secure the Blessings of Liberty to ourselves
and our Posterity, Deleted from A
do ordain and establish this Constitution for the United States of America. Deleted from A
Added to A and (keep freedom a part of the lives of future generations) ,
Added to A do ordain and establish this Constitution for the United States of America
Added to A in Order to
Added to A Amen

--------- Added 4/12/2014-----------

The version1 of program kept searching same paths again and again. So I added code, to save results from paths that had already been searched

Version2
http://100.24.142.145/teach-diff.dynamicprog.html

Step 1. Same...

Except there are many language structures, that create repetitions. And repetition make the pathing, very much unclear. As a result, it results in more searching. Plus length of the files makes a dramatic degradation in execution time. So to make it useful, some improvements have to be made.

Take for instance the lyrics below, which is both longer than the previous example, and has repetitive sections.

Work, work, work, work, work, work
He said me haffi work, work, work, work, work, work
He see me do mi dirt, dirt, dirt, dirt, dirt, dirt
So me put in work, work, work, work, work, work
When you ah gon' learn, learn, learn, learn, learn?
Me nuh care if him hurt, hurt, hurt, hurt, hurting
Dry, me ah desert him
No time to have you lurking
Him ah go act like he nuh like it
You know I dealt with you the nicest
Nuh body touch me you nuh righteous
Nuh badda, text me in a crisis
I believed all of your dreams, adoration
You took my heart and my keys and my patience
You took my heart on my sleeve for decoration
You mistaken my love I brought for you for foundation
All that I wanted from you was to give me
Something that I never had
Something that you've never seen
Something that you've never been, mm
But I wake up and act like nothing's wrong
Just get ready fi work, work, work, work, work, work
He said me haffi work, work, work, work, work, work
He see me do mi dirt, dirt, dirt, dirt, dirt, dirt
So me put in work, work, work, work, work, work
Na, na, na, na, na, na
When you ah gon' learn, learn, learn, learn, learn, learn?
Before the tables turn, turn, turn, turn, turn, turn
Beg you something, please
Baby don't you leave
Don't leave me stuck here in the streets, uh huh
If I get another chance to
I will never, no never neglect you
I mean who am I to hold your past against you?
I just hope that it gets to you
I hope that you see this through
I hope that you see this true
What can I say?
Please recognize I'm tryin', babe
I haffi work, work, work, work, work, work
He said me haffi work, work, work, work, work, work
He see me do mi dirt, dirt, dirt, dirt, dirt, dirt
So me put in work, work, work, work, work, work
When you ah gon' learn, learn, learn, learn, learn
Me nuh care if him hurt, hurt, hurt, hurt, hurting
Yeah, okay
You need to get done, done, done, done at work, come over
We just need to slow the motion
Don't give that away to no one
Long distance, I need you
When I see potential I just gotta see it through
If you had a twin, I would still choose you
I don't wanna rush into it, if it's too soon
But I know you need to get done, done, done, done
If you come over
Sorry if I'm way less friendly
I got niggas tryna end me, oh, yeah
I spilled all my emotions tonight, I'm sorry
Rollin', rollin', rollin', rollin', rollin'
How many more shots until you're rollin'?
We just need a face to face
You could pick the time and the place
You spent some time away
Now you need to forward
And give me all the work, work, work, work, work, work
He said me haffi work, work, work, work, work, work
He see me do mi dirt, dirt, dirt, dirt, dirt, dirt
So me put in work, work, work, work, work, work
When you ah gon' learn, learn, learn, learn, learn
Me nuh care if him hurt, hurt, hurt, hurt, hurting
Mm
Mm
Work, work, work, work, work, work
Mm

Compare that to itself, and you still get .

T                                                                                                                                               T   0
  T                                         T                                   T                                                 T                 1
    T                                         T                                   T                                                 T               2
      T                                         T                                   T                                                 T             3
        T                                                                                                                                           4
          T                                                                             T                                                 T         5
            T                                                                                                                                       6
              T                                                                                                                                     7
                T                                                                                                                                   8
                  T                                                                                                                                 9
                    T                                                                                                                               10
                      T                                                                                                                             11
                        T                                                                                                                           12
                          T                                                                                                                         13
                            T                                                                                                                       14
                              T                                                                                                                     15
                                T                                                                                                                   16
                                  T                                                                                                                 17
                                    T                                                                                                               18
                                      T                                                                                                             19
                                        T                                                                                                           20
                                          T                                                                                                         21
  T                                         T                                   T                                                 T                 22
    T                                         T                                   T                                                 T               23
      T                                         T                                   T                                                 T             24
                                                  T                                                                                                 25
                                                    T                                                                                               26
                                                      T                                                                                             27
                                                        T                                                                                           28
                                                          T                                                                                         29
                                                            T                                                                                       30
                                                              T                                                                                     31
                                                                T                                                                                   32
                                                                  T                                                                                 33
                                                                    T                                                                               34
                                                                      T                                                                             35
                                                                        T                                                                           36
                                                                          T                                                                         37
                                                                            T                                                                       38
                                                                              T                                                                     39
  T                                         T                                   T                                                 T                 40
    T                                         T                                   T                                                 T               41
      T                                         T                                   T                                                 T             42
                                                                                      T                                                 T           43
          T                                                                             T                                                 T         44
                                                                                          T                                                         45
                                                                                            T                                                       46
                                                                                              T                                                     47
                                                                                                T                                                   48
                                                                                                  T                                                 49
                                                                                                    T                                               50
                                                                                                      T                                             51
                                                                                                        T                                           52
                                                                                                          T                                         53
                                                                                                            T                                       54
                                                                                                              T                                     55
                                                                                                                T                                   56
                                                                                                                  T                                 57
                                                                                                                    T                               58
                                                                                                                      T                             59
                                                                                                                        T                           60
                                                                                                                          T                         61
                                                                                                                            T                       62
                                                                                                                              T                     63
                                                                                                                                T                   64
  T                                         T                                   T                                                 T                 65
    T                                         T                                   T                                                 T               66
      T                                         T                                   T                                                 T             67
                                                                                      T                                                 T           68
          T                                                                             T                                                 T         69
                                                                                                                                            T T   T 70
                                                                                                                                            T T   T 71
T                                                                                                                                               T   72
                                                                                                                                            T T   T 73
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
Some texts have repetitive structures, such as song lyrics. Suddenly, the nearly straight path created by unique lines, blurs.

BUT since the files are identical, AND my first program searches the diagonals first, the first result indicates a cost of zero. And trying any other path, creates a cost, and it will terminate the search early.

HOWEVER, for the files that are not the same, and has many repeating lines within the same file, that is repeated in the other file, this will produce a lot of possible paths with similar costing. Unfortunately, along with the increased size of the file, this will result in many more evaluations. And if you look closely, the cells closer to lower right, get re-evaluated over and over again, when the algorithm returns to evaluating the unevaluated cells in upper-left.

Step 2. Same...

Except... for more than 1 solution, the first program tends to search for the same matching lines, in the same place. Optimizations can be added, assuming that from any matched lines, anywhere in the grid, to get to the bottom-right of envelope, is always the same. Code was added to save the results of any pathing requested from that matching line, to the bottom of envelope. So when pathing is done, starting from cells in upper-left, that will eventually reach the already evaluated match, it can respond with it's already saved pathing to the lower-right corner. This assumes that saving redundant steps, are more expensive than memory recall, which it is almost always, but not always. I believe this technique was referred to as "dynamic programming" (but I think it is a bad name for it).

Step 3: (Same as for revision 1)

Step 4. (Same as for revision 1)

--------- Added 4/14/2014-----------

The previous programs naively searched for paths (the red blocks), one cell at a time. BUT we actually already know all the possible paths, from pass 1. And since it is a 2d graph, we should be able to represent the line matches as vertices, using the match positions as coordinates, and using our already established rules to create a cost between those coordinates. Then use a spanning tree to construct the most relevant and shortest paths between the line matches. We are cutting out searching for red blocks, and replacing it with a spanning tree, and then searching it, for the path from top left of envelope, to bottom-right of envelope.

Revision 3

http://100.24.142.145/teach-diff.spantree.html

Step 1. Same...

You can still build the grid. I left it in there. It makes it easier to visualize. But you will slow your program down, and consume memory for it. It isn't necessary.

Except when we build the grid, we are replacing it with building a graph. But everywhere we created a match in the grid, we now save in a table.

So where these 2 files...

FileA

{
"departureStopName":
    "Metrotown Station @ Bay 4",
"departureStopCoor":
    {
"latitude":
    49.225794,
"longitude":
    -123.002518
},
"departureTime":
    [
2102,
2107,
2119,
2139,
2159,
2227,
2257,
2330
],
"departureTime":
    [
"9:02 PM",
"9:07 PM",
"9:19 PM",
"9:39 PM",
"9:59 PM",
"10:27 PM",
"10:57 PM",
"11:30 PM"
],
"departureTZ":
    "America/Vancouver",
"transitID":
    "TransLink",
"transitLine":
    "130",
"transitType":
    "BUS"
}
  
FileB

{
"departureStopName":
    "Metrotown Station @ Bay 4",
"departureStopCoor":
    {
"latitude":
    49.225794,
"longitude":
    -123.002518
},
"departureTime":
    [
2119,
2257,
2330,
2429
],
"departureTime":
    [
"9:19 PM",
"10:57 PM",
"11:30 PM",
"12:29 AM"
],
"departureTZ":
    "America/Vancouver",
"transitID":
    "TransLink",
"transitLine":
    "130",
"transitType":
    "BUS"
}
  

...used to produce this grid (I even added lines, so you can visualize a graph).
it is now changed into a graph, represented by only saving the points. And the point positions, are where 2 lines match, where the position in the files are. Just like their position in the grid.

T                                                                   0
  T                                                                 1
    T                                                               2
      T                                                             3
        T                                                           4
          T                                                         5
            T                                                       6
              T                                                     7
                T                                                   8
                  T                                                 9
                    T             T                                 10
                      T             T                               11
                                                                    12
                                                                    13
                        T                                           14
                                                                    15
                                                                    16
                                                                    17
                          T                                         18
                                                                    19
                                T             T                     20
                    T             T                                 21
                      T             T                               22
                                                                    23
                                                                    24
                                      T                             25
                                                                    26
                                                                    27
                                                                    28
                                        T                           29
                                                                    30
                                T             T                     31
                                                T                   32
                                                  T                 33
                                                    T               34
                                                      T             35
                                                        T           36
                                                          T         37
                                                            T       38
                                                              T     39
                                                                T   40
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 41
Some of you may notice, this is the start of a graph table. For those of you, the next steps will seem very similar to a spanning tree, kruskal's algorithm specifically.

So it is saved like below, first in table of matches, and table of cost between them, in first pass:

Table of vertices(matches)

FileA_pos

FileB_pos

0

0

1

1

...35 matching lines

Table of edges(all possible paths between vertices)

FileA_pos1

FileB_pos1

FileA_pos2

FileB_pos2

cost

0

0

1

1

0

...35x35-35 = 1190 lines

And then use the modified spanning tree algorithm, to get rid of edges that create a loop, incur a great cost, or violate the rules (you can only move down and right).

Now the pathing just needs to search thru shortest edges first, but doesn't examine each cell in the grid in those edges, to find the next match. If you throw away the edges used in the first pathing, you can start pathing on new start, until you have tried them all.

1190 "edges" are reduced to 34 paths that are relevant:

x1,y1,x2,y2,costX,costY,cost
0, 0, 1, 1, 0, 0, 0

1, 1, 2, 2, 0, 0, 0

2, 2, 3, 3, 0, 0, 0

3, 3, 4, 4, 0, 0, 0

4, 4, 5, 5, 0, 0, 0

5, 5, 6, 6, 0, 0, 0

6, 6, 7, 7, 0, 0, 0

7, 7, 8, 8, 0, 0, 0

8, 8, 9, 9, 0, 0, 0

9, 9, 10, 10, 0, 0, 0

10, 10, 11, 11, 0, 0, 0

10, 17, 11, 18, 0, 0, 0

20, 16, 21, 17, 0, 0, 0

21, 10, 22, 11, 0, 0, 0

21, 17, 22, 18, 0, 0, 0

31, 23, 32, 24, 0, 0, 0

32, 24, 33, 25, 0, 0, 0

33, 25, 34, 26, 0, 0, 0

34, 26, 35, 27, 0, 0, 0

35, 27, 36, 28, 0, 0, 0

36, 28, 37, 29, 0, 0, 0

37, 29, 38, 30, 0, 0, 0

38, 30, 39, 31, 0, 0, 0

39, 31, 40, 32, 0, 0, 0

11, 11, 14, 12, 2, 0, 2

22, 18, 25, 19, 2, 0, 2

14, 12, 18, 13, 3, 0, 3

18, 13, 20, 16, 1, 2, 3

25, 19, 29, 20, 3, 0, 3

29, 20, 31, 23, 1, 2, 3

9, 9, 10, 17, 0, 7, 7

21, 10, 22, 18, 0, 7, 7

31, 16, 32, 24, 0, 7, 7

18, 13, 20, 23, 1, 9, 10

You can still add the optimization where future paths are aborted early, if the cost of currently evaulated path, exceeds the previous path, which should be the already shortest path in the spanning tree from left edge to right.

Step 2. Same...

...except the pathing now only goes from matching line to matching line, and lines that is skipped in either file, is considered NOT MATCHING. Consecutive matching lines incur no cost in the paths between them. That is why there are so many cost:0 entries, left in the remaining 34 paths. It will clump together matching lines.

Edges (which is what a graph calls a path) representing links between vertices (which are matching lines) that are not consecutive, have a non-zero cost representing the number of deletions, and line additions that need to be added between the matching lines (the vertices ). costX is lines in FileA not in B. costY is lines in FileB not in A. cost is a number that the program tries to use to minimize.

The path with least cost, or least changes, is what is represented in the diff table between the two files. This is the same method used by the other 2 methods, but the 3rd revision doesn't keep searching the grid, but instead creates a table that represents a graph, which has all the possible ways that the vertices can connect to each other.

The spanning tree algorithm, simplifies this complicated graph, into a simpler graph with only the shortest spans, and where there are no loops, and eliminating the other edges. This is the graph we search thru for shortest path. And since the edges are the red blocks, we no longer keep searching thru them, after pass 1.

Step 3: Same

Step 4. Same

Have fun!

----------Added 4/16/2024---------

Here is Hunt-Szymanski, in a Javascript. I eventually found it. I tried
[a,b,c,d,e] vs [b,c,e,f,g] and it works for that case.

https://leastfixedpoint.com/files/lshift-blog-uploads/2006/08/hunt-mcilroy.js (Hunt and McIlroy 1976)

I read the code and the descriptions, and I can see if it doesn't seem to make sense at first glance. BUT It is actually simpler than my explanation of 3rd revision, but almost the same thing. But the code isn't explained, so it seems dense.

Step 1. Index the content of File2 into memory, and what line(s) it belongs in. That is, there is a function: f("line contents") = [lines in file2 that have that line].

The "matching line list" would be the "vertice" list in revision 3.

Step 2. Keep list of File1 line# and File2 line# pairs, which containing matching lines. This is initially empty. Here, we will refer to this as the matching line list.

Step3. Iterate each line in File1, and check the File2 index (created before in step1), if it exists (ie. f("{")=[line 0 in file 2]). That meants a result returned line 0 in File 1 is "{" and it exists in File2 as line 0. Then it checks that File2 line 0 comes after any record in the matching line list , mentioned in step2. The last sentence ensures that the File2 line# in the matching line list is advancing (File1 line is automatically advancing b/c the loop iterates ascendingly, so it will insert into matching line list ascendingly). If you understand my explanation for my algorithm, they are doing the same thing as the all my revision's Step2 rules for movement. If there is more than 1 line in file 2 with exact same line, it takes the next highest number, greater than the File2 line# in Step2 list/array.

Step 4. At the end, you have the Step2 list. It contains a list of advancing line# for File1 and File2.

Step 5. Follow the translation I gave you (in the Step3, for all my revisions). The Step2 array is complete by step 4. They give this a self important name, called Least Common Sequence(LCS). This LCS is a path, with actual line numbers instead of delta advances. The path only has matches, so every time the path advance both File1 and File2 line# by 1, then these are matches. Every time it doesn't, then these are not consecutive text blocks in both files. In exact same way, as my step 3, you need to show the line# in File1 AND the line# in File2, before you show the next match.

The unix comm is like a "intersection" utility. But only operates on 2 sorted files. It shows the lines that are common to both. An "inner join" in SQL, matches rows together. This might seem similar. They definitely differ when File1 and File2 has multiple exact same lines.

[A,A,A] comm [A,A] = [A,A]

[A,A,A] inner join [A,A] (on 2x 1 col table) = [[A,A],[A,A],[A,A],[A,A],[A,A],[A,A]] or [A,A,A,A,A,A]

inner joins on non-unique columns still produce a cross-product.

The clue in the comments is "comm", which is a unix utility that is supposed to be similar to a inner join. Except comm only works on sorted files, and the neither files are sorted. So this algorithm works like a "comm" utlity, if comm worked on non-sorted files, where order of the lines in the files matters.

Does it run faster than my version? Probably. They both have to iterate MxN to find matches. My algorithms still do a cost search after finding line matches, it doesn't. It assumes the LCS is the least cost.

It is ridiculously simple. I'm surprised I didn't see it at first.

Cut and paste it. It belongs to someone else. I take no responsibility for it. But I think I explain the idea of the diff, better. And why Hunt-Szymanski is so elegantly simple.

Leave a Reply

Your email address will not be published. Required fields are marked *