File comparison

Editing documents, program code, or any data always risks introducing errors. By displaying the differences between two or more sets of data, file comparison tools can make computing simpler and more efficient, focusing attention on new data and ignoring what did not change. Such a comparison is generically known as a diff^[1] after the Unix diff utility, and there is a range of ways to compare data sources and display the results.

Some widely used file comparison programs are diff, cmp, FileMerge, WinMerge, Beyond Compare, and File Compare.

Because understanding changes is important to writers of code or documents, many text editors and word processors include the functionality necessary to see the changes between different versions of a file or document.

Method types

The most efficient method of finding differences depends on the source data and the nature of the changes. One approach is to find the longest common subsequence between two files, then regard the non-common data as an insertion or a deletion.
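The longest-common-subsequence approach can be illustrated with a minimal Python sketch. It is a textbook dynamic-programming formulation, not the algorithm of any particular tool; the function name `lcs_diff` and the " ", "-", "+" markers are illustrative assumptions, and the quadratic table makes it suitable only for small inputs.

```python
from typing import List, Sequence, Tuple


def lcs_diff(a: Sequence[str], b: Sequence[str]) -> List[Tuple[str, str]]:
    """Return a list of (marker, item) pairs: ' ' common, '-' deleted, '+' inserted."""
    n, m = len(a), len(b)
    # table[i][j] holds the length of the longest common subsequence
    # of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start: items on the common subsequence are
    # kept, everything else is reported as a deletion or an insertion.
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", item) for item in a[i:])
    ops.extend(("+", item) for item in b[j:])
    return ops


if __name__ == "__main__":
    for marker, line in lcs_diff(["a", "b", "c"], ["a", "c", "d"]):
        print(marker, line)
```

Here the items are lines, but the same function works on any sequence of comparable units, such as words or characters.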

In 1978, Paul Heckel published an algorithm that identifies most moved blocks of text.^[2] This is used in the IBM History Flow tool.^[3] Other file comparison programs find block moves.^{[clarification needed]}

Some specialized file comparison tools find the longest increasing subsequence between two files.^[4] The rsync protocol uses a rolling hash function to compare two files on two distant computers with low communication overhead.
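To show why a rolling hash keeps the cost low, the following Python sketch slides a fixed-size window over some data and updates a two-part checksum in constant time per step, in the style of the weak checksum commonly described for rsync. The modulus, the function names, and the direct byte comparison used to confirm a match are assumptions for illustration; a real implementation would normally confirm candidate matches with a stronger hash exchanged between the two machines.

```python
from typing import Tuple

M = 1 << 16  # modulus for both checksum halves (an illustrative choice)


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Compute the checksum of a whole block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> Tuple[int, int]:
    """Slide an n-byte window one position: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_block(data: bytes, block: bytes) -> int:
    """Return the first offset in data where block occurs, or -1."""
    n = len(block)
    if n == 0 or len(data) < n:
        return -1
    target = weak_checksum(block)
    a, b = weak_checksum(data[:n])
    for offset in range(len(data) - n + 1):
        # The cheap checksum filters candidates; the byte comparison
        # stands in for the strong hash a real tool would use.
        if (a, b) == target and data[offset:offset + n] == block:
            return offset
        if offset + n < len(data):
            a, b = roll(a, b, data[offset], data[offset + n], n)
    return -1


if __name__ == "__main__":
    print(find_block(b"hello world", b"world"))
```

Because each step reuses the previous checksum, scanning every window position costs roughly one pass over the data rather than one full block hash per position.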

File comparison in word processors is typically at the word level, while comparison in most programming tools is at the line level. Byte or character-level comparison is useful in some specialized applications.
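The effect of the comparison unit can be seen in a short Python sketch that runs the same matcher once on lines and once on words. It uses `difflib.SequenceMatcher` from the Python standard library; the helper name `changed_units` and the sample text are illustrative assumptions.

```python
import difflib
from typing import List, Tuple


def changed_units(old_units: List[str], new_units: List[str]) -> List[Tuple[str, List[str], List[str]]]:
    """Return the non-equal regions as (tag, old_slice, new_slice) tuples."""
    matcher = difflib.SequenceMatcher(a=old_units, b=new_units, autojunk=False)
    return [
        (tag, old_units[i1:i2], new_units[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "the quick brown fox\njumps over the dog\n"
new = "the quick red fox\njumps over the dog\n"

# Line level: the whole first line is reported as changed.
line_changes = changed_units(old.splitlines(), new.splitlines())

# Word level: only the single altered word is reported.
word_changes = changed_units(old.split(), new.split())

if __name__ == "__main__":
    print(line_changes)
    print(word_changes)
```

The line-level result is coarser but maps directly onto source code, while the word-level result pinpoints the edit and is unaffected by where lines happen to break.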

Display

The optimal way to display the results of a file comparison depends on many factors, including the type of source data. The fixed lines of programming code provide a clear unit of comparison. This does not work with documents, where adding a single word may cause the following lines to wrap differently, but still not change the content.

The most popular ways to display changes are a side-by-side view or a consolidated view that highlights inserted and deleted data. In either view, the interface may, for the sake of efficiency, use code folding or text folding to hide portions of the file that did not change and show only the changes.^{[clarification needed]}
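A consolidated view that hides unchanged regions can be sketched with `difflib.unified_diff` from the Python standard library, whose `n` parameter sets how many unchanged context lines surround each change. The helper name `consolidated_view` and the sample lines are illustrative assumptions; the standard library also offers `difflib.HtmlDiff` for a side-by-side table.

```python
import difflib
from typing import List


def consolidated_view(old_lines: List[str], new_lines: List[str], context: int = 1) -> List[str]:
    """Return unified-diff lines, keeping only `context` unchanged lines around each change."""
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old", tofile="new", n=context
        )
    )


old_lines = ["alpha\n", "beta\n", "gamma\n", "delta\n", "epsilon\n"]
new_lines = ["alpha\n", "beta\n", "gamma\n", "DELTA\n", "epsilon\n"]

if __name__ == "__main__":
    # Lines starting with '-' were deleted, '+' were inserted,
    # and a leading space marks unchanged context.
    print("".join(consolidated_view(old_lines, new_lines)))
```

With one line of context, the unchanged lines far from the edit ("alpha" and "beta" here) are omitted from the output, which is the same economy that folding provides in an interactive viewer.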

Reasoning

There are various reasons to use comparison tools, and the tools themselves use different approaches. To compare binary files, a tool may use byte-level comparison. To compare text files or computer programs, many tools use a side-by-side visual comparison.^[5] This gives the user the chance to choose which changes to keep or reject before merging the files into a new version,^[6] or to keep both as-is for later reference through some form of version control.
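Byte-level comparison of binary data can be sketched in a few lines of Python. The function below reports the offset of the first differing byte, loosely in the spirit of what a tool such as cmp reports; the name `first_difference` and the chunk size are illustrative assumptions, and the sketch assumes streams that return full chunks until end of file (as regular files and in-memory buffers do).

```python
import io
from typing import BinaryIO, Optional


def first_difference(f1: BinaryIO, f2: BinaryIO, chunk_size: int = 65536) -> Optional[int]:
    """Return the zero-based offset of the first differing byte, or None if identical."""
    offset = 0
    while True:
        c1 = f1.read(chunk_size)
        c2 = f2.read(chunk_size)
        if c1 != c2:
            # Locate the exact byte inside the mismatching chunks.
            for i, (x, y) in enumerate(zip(c1, c2)):
                if x != y:
                    return offset + i
            # One stream is a prefix of the other: they differ where the shorter ends.
            return offset + min(len(c1), len(c2))
        if not c1:
            return None  # both streams ended with no difference found
        offset += len(c1)


if __name__ == "__main__":
    left = io.BytesIO(b"\x00\x01\x02\x03")
    right = io.BytesIO(b"\x00\x01\xff\x03")
    print(first_difference(left, right))
```

For files on disk, the same function can be called with two handles opened in binary mode, for example `open(path, "rb")`.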

File comparison is an important and integral part of file synchronization and backup. In backup methodologies, the issue of data corruption is important. There is rarely a warning before corruption occurs, which can make recovery difficult or impossible. Often, the problem only becomes apparent the next time someone tries to open a file. In this circumstance, a comparison tool can help to isolate the point at which the problem was introduced.^[7]

Historical uses

Before file comparison software, machines existed to compare magnetic tapes or punched cards. The IBM 519 Card Reproducer could determine whether decks of punched cards were equivalent. In 1957, John Van Gardner developed a system to compare the checksums of loaded sections of Fortran programs to debug compilation problems on the IBM 704.^[8]

References

[1] "diff", The Jargon File
[2] Heckel, Paul (1978), "A Technique for Isolating Differences Between Files" (PDF), Communications of the ACM, 21 (4): 264–268, doi:10.1145/359460.359467, S2CID 207683976, retrieved 2011-12-04
[3] Viégas, Fernanda B.; Wattenberg, Martin; Dave, Kushal (2004), Studying Cooperation and Conflict between Authors with history flow Visualizations (PDF), vol. 6, Vienna: CHI, pp. 575–582, retrieved 2011-12-01
[4] Liwei Ren; Jinsheng Gu; Luosheng Peng (18 April 2006). "Algorithms for block-level code alignment of software binary files". US patent 7031972 B2. Google Patents. USPTO. Retrieved 10 May 2019.
[5] MacKenzie, David; Eggert, Paul; Stallman, Richard (2003). Comparing and Merging Files with GNU Diff and Patch. Network Theory. ISBN 978-0-9541617-5-0.
[6] "File comparison software: vc-dwim and vc-chlog". www.gnu.org. Retrieved 2023-04-16.
[7] "SystemRescue - System Rescue Homepage". www.system-rescue.org. Retrieved 2023-04-16.
[8] John Van Gardner. "Fortran And The Genesis Of Project Intercept" (PDF). Retrieved 2011-12-06.
