Commit caa4ba38 authored by Andreas Kusalananda Kähäri

Remove old unused code for binary deltas.

parent 1a2aaa79

$Id$

Andreas Kahari, andreas.kahari@ebi.ac.uk

======================================================================
About "apply.pl"
======================================================================

The apply.pl program is a Perl script that will run on any Unix system
with a Perl interpreter installed (along with some common non-standard
Perl modules). It also makes use of the xdelta program (more on this
below).
The program will patch an older release of an Ensembl database
into a newer release by applying binary "delta files" created
by the build.pl program (discussed elsewhere). The delta files
are applied to the text dumps of the MySQL database files (as
found on ftp://ftp.ensembl.org/pub/<species>/data/mysql/) and will
thus incorporate any schema changes as well as data changes. It is
hoped that the process of downloading the delta files and applying
them to the older release of the text dump files on an external site
will be much quicker than downloading the complete new release.

Given a directory of delta files created by build.pl and a directory
containing the correct and untouched old revision of a database,
apply.pl will create a new directory and populate it with the new
revision of the files.

The new files may be loaded into MySQL as described in the "Installing
the Ensembl Data" section of the Ensembl installation instructions.
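
As a rough sketch only (the file names, the MySQL user, and the exact
options are illustrative; the installation instructions are the
authoritative reference), the loading step might look something like
this:

    cd homo_sapiens_core_12_31
    gunzip *.gz
    mysqladmin -u root create homo_sapiens_core_12_31
    mysql -u root homo_sapiens_core_12_31 < homo_sapiens_core_12_31.sql
    mysqlimport -u root --local homo_sapiens_core_12_31 *.txt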

Requirements / Configuration

To work, apply.pl needs the following components, which are usually
not part of your everyday Unix system.

1. The xdelta program (version 1.1.3, not version 2),
   http://sourceforge.net/projects/xdelta/

2. The following Perl modules, some available as standard modules,
   others available from CPAN at http://www.cpan.org/

   * Compress::Zlib
   * Digest::MD5
   * File::Basename
   * File::Copy
   * Getopt::Std

Check your distribution CDs before downloading and installing these
prerequisites from the web.
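
A quick way to verify that all of the Perl modules listed above are
available (just a sketch) is to try loading them all at once from the
command line and see whether perl complains about any of them:

    perl -MCompress::Zlib -MDigest::MD5 -MFile::Basename \
         -MFile::Copy -MGetopt::Std -e 'print "All modules found\n"'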

Usage

Running the apply.pl program without any arguments generates the
following informational text (or something very similar to it):

Usage: ./apply.pl [options] [--] database old_v new_v

    database  The database to work on, e.g. "homo_sapiens_core".
    old_v     The older version, e.g. "11_31".
    new_v     The newer version, e.g. "12_31".

The options may be any of these:

    -c cmd    Path to xdelta executable.
              Default: "xdelta".

    -s path   Path to the directory where the delta directory is stored.
              Default: "."

    -d path   Path to the directory holding the old version of the
              database, and where the new version of the database
              should be created. The new database directory will be
              given a unique name.
              Default: "."

Assuming the current directory holds a sub-directory containing the
11_31 release of e.g. the homo_sapiens_core Ensembl database, and
another sub-directory containing the delta files, the 12_31 release
may be created by doing this:

    ./apply.pl homo_sapiens_core 11_31 12_31 | tee apply.out

Note that the three non-optional arguments are exactly the same
as those used with build.pl to create the delta files. The
delta files for this example are assumed to be available in the
./homo_sapiens_core_11_31_delta_12_31 directory. Release 11_31 is
assumed to be available as ./homo_sapiens_core_11_31 and the result
will go in ./homo_sapiens_core_12_31 (if that directory doesn't
already exist).

The "| tee apply.out" part is only needed if you want to store the
output of apply.pl in a separate file, "apply.out" in this case.

Alternate locations for the xdelta executable etc. may be specified
using the options shown above.
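
For example, assuming (purely for illustration) that the delta
directory was downloaded to /data/deltas, that the old database
directory lives under /data/databases, and that the xdelta executable
was installed as /usr/local/bin/xdelta:

    ./apply.pl -c /usr/local/bin/xdelta -s /data/deltas \
        -d /data/databases homo_sapiens_core 11_31 12_31 | tee apply.out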

The apply.pl program will verify the MD5 checksums of all files
involved in the patch, including the delta files. The patching will
fail and the whole process will be aborted if any checksum fails.
This means that old files that have been modified on the external
site cannot be updated in this way.

Note: If a file is compressed (*.gz), the checksum and size of the
patched file will *not* be verified. This is because xdelta
recompresses the file, which in most cases results in a compressed
file that is slightly different from the original file. This also
means that the checksums in the CHECKSUM.gz file usually cannot be
used. However, the xdelta program still performs its own MD5 checksum
verification.
vim: et
#!/usr/bin/perl -w
# $Id$
#
# apply.pl
#
# A program that uses a previously created set of binary delta
# files to produce a new revision of an ensembl database out of
# an older revision of the same database. The delta files must
# have been built with the build.pl Perl program.
#
# See also apply.README
#
# Author: Andreas Kahari, <andreas.kahari@ebi.ac.uk>
#
use strict;
use warnings;
use File::Basename;
use File::Copy;
use Getopt::Std;
use Digest::MD5;
use Compress::Zlib;

# Compute the MD5 checksum of a file. Returns the checksum as a
# hex string.
sub make_checksum
{
    my $file_path = shift;

    my $digest = Digest::MD5->new();

    open FILE, $file_path or die $!;
    binmode FILE;
    $digest->addfile(*FILE);

    my $hex = $digest->hexdigest;

    close FILE;

    return $hex;
}

# Converts a byte count to a form more easily read by humans.
# Returns a string consisting of a float (two decimal places),
# a space, and a suffix.
sub make_human_readable
{
    my $bytes = shift;

    my @prefix = qw(b Kb Mb Gb Tb Pb);
    my $step = 0;

    while ($bytes > 10000) {
        $bytes /= 1024;
        ++$step;
    }

    return sprintf("%.2f %s", $bytes, $prefix[$step]);
}

# Display usage information for the apply.pl program.
sub usage_apply
{
    my $opts = shift;

    print STDERR <<EOT;
Usage: $0 [options] [--] database old_v new_v

    database  The database to work on, e.g. "homo_sapiens_core".
    old_v     The older version, e.g. "11_31".
    new_v     The newer version, e.g. "12_31".

The options may be any of these:

    -c cmd    Path to xdelta executable.
              Default: "$opts->{'c'}".

    -s path   Path to the directory where the delta directory is stored.
              Default: "$opts->{'s'}"

    -d path   Path to the directory holding the old version of the
              database, and where the new version of the database
              should be created. The new database directory will be
              given a unique name.
              Default: "$opts->{'d'}"
EOT
}

# Decompress a file.
sub do_decompress
{
    my $zfile_path = shift;
    my $file_path = shift;

    open(OUT, '>' . $file_path) or die $!;
    binmode OUT;

    my $gz = gzopen($zfile_path, "r");
    if (!defined($gz)) {
        close OUT;
        die $gzerrno;
    }

    my $buffer;
    while ((my $bytesread = $gz->gzread($buffer)) != 0) {
        print OUT substr($buffer, 0, $bytesread);
    }

    $gz->gzclose();
    close OUT;
}
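
# Main program. Default values for the -c, -s and -d options are
# seeded into %opts below; getopts() then overrides them with any
# values given on the command line.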
my %opts;

my $xdelta_cmd = $opts{'c'} = 'xdelta';
my $src_prefix = $opts{'s'} = '.';
my $dst_prefix = $opts{'d'} = '.';

if (!getopts('c:s:d:', \%opts)) {
    usage_apply(\%opts);
    die;
}

$xdelta_cmd = $opts{'c'};
$src_prefix = $opts{'s'};
$dst_prefix = $opts{'d'};

if ($#ARGV != 2) {
    usage_apply(\%opts);
    die;
}
my $db = $ARGV[0];
my $v1 = $ARGV[1]; my $v1_dir = sprintf "%s/%s_%s", $dst_prefix, $db, $v1;
my $v2 = $ARGV[2]; my $v2_dir = sprintf "%s/%s_%s", $dst_prefix, $db, $v2;

my $delta_dir = sprintf "%s/%s_%s_delta_%s", $src_prefix, $db, $v1, $v2;

die "$v1_dir: $!" if (! -d $v1_dir);
die "$delta_dir: $!" if (! -d $delta_dir);

while (-d $v2_dir) {
    $v2_dir = sprintf "%s.%04d", $v2_dir, int(rand(10000));
}

printf STDERR "Creating the directory '%s'\n", $v2_dir;
mkdir($v2_dir) or die $!;

my $v1_all_size = 0;
my $v2_all_size = 0;
my $delta_all_size = 0;
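
# Each "<file>.info" entry in the delta directory describes how to
# recreate "<file>": the first line holds the patch command (PATCH,
# COPY, ZIP or ADD), followed by one "checksum size" line each for
# the old file, the new file and the delta file. A checksum of
# "(none)" disables that particular check.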
foreach my $info_file (glob($delta_dir . '/*.info')) {
    my $base_name = basename($info_file);
    $base_name =~ s/\.info$//;

    my $v1_file    = sprintf "%s/%s", $v1_dir, $base_name;
    my $v2_file    = sprintf "%s/%s", $v2_dir, $base_name;
    my $delta_file = sprintf "%s/%s", $delta_dir, $base_name;

    printf "Processing '%s'\n", $base_name;

    open(INFO, $info_file) or die $!;

    my $patch_command = <INFO>; chomp $patch_command;

    my $v1_line = <INFO>; chomp $v1_line;
    my ($v1_sum, $v1_size) = split /\s+/, $v1_line;

    my $v2_line = <INFO>; chomp $v2_line;
    my ($v2_sum, $v2_size) = split /\s+/, $v2_line;

    my $delta_line = <INFO>; chomp $delta_line;
    my ($delta_sum, $delta_size) = split /\s+/, $delta_line;

    close INFO;
    if ($v1_sum ne '(none)' && $v1_sum ne make_checksum($v1_file)) {
        print "\tChecksum mismatch for old file\n";
        print "\tCan not continue\n";
        die;
    } elsif ($v1_sum ne '(none)' && $v1_size != (stat $v1_file)[7]) {
        print "\tSize mismatch for old file\n";
        print "\tCan not continue\n";
        die;
    } else {
        print "\tChecksum and size ok for old file\n";
    }

    if ($delta_sum ne '(none)' && $delta_sum ne make_checksum($delta_file)) {
        print "\tChecksum mismatch for delta file\n";
        print "\tCan not continue\n";
        die;
    } elsif ($delta_sum ne '(none)' && $delta_size != (stat $delta_file)[7]) {
        print "\tSize mismatch for delta file\n";
        print "\tCan not continue\n";
        die;
    } else {
        print "\tChecksum and size ok for delta file\n";
    }
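
    # Recreate the new file according to the patch command: PATCH runs
    # "xdelta patch" on the old file, COPY copies the old file over
    # unchanged, ZIP decompresses a gzipped copy of the new file kept
    # in the delta directory, and ADD copies a completely new file
    # from the delta directory.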
    if ($patch_command eq 'PATCH') {
        print "\tPatching file\n";
        system($xdelta_cmd, 'patch', $delta_file, $v1_file, $v2_file);
    } elsif ($patch_command eq 'COPY') {
        print "\tCopying old file\n";
        copy($v1_file, $v2_file);
    } elsif ($patch_command eq 'ZIP') {
        print "\tDecompressing compressed file\n";
        do_decompress($delta_file, $v2_file);
    } elsif ($patch_command eq 'ADD') {
        print "\tAdding new file\n";
        copy($delta_file, $v2_file);
    } else {
        warn "\tStrange patch command: $patch_command\n";
    }
    if (!($v2_file =~ /\.gz$/ && $patch_command eq 'PATCH')) {
        if ($v2_sum ne '(none)' && $v2_sum ne make_checksum($v2_file)) {
            print "\tChecksum mismatch for new file\n";
            print "\tCan not continue\n";
            die;
        } elsif ($v2_sum ne '(none)' && $v2_size != (stat $v2_file)[7]) {
            print "\tSize mismatch for new file\n";
            print "\tCan not continue\n";
            die;
        } else {
            print "\tChecksum and size ok for new file\n";
        }
    } else {
        print "\tNew file is compressed (*.gz), " .
              "will not verify checksum/size\n";
    }
}
#!/bin/ksh -ex
# $Id$
#
# Creates delta files between all consecutive revisions of all
# databases on ftp.ensembl.org using build.pl and apply.pl.
#
# Author: Andreas Kahari <andreas.kahari@ebi.ac.uk>
#
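
# Site-specific configuration: adjust these paths and commands to
# match the local environment.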
export LANG=C

ftpsite='ftp.ensembl.org'
ftppass="${LOGNAME}@$(hostname).$(domainname)"

dbdir='./databases'
deltadir='./deltas'

build_cmd='./build.pl'
apply_cmd='./apply.pl'
time_cmd='/usr/bin/time'
xdelta_cmd='./xdelta.osf'
perl_cmd='/usr/local/ensembl/bin/perl -w'

trapsigs="INT HUP TERM"
#-------------------------------------------------------------
# Function: file_list
# Usage: file_list
#
# Downloads the ls-lR.Z file off the FTP site into the current
# working directory. Extracts the names of the files that we are
# interested in from it and outputs them on standard output. The
# format of the output is "path dbname version", where "path" is
# the path of the FTP directory, "dbname" is the name of the
# database, and "version" is the version of the database. The
# output is sorted on "dbname", then on "version".
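# A typical output line might look like this (purely illustrative):
#
#   pub/homo_sapiens/data/mysql homo_sapiens_core 12_31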
function file_list
{
    ftp -i -n -v <<-EOT >/dev/null
open ${ftpsite}
user anonymous ${ftppass}
binary
get ls-lR.Z
EOT

    gunzip -c ls-lR.Z |
    grep 'data/mysql/.*[0-9][0-9]*_[0-9][0-9]*' | grep -v 'pub/NEW' |
    sed -n 's/^\(.*\)\/\([^\/]*\)_\([0-9][0-9]*_[0-9][0-9]*.*\):$/\1 \2 \3/p' |
    sort -k2,2 -k3,3
}
#-------------------------------------------------------------
# Function: cleanup
# Usage: cleanup dbname [version]
#
# Remove a downloaded database from ${dbdir}. If "version" is
# omitted, remove all versions of the database.
function cleanup
{
    typeset dbname=$1
    typeset version=${2:-'*'}

    if [[ ! -d ${dbdir} ]]; then
        return
    fi

    rm -f -r ${dbdir}/${dbname}_${version}
}
#-------------------------------------------------------------
# Function: fetch_db
# Usage: fetch_db path dbname version
#
# Fetches version "version" of the database "dbname" at the
# path "path" off the ${ftpsite}. The database will be stored
# in ${dbdir}.
function fetch_db
{
    typeset path=$1
    typeset dbname=$2
    typeset version=$3

    if [[ -d ${dbdir}/${dbname}_${version} ]]; then
        return
    fi

    mkdir -p ${dbdir}/${dbname}_${version}

    trap "rm -rf ${dbdir}/${dbname}_${version}; exit 1" ${trapsigs}

    (
        cd ${dbdir}
        ftp -i -n -v <<-EOT
open ${ftpsite}
user anonymous ${ftppass}
binary
cd ${path}
mget ${dbname}_${version}/*
EOT
    )

    trap - ${trapsigs}
}
#-------------------------------------------------------------
# Function: build_delta
# Usage: build_delta dbname opath oversion path version
#
# Records the changes between version "oversion" and "version"
# of database "dbname". Also tests the generated delta files.
function build_delta
{
    typeset dbname=$1
    typeset opath=$2
    typeset oversion=$3
    typeset path=$4
    typeset version=$5

    typeset outdir=${deltadir}/to_${version%_*[0-9]*}
    mkdir -p ${outdir}

    typeset bout=${outdir}/${dbname}_${oversion}_delta_${version}_build.out
    typeset aout=${outdir}/${dbname}_${oversion}_delta_${version}_apply.out

    if [[ ! -f ${bout} ]]; then
        fetch_db ${opath} ${dbname} ${oversion}
        fetch_db ${path} ${dbname} ${version}

        trap "rm ${bout}; exit 1" ${trapsigs}
        ${time_cmd} ${perl_cmd} ${build_cmd} -c ${xdelta_cmd} \
            -s ${dbdir} -d ${outdir} \
            ${dbname} ${oversion} ${version} | tee ${bout}
        trap - ${trapsigs}
    fi

    if [[ ! -f ${aout} ]]; then
        trap "rm ${aout}; exit 1" ${trapsigs}
        ${time_cmd} ${perl_cmd} ${apply_cmd} -c ${xdelta_cmd} \
            -d ${dbdir} -s ${outdir} \
            ${dbname} ${oversion} ${version} | tee ${aout}
        trap - ${trapsigs}
    fi
}
#-------------------------------------------------------------
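# Main loop: walk the sorted file list and, for each pair of
# consecutive versions of the same database, download both versions
# and build (and test) the delta files between them. Versions and
# databases that are no longer needed are removed along the way.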
file_list |
while read path dbname version; do
    if [[ -n ${odbname} ]]; then
        if [[ ${odbname} != ${dbname} ]]; then
            cleanup ${odbname}
            opath=${path}
            odbname=${dbname}
            oversion=${version}
            continue
        fi

        build_delta ${dbname} ${opath} ${oversion} ${path} ${version}
        cleanup ${dbname} ${oversion}
    fi

    opath=${path}
    odbname=${dbname}
    oversion=${version}
done

$Id$

Andreas Kahari, andreas.kahari@ebi.ac.uk

======================================================================
About "build.pl"
======================================================================

The build.pl program is a Perl script that will run on any Unix system
with a Perl interpreter installed (along with some common non-standard
Perl modules). It also makes use of the xdelta and gzip programs
(more on this below).
The program will compute binary "delta files" that may be used for
upgrading from one release of an Ensembl database to the next one.
The deltas are computed from the text dumps of the MySQL database
files (as found on ftp://ftp.ensembl.org/pub/<species>/data/mysql/)
and will thus incorporate any schema changes as well as data
changes. It is hoped that the process of downloading the delta files
and applying them to the older release of the text dump files on an
external site will be much quicker than downloading the complete new
release.

The new release may then be acquired by applying the delta files to
the old release using the apply.pl program (discussed elsewhere) and
loading the result into MySQL as described in the "Installing the
Ensembl Data" section of the Ensembl installation instructions.

Requirements / Configuration

To work, build.pl needs the following components, which are usually
not part of your everyday Unix system.

1. The xdelta program (version 1.1.3, not version 2),
   http://sourceforge.net/projects/xdelta/

2. The following Perl modules, some available as standard modules,
   others available from CPAN at http://www.cpan.org/

   * Compress::Zlib
   * Digest::MD5
   * File::Basename
   * Getopt::Std

3. The GNU zip (gzip) program.

Check your distribution CDs before downloading and installing these
prerequisites from the web.

Usage

Running the build.pl program without any arguments generates the
following informational text (or something very similar to it):

Usage: ./build.pl [options] [--] database old_v new_v

    database  The database to work on, e.g. "homo_sapiens_core".
    old_v     The older version, e.g. "11_31".
    new_v     The newer version, e.g. "12_31".

The options may be any of these:

    -c cmd    Path to xdelta executable.
              Default: "xdelta".

    -s path   Path to the directory where the databases are stored.
              Default: "."

    -d path   Path to the directory within which the delta
              directory should be created.
              Default: "."

To create the delta files containing all changes between the 11_31
release and the 12_31 release of the homo_sapiens_core database
located in the current directory, do this:

    ./build.pl homo_sapiens_core 11_31 12_31 | tee build.out

This creates a third directory called, in this case,
homo_sapiens_core_11_31_delta_12_31 (in the current directory) into
which the generated delta files will be put. In this case it is
assumed that the older revision is kept in ./homo_sapiens_core_11_31
and the newer revision is kept in ./homo_sapiens_core_12_31.

The "| tee build.out" part ensures that the output that the program
produces (which includes statistics about how much space was saved
for each file, etc.) is both displayed in the console and saved to
the specified file, "build.out" in this case.

To specify alternate locations for the databases, the generated files,
or for the xdelta executable, use the -s, -d, and -c switches
respectively as described above.
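
For example (directory and executable locations are illustrative
only):

    ./build.pl -c /usr/local/bin/xdelta -s /data/databases \
        -d /data/deltas homo_sapiens_core 11_31 12_31 | tee build.out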

For each file in both releases, the program will check whether the
file