BGZF files are bigger than GZIP files, but they are much faster for random access.
Tucked away in samtools' sister tool set tabix is a little command line tool called bgzip (blocked gzip), which lets you compress any file in BGZF format. It also lets you decompress a BGZF file (nothing special here, they are a GZIP variant so you can use gunzip for this if you like). However, it can also decompress a requested region of the file - without the excessive IO required to do this with a traditional gzipped file.
This next paragraph assumes you are familiar with the programming ideas of seek and tell for low level file IO. If you seek on a file handle, it means jump to a particular point in the file - usually a simple offset which counts the number of bytes from the start of the file. If you then read some data, the position in the file changes. Calling tell tells you where in the file you are now. This idea also works for writing data (you can open files for writing with random access).
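To make that concrete, here is a minimal self-contained Python sketch of seek and tell on an ordinary file (using a temporary file so there is nothing to clean up):

```python
import tempfile

# Open a scratch file in binary read/write mode.
with tempfile.TemporaryFile() as handle:
    handle.write(b"0123456789")
    handle.seek(3)         # jump to byte offset 3 from the start
    print(handle.tell())   # 3
    data = handle.read(4)  # reading advances the position...
    print(data)            # the 4 bytes starting at offset 3
    print(handle.tell())   # ...so tell() now reports 7
```

The same offsets work for writing too, which is how random access output files are updated in place.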
BGZF files (including BAM files) consist of many GZIP blocks, each limited to 64kb on disk and 64kb of uncompressed data. The virtual offset is an unsigned 64 bit integer which combines a 48 bit block start offset (where the block begins on disk) and a 16 bit within-block offset (into the decompressed block). What this means is that to get a particular piece of data from a BGZF file, you seek to a virtual offset: this is translated into a seek to the relevant BGZF block, which can be read and decompressed in memory (or incrementally), and then the within-block offset tells you where the requested data begins. Just one disk seek is needed (except for the special case where the data continues into the subsequent block).
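The bit arithmetic is simple enough to sketch in a few lines of Python (the helper names below are just for illustration, though Biopython's Bio.bgzf module later shipped similar functions):

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a BGZF virtual offset from its two components.

    The top 48 bits hold the start of the compressed block on disk,
    the bottom 16 bits hold the offset within the decompressed block.
    """
    assert 0 <= within_block_offset < 2 ** 16  # 16 bit within-block offset
    assert 0 <= block_start_offset < 2 ** 48   # 48 bit block start offset
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Recover (block_start_offset, within_block_offset)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(100000, 10)
print(voffset)                        # 6553600010
print(split_virtual_offset(voffset))  # (100000, 10)
```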
BGZF files are bigger than GZIP files! (Boo!)
BGZF files are bigger than traditional GZIP files. Each block requires its own header data, and compressing many units of data separately isn't as space efficient as compressing one large data file.
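You can demonstrate this effect with plain zlib (which both GZIP and BGZF use underneath): compressing repetitive data in independent 64kb chunks gives a larger total than compressing it as one stream, because each chunk starts with an empty compression dictionary and must re-store the repeated text. The data below is made up purely for illustration:

```python
import zlib

# Roughly 1 Mb of repetitive XML-ish data (invented for this example).
data = (b"<entry><name>ABC123</name>"
        b"<sequence>MKTAYIAKQR</sequence></entry>\n") * 16000

# One continuous stream, like a traditional GZIP file:
whole = len(zlib.compress(data))

# Independent 64kb chunks, like BGZF blocks:
chunked = sum(len(zlib.compress(data[i:i + 65536]))
              for i in range(0, len(data), 65536))

print(whole, chunked)   # the chunked total comes out larger
```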
As an example of this size penalty, let's look at the UniProt/TrEMBL data files - the final column shows the extra disk space taken using BGZF rather than GZIP, and this is noticeably higher for the "SwissProt" plain text format and the XML files (which have a lot of repeated text which must be stored in each BGZF block):
| Dataset | Format | Uncompressed | GZIP | BGZF | BGZF overhead |
|---|---|---|---|---|---|
| UniProt SProt (532,792 records) | FASTA | 252.7 Mb | 78.9 Mb | 82.0 Mb | 4% |
| UniProt SProt | SwissProt Text | 2.52 Gb | 448.6 Mb | 534.9 Mb | 19% |
| UniProt SProt | UniProt XML | 5.26 Gb | 773.2 Mb | 896.0 Mb | 16% |
| UniProt TrEMBL (17,651,715 records) | FASTA | 7.94 Gb | 3.50 Gb | 3.58 Gb | 2% |
| UniProt TrEMBL | SwissProt Text | 44.66 Gb | 7.23 Gb | 8.29 Gb | 4% |
| UniProt TrEMBL | UniProt XML | 90.83 Gb | 14.21 Gb | 16.21 Gb | 14% |
The GZIP sizes are for the files as downloaded from the EBI mirror FTP site. I then decompressed them, and re-compressed them as BGZF using bgzip. I presume that both the original GZIP files and the BGZF files are using the same default zlib compression settings.
The 64kb block size currently used in BGZF hurts its compression of repeat-rich files like XML, although this isn't an issue with compact records like FASTA, FASTQ, SAM, or BAM. This suggests an increased block size would benefit applying BGZF to XML file formats, with the significant downside that it would require a different virtual offset scheme.
BGZF files are better than GZIP files! (Yay!)
BGZF is intended to improve on GZIP for random access. For testing, my use case is to randomly select some SProt record IDs, extract those records (in that random order) from the parent file, and write them to a new subset file. (Note that for the XML file you'd also want to include an XML header/footer.) I'm going to do this with a modified version of Biopython's index support in Bio.SeqIO, currently on my bgzf branch on github.
UniProt SProt (532,792 records):

| Format | Task | Uncompressed | GZIP | BGZF |
|---|---|---|---|---|
| FASTA | Size | 252.7 Mb | 78.9 Mb | 82.0 Mb |
| SwissProt Text | Size | 2.52 Gb | 448.6 Mb | 534.9 Mb |
| SwissProt Text | Extract 10 | 0s | 77s | 0s |
| SwissProt Text | Shuffle 532,792 | 172s | - | 543s |
| UniProt XML | Size | 5.26 Gb | 773.2 Mb | 896.0 Mb |
| UniProt XML | Extract 10 | 0s | 157s | 0s |
| UniProt XML | Shuffle 532,792 | 427s | - | 934s |
The times are all in seconds, rounded to the nearest whole number. The "shuffle" time is the combination of indexing the file plus extracting all the records in the random order. The "-" entries are where I was not patient enough to let the GZIP random access task complete. The copy/decompress times were measured with the Unix commands time and cat (plus gunzip for the GZIP and BGZF files) to give an idea of the minimal disk IO and decompression time if we didn't need random access.
The timing rows of the table should convincingly make the point that while indexing and random access to a large BGZF file is slower than with the uncompressed file, it is reasonable - but trying to do this with a GZIP file is not a good idea. BGZF clearly trounces GZIP for random access.
Now there is a small catch - you can't do this with Biopython at home yet (update - you can with Biopython 1.60 onwards). Currently the Bio.SeqIO code doesn't support indexing compressed files. Indexing gzipped files was easy to add (we knew from earlier work that performance would be poor, which is why it was never included), but support for BGZF was a bit harder to implement for this experiment. The main problem was that the current code did things like taking the difference between start and end offsets to get an entry's length - and you can't take the difference of BGZF virtual offsets.
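To spell out why, here is a toy example with made-up numbers: the difference of two virtual offsets in different blocks is dominated by the shifted disk offsets, and says nothing about how much data lies between them.

```python
# Hypothetical numbers: a record starts 100 bytes into the decompressed
# block whose compressed data begins at disk offset 5000, and ends 20
# bytes into the block beginning at disk offset 21000.
start_voffset = (5000 << 16) | 100
end_voffset = (21000 << 16) | 20

# The naive difference mixes disk offsets (compressed bytes) with
# within-block offsets (decompressed bytes), so it is not a length:
print(end_voffset - start_voffset)  # 1048575920 - meaningless as a length
```

The record's true length depends on how much data each intervening block decompresses to, which the virtual offsets alone cannot tell you.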
Note that the "official" Bio.SeqIO code (which currently only does uncompressed files) could be a little faster to build the indexes because it assumes a normal file handle where the offsets behave naturally. Also, I'm deliberately timing just the smaller UniProt SProt database, because it has few enough records to safely work with the index in memory. We could ask Biopython to index UniProt TrEMBL with an SQLite3 database for the offsets, but that isn't necessary to demonstrate my point here, which is the raw IO side of things, not the indexing scheme.
Another proviso is that it may be possible to improve the GZIP speed. I'm just using the Python gzip library via gzip.open(...) and the seek/tell methods it provides, so any optimisation (or lack of it) falls to the Python standard library. Internally this calls the Python zlib library, which is also what I'm using to implement BGZF in Biopython. On the other hand, I'm pretty sure that my BGZF code can be sped up after some profiling, which I haven't tried yet.
First, the following quick shell commands extracted all the record names from the FASTA file (using grep and cut), randomised them (using a little Perl), and saved them as plain text, one name per line.
```shell
$ time grep "^>" uniprot_sprot.fasta | cut -c 2- | cut -f 1 -d " " \
    | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' \
    > uniprot_sprot.fasta.rnd

real    0m3.570s
user    0m5.902s
sys     0m0.523s

$ time cut -d "|" -f 2 uniprot_sprot.fasta.rnd > uniprot_sprot.swiss.rnd

real    0m0.361s
user    0m0.329s
sys     0m0.028s
```
Long lines are wrapped at the pipe and redirection characters. These are the input files for my random access trials - the two ID files handle the different ID formats used by default for the different filetypes in Biopython (e.g. tr|F2D7G4|F2D7G4_HORVD from the FASTA header vs just F2D7G4 from the SwissProt plain text file).
The Python code boils down to some outer loops to set the filenames and formats, a call to build an in-memory index (with the experimental support for GZIP and BGZF compression), and another loop to time extracting various records from the file. The timings shown above are the best of three runs:
```python
import time
from Bio import SeqIO

start = time.time()
index = SeqIO.index(filename, format, compression=comp)
time_index = time.time() - start
print "SeqIO.index(%r, %r, compression=%r) took %0.1fs" \
    % (filename, format, comp, time_index)
```
Then I have a generator function to load the identifiers from the text file created earlier:
```python
def load_ids(filename, limit=10):
    count = 0
    with open(filename) as handle:
        for line in handle:
            name = line.strip()
            if name:
                yield name
                count += 1
                if count == limit:
                    break
```
And finally, we time using the index to extract various numbers of records:
```python
for lim in [10, 50, 100, 500, 1000, 5000, 100000, 500000, len(index)]:
    start = time.time()
    handle = open(out_filename, "w")
    for name in load_ids(id_filename, lim):
        handle.write(index.get_raw(name))
    handle.close()
    time_write = time.time() - start
    print "Writing %i in random order took %0.1fs" \
        % (lim, time_write)
    if lim == len(index):
        print "Shuffling %i took %0.1fs" \
            % (lim, time_index + time_write)
```
This requires the BGZF and GZIP support currently on my bgzf branch on github (update - this is included in Biopython 1.60 onwards). That's about it.
BGZF is bigger but better than GZIP for random access
I'm still tweaking the BGZF code but will probably propose merging it into Biopython later this month.
However, the main point of this blog post was to try to raise awareness of BGZF as a suitable alternative to GZIP for when you want to compress a data file while retaining reasonably fast random access:
BGZF files are bigger than GZIP files, but they are much faster for random access. For High Throughput Sequencing (HTS or HTSeq, aka Next Gen Seq or NGS) data, I still believe you should just use SAM/BAM for your unaligned reads, but for FASTQ / FASTA / QUAL files using BGZF is a more flexible alternative to plain GZIP (with a fairly small size overhead).
On the plus side, the separate self-contained GZIP blocks used in BGZF also offer the possibility of multi-threaded decompression and data analysis at the block level, in addition to the random access benefits I've been focusing on here.
If you are interested in the Biopython development side, please join the Biopython-dev mailing list and read this thread, while for any discussion/feedback/comments on the general use of BGZF, I've started a thread on SEQanswers.com for this.
Update (June 2012)
Biopython 1.60 includes module Bio.bgzf for working with BGZF files, and Bio.SeqIO's indexing functions will use this to support BGZF compressed sequence files.