A Python implementation of the interval binning scheme ====================================================== These are some utility functions for working with the interval binning scheme as used in the `UCSC Genome Browser `_. This scheme can be used to implement fast overlap-based querying of intervals, essentially mimicking an `R-tree `_ index. .. note:: Some database systems natively support spatial index methods such as R-trees. See for example the `PostGIS `_ extension for PostgreSQL. Although in principle the method can be used for binning any kind of intervals, be aware that the largest position supported by this implementation is :math:`2^{29}` (which covers the longest human chromosome). Usage ----- Let's say you have a set of intervals `I` in a database system without support for spatial indexing. Querying `I` on overlap with an interval `q` can be done as: .. math:: \{ i \in I ~|~ \mathrm{overlapping}(i, q) \} \text{where} \mathrm{overlapping}(i, q) = i.start < q.stop \wedge i.stop > q.start But this will be slow, even with normal B-tree indexes on `start` and `stop`. If for each interval `i`, we also store its bin as given by :func:`.assign_bin` (and we index it), we can get the same result much faster by pre-filtering on :func:`.overlapping_bins`: .. math:: \{ i \in I ~|~ i.bin \in \mathrm{overlapping\_bins}(q) \wedge \mathrm{overlapping}(i, q) \} Similarly, if `i` must completely contain `q` (or vice versa), you can use :func:`.containing_bins` (or :func:`.contained_bins`). SQLAlchemy example ^^^^^^^^^^^^^^^^^^ As a more concrete example, let's consider the following `SQLAlchemy `_ model definition for storing gene locations:: class Gene(Base): name = Column(String, primary_key=True) start = Column(Integer) stop = Column(Integer) bin = column(Integer, index=True) def __init__(self, name, start, stop): self.name = name self.start = start self.stop = stop self.bin = binning.assign_bin(start, stop) The `bin` column is populated using :func:`.assign_bin` and has an index. We can ask for all genes spanning the 50,000-50,500 interval:: >>> session.query(Gene).filter(Gene.start < 50000, ... Gene.stop > 50500).count() 78 But that query will be slow, yielding a sequential table scan. Adding indexes on `Gene.start` or `Gene.stop` will not help much. The :func:`.containing_bins` function gives us all the bins potentially containing genes spanning some interval. We can use that to filter on `Gene.bin` first:: >>> bins = binning.containing_bins(50000, 50500) >>> session.query(Gene).filter(Gene.bin.in_(bins), ... Gene.start < 50000, ... Gene.stop > 50500).count() 78 This query will be much faster because it can use the index on `Gene.bin`. The filter on bin only gives us a crude pre-selection though, so we still have to apply the `Gene.start` and `Gene.stop` filters on the (relatively small) resulting set of genes to get the exact answer. Installation ------------ To install the latest release via PyPI using pip:: pip install interval-binning The latest development version `can be found on GitHub `_. API documentation ----------------- .. module:: binning All positions and ranges in this module are zero-based and open-ended, following standard Python indexing and slicing. .. autofunction:: assign_bin .. autofunction:: overlapping_bins .. autofunction:: containing_bins .. autofunction:: contained_bins .. autofunction:: covered_interval Copyright --------- This library is licensed under the MIT License, meaning you can do whatever you want with it as long as all copies include these license terms. The full license text can be found in the LICENSE.rst file. See the AUTHORS.txt for for a complete list of copyright holders.