A technical intro to Ibis: The portable Python DataFrame library
How Ibis simplifies analytics across multi-backend systems with a unified Python API
We recently explored Ibis, a Python library for working with data across different storage systems and processing engines. This post condenses what we learned about how Ibis works, its features, and some potential applications, for readers curious about tools that enable multi-backend data workflows.
Ibis offers a DataFrame-like API, similar to Pandas, but works by translating Python operations into backend-specific queries. This approach allows it to interact with various systems like SQL databases, analytical engines (e.g., BigQuery or DuckDB), and even in-memory tools like Pandas.
Why is this useful?
Before diving into the specifics of Ibis, let’s step back and consider the typical challenges of working with data at scale.
Data analysis in Python often involves Pandas for in-memory operations or SQL for querying databases. But as datasets grow or span multiple backends (e.g., PostgreSQL, DuckDB, BigQuery), juggling these tools becomes cumbersome. Ibis aims to bridge this gap with a consistent, portable API for interacting with diverse backends. Three problems stand out:
Fragmentation. Data is often stored across multiple systems (transactional databases, analytical warehouses, and distributed storage engines), each requiring a different approach for querying and analysis.
Scalability. Tools like Pandas are great for in-memory computations but fall short when datasets outgrow available memory.
Redundant logic. Writing the same transformation logic in Python, SQL, and other languages for different systems introduces inefficiencies and opportunities for error.
Ibis attempts to solve these problems by acting as a middle layer that standardizes interaction with various backends. Its lazy evaluation model and backend translation layer let developers write backend-agnostic code, while each backend's own engine plans and executes the resulting query.
How it works
At a high level, Ibis provides a way to define analytical workflows in Python and delegate their execution to various backends. Its architecture is designed to balance user simplicity with backend optimization. Here’s how it works:
1. API Layer
Ibis provides a Pythonic API that resembles Pandas, with methods like .filter(), .mutate(), .group_by(), and .aggregate(). This abstraction makes it easy for developers to describe transformations without worrying about how they will be executed.
2. Translation Layer
Once operations are defined, Ibis translates them into queries or transformations native to the backend. For example:
A .filter() operation in Ibis translates to a SQL WHERE clause for SQL backends.
For in-memory engines like Pandas, Ibis performs the operations directly in Python.
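The translation step can be pictured with a toy compiler. The sketch below is not how Ibis is implemented internally; the `Filter` class and `compile_filter` function are hypothetical names made up for illustration. It only shows the idea: one backend-neutral description of a filter, rendered as a WHERE clause for two SQL dialects that quote identifiers differently.

```python
from dataclasses import dataclass

# Identifier quoting differs by dialect: PostgreSQL uses double quotes,
# MySQL uses backticks by default.
QUOTE = {"postgres": '"', "mysql": "`"}
ALLOWED_OPS = {">", ">=", "<", "<=", "=", "<>"}


@dataclass(frozen=True)
class Filter:
    """Backend-neutral description of `table.filter(column <op> value)`."""
    table: str
    column: str
    op: str
    value: float


def compile_filter(f: Filter, dialect: str) -> str:
    """Render the neutral description as a SELECT with a WHERE clause."""
    if f.op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {f.op}")
    q = QUOTE[dialect]
    return (
        f"SELECT * FROM {q}{f.table}{q} "
        f"WHERE {q}{f.column}{q} {f.op} {f.value!r}"
    )


# The same description compiles to different SQL text per backend.
high_revenue = Filter(table="sales", column="revenue", op=">", value=100)
postgres_sql = compile_filter(high_revenue, "postgres")
mysql_sql = compile_filter(high_revenue, "mysql")
print(postgres_sql)
print(mysql_sql)
```

The user-facing description never mentions a dialect; only the compile step does. That separation is what lets the same Python code target several engines.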
3. Execution Layer
The actual execution happens only when explicitly triggered with .execute(). Because nothing runs until then, Ibis can compile the whole chain of operations into a single query, so the backend retrieves and processes only the data that is actually needed.
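To make deferred execution concrete, here is a minimal sketch using only Python's built-in sqlite3 module. `LazyQuery` is a hypothetical stand-in written for this post, not an Ibis class. Building the query only records the filters, and SQLite's trace callback confirms that no statement reaches the database until `execute()` is called.

```python
import sqlite3

ALLOWED_OPS = {">", ">=", "<", "<=", "=", "<>"}


class LazyQuery:
    """Records filters; touches the database only in execute()."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def filter(self, column, op, value):
        # Return a new object so earlier queries stay reusable.
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        return LazyQuery(self.table, self.conditions + ((column, op, value),))

    def compile(self):
        # Build SQL text plus bound parameters; still no database access.
        sql = f'SELECT * FROM "{self.table}"'
        if self.conditions:
            clauses = [f'"{col}" {op} ?' for col, op, _ in self.conditions]
            sql += " WHERE " + " AND ".join(clauses)
        return sql, [value for _, _, value in self.conditions]

    def execute(self, con):
        sql, params = self.compile()
        return con.execute(sql, params).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, revenue REAL, cost REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 250.0, 100.0), ("south", 80.0, 50.0), ("east", 120.0, 130.0)],
)
con.commit()

# From here on, record every statement SQLite actually runs.
statements_seen = []
con.set_trace_callback(statements_seen.append)

query = LazyQuery("sales").filter("revenue", ">", 100).filter("cost", "<", 120)
seen_before_execute = list(statements_seen)  # still empty: nothing has run

rows = query.execute(con)  # both filters go to the database as one query
print(seen_before_execute)
print(rows)
```

Both filters end up in a single WHERE clause, so the database does the filtering and returns only the matching rows.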
Core Features
1. Multi-Backend Support
Ibis supports several backends, including:
SQL Databases: PostgreSQL, MySQL, SQLite
Analytical Engines: BigQuery, ClickHouse, DuckDB
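Distributed Systems: PySpark, Dask (experimental; dropped in recent Ibis releases)
In-Memory Tools: Pandas (in older releases; recent Ibis versions have dropped this backend and use DuckDB by default for local data)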
This flexibility allows users to define their workflows in a backend-agnostic way, switching between systems with minimal changes.
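A toy illustration of what "backend-agnostic" means in practice, using only the standard library. Everything here (`big_orders`, `run_on_sqlite`, `run_in_memory`) is a hypothetical name for this sketch and not part of Ibis. A single description of the work is handed to two very different engines, and both return the same rows.

```python
import operator
import sqlite3

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# One backend-neutral description of the work: orders with amount > 100.
big_orders = {"table": "orders", "column": "amount", "op": ">", "value": 100}


def run_on_sqlite(query, con):
    """'SQL backend': translate the description to SQL and let SQLite run it."""
    if query["op"] not in OPS:
        raise ValueError(f"unsupported operator: {query['op']}")
    sql = (
        'SELECT id, amount FROM "{table}" '
        'WHERE "{column}" {op} ? ORDER BY id'
    ).format(**query)
    cursor = con.execute(sql, (query["value"],))
    return [dict(zip(("id", "amount"), row)) for row in cursor]


def run_in_memory(query, tables):
    """'In-memory backend': apply the same description directly in Python."""
    test = OPS[query["op"]]
    return [
        row
        for row in tables[query["table"]]
        if test(row[query["column"]], query["value"])
    ]


orders = [
    {"id": 1, "amount": 50.0},
    {"id": 2, "amount": 150.0},
    {"id": 3, "amount": 300.0},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (:id, :amount)", orders)

sqlite_rows = run_on_sqlite(big_orders, con)
memory_rows = run_in_memory(big_orders, {"orders": orders})
print(sqlite_rows == memory_rows)
```

Only the runner changes between engines; the description of the work does not.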
2. Lazy Evaluation
Operations in Ibis are evaluated lazily. This means that transformations like filtering or aggregating data are not executed immediately but are instead compiled into a query plan. The actual execution happens only when .execute() is called. For example:
import ibis
# Connect to DuckDB
con = ibis.duckdb.connect("example.db")
# Define a query
table = con.table("sales")
query = table.filter(table.revenue > 100).mutate(profit=table.revenue - table.cost)
# No execution yet
print(query) # Outputs an Ibis expression object
# Trigger execution
result = query.execute()
print(result)
Lazy evaluation allows Ibis to optimize queries before execution, leveraging the backend's capabilities.
3. Unified API
The Ibis API is consistent across backends, allowing the same code to be used with different systems. For example:
# Querying with PostgreSQL
pg_con = ibis.postgres.connect(database="example", user="user", password="password")
pg_table = pg_con.table("orders")
pg_result = pg_table.filter(pg_table.amount > 100).execute()
# Switching to DuckDB
duck_con = ibis.duckdb.connect("example.db")
duck_table = duck_con.table("orders")
duck_result = duck_table.filter(duck_table.amount > 100).execute()
The code above works similarly for both PostgreSQL and DuckDB, with the only difference being the connection setup.
4. Extensibility
Ibis allows developers to define custom functions and extend its capabilities. It also supports defining custom backends, which can be useful for proprietary systems.
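As a rough analogy for what a custom function involves, the sketch below registers a Python function with SQLite through the standard library's sqlite3 module and then calls it from SQL. This is SQLite's own mechanism, not Ibis's: Ibis groups its user-defined function support under `ibis.udf`, and the official documentation is the place to check the exact decorators. The `profit_margin` function is a hypothetical example.

```python
import sqlite3


def profit_margin(revenue, cost):
    """Hypothetical custom function: margin as a fraction of revenue."""
    if not revenue:
        return None  # avoid division by zero; surfaces as SQL NULL
    return (revenue - cost) / revenue


con = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as a two-argument function.
con.create_function("profit_margin", 2, profit_margin)

con.execute("CREATE TABLE sales (region TEXT, revenue REAL, cost REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 200.0, 150.0), ("south", 0.0, 10.0)],
)

margins = con.execute(
    "SELECT region, profit_margin(revenue, cost) FROM sales ORDER BY region"
).fetchall()
print(margins)
```

The pattern is the same wherever it appears: the function is written once in Python, registered under a name, and then used inside queries like any built-in.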
Use-cases
Ibis is not a replacement for tools like Pandas or SQL but rather a complementary tool that fills specific gaps in multi-backend workflows:
Prototyping with local data. Ibis can use Pandas as a backend for local prototyping, making it easy to scale the same logic to a distributed system.
Abstracting backend complexity. Developers can work in Python without needing to learn or adapt to backend-specific query languages.
Data pipelines. Ibis can be part of a pipeline that integrates data from multiple systems, applying transformations consistently across different sources.
You might begin by exploring data locally in Pandas, but as datasets grow or workflows expand to involve SQL databases or analytical engines like BigQuery, you’re forced to rewrite your logic for each backend. This repetition adds friction and complexity, especially when working across diverse systems. Ibis addresses these challenges by providing a single, Pythonic API that works consistently across multiple backends, allowing you to write backend-agnostic code that scales.
Resources
Learn more.
Official Ibis documentation: https://ibis-project.org
GitHub repository: https://github.com/ibis-project/ibis