CSV files seem simple. Just plain text, right? But when they get large, they are not great formats for data exploration. Unlike databases, which are optimized for fast queries and efficient storage, CSVs lack indexing, compression, and structured access.
Why do people want to explore CSVs locally anyway?
CSVs have one big advantage: they’re easy to open and play with. You don’t need a database. You don’t need an internet connection. The data is just there. You can quickly scroll through it, search for patterns, and get a feel for what’s inside. There’s something powerful about being able to visually scan raw data.
1️⃣ Fast, Direct Access to Data
When data is in a database, you can’t just open it and look at it. You have to write a query, which means knowing what you’re looking for in advance. CSVs let you explore freely.
2️⃣ No Setup Required
There’s no database to configure, no credentials to manage. If you have the file, you can open it. That’s why so many datasets are shared as CSVs—even though they aren’t ideal for large-scale processing.
3️⃣ Good for Debugging and Spot-Checking
Sometimes, you don’t need to analyze the whole dataset. You just need to spot-check a few values, see what the column names are, or get a quick sense of the data structure. With a database, this takes multiple queries. With a CSV, you just open the file.
4️⃣ Works in Any Environment
CSVs are universal. They work on Mac, Windows, Linux. You don’t need special software. Even if your normal tools fail, you can still process them with basic command-line utilities.
—
That said, if you’ve ever tried to open a multi-gigabyte CSV locally, you’ve probably dealt with slow loading times, unresponsive applications, and even system crashes. This post explores why that happens, from both a technical and a user-experience perspective, and what you can do about it.
The UX nightmare of large CSVs
Downloading a Large CSV: A Slow, Painful Process
Let’s say you need to work with a big dataset. You download a 5GB CSV. What happens?
First, the download takes forever.
Some cloud services won’t even let you download a file that big. Google Drive, for example, might force you to zip it first, which means you now need twice as much disk space to store it.
Then you try to open it.
On a Mac: The default app is Numbers, which tries to load the whole thing into memory. If the file is too big, you’ll just get the spinning beach ball of death.
On Windows: Excel has a hard row limit of 1,048,576 rows. If your file is bigger, it either truncates or crashes.
Google Sheets? Maxes out at 10 million cells. If your dataset is wide, that might mean just a few hundred thousand rows.
Text editors aren’t much better.
Open a multi-gigabyte CSV in Notepad, TextEdit, or VS Code, and you’re probably out of luck.
Even editors designed for large files, like Sublime Text or Vim, struggle once you pass a few GB.
Pandas and Python: Not as easy as you think.
Run pd.read_csv("large_file.csv") in Python and watch your RAM usage spike. If the file is big enough, the process crashes with an out-of-memory error before you ever see a DataFrame.
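One partial workaround, if you do have to use pandas, is its chunksize parameter, which streams the file as a series of smaller DataFrames instead of materializing everything at once. A minimal sketch (the file name and chunk size are placeholders):

```python
import pandas as pd

# Stream the CSV in fixed-size chunks; only one chunk lives in memory at a time.
total_rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total_rows += len(chunk)  # each chunk is an ordinary (small) DataFrame

print(f"Rows: {total_rows:,}")
```

It works, but you’ve now traded interactive exploration for batch processing, which is exactly the trade-off this post is complaining about.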
Why large CSVs make your computer struggle
Most software assumes files are small enough to fit in memory. That’s fine for a 10MB CSV. But when the file gets too big, things break.
Why Does a 5GB CSV Crash Your Computer?
It’s not just about file size. It’s about how CSVs work in memory:
Parsing Overhead – The file isn’t structured data yet. It has to be tokenized, split, and converted.
Data Type Expansion – CSVs store numbers as text, but pandas converts them into native types, which take up more space.
Indexing & Metadata – Most tools add row indices, column names, and other overhead.
A 5GB CSV can easily take 15GB of RAM once loaded. If your laptop has 16GB, that’s a problem. Once memory runs out, the system starts swapping to disk, which is orders of magnitude slower than RAM.
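You can see this expansion for yourself by comparing a file’s size on disk to its size once pandas has loaded it. A minimal sketch, assuming a smaller sample.csv (a placeholder name) that actually fits in memory:

```python
import os
import pandas as pd

path = "sample.csv"  # hypothetical file small enough to load fully
df = pd.read_csv(path)

disk_mb = os.path.getsize(path) / 1e6
ram_mb = df.memory_usage(deep=True).sum() / 1e6  # deep=True counts string objects too
print(f"On disk: {disk_mb:.1f} MB  ->  in memory: {ram_mb:.1f} MB")
```

Text-heavy files tend to expand the most, because every cell becomes a full Python string object.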
Mac vs. Windows: Different Failures, Same Result
On Mac: The system starts swapping aggressively. The whole machine slows down. Eventually, an "Application Not Responding" message appears.
On Windows: You get a "Not Enough Memory" error, or Excel just crashes.
At this point, you’ve lost. Your computer isn’t just slow—it’s effectively locked up, and your data is still trapped inside that CSV file.
Why CSV Operations Are Computationally Expensive
CSVs don’t have indexes. They don’t have efficient storage. Every operation is brute force.
1. Searching a CSV is O(n)
In a database, searching is fast because of indexes. A well-designed index can make lookups O(log n), or even O(1) for hash-based queries.
In a CSV, there’s no index. Every search is a full scan—meaning O(n) time complexity.
Example: You have a 10 million-row CSV and want to find a single customer ID. The only way to do it is to scan every row. In a database, this would take milliseconds. In a CSV, it could take minutes.
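Here is roughly what that full scan looks like in plain Python. The file name customers.csv and the customer_id column are hypothetical, but any lookup on an un-indexed CSV follows the same pattern:

```python
import csv

def find_customer(path, customer_id):
    """Full scan: the only way to find a row in a CSV is to read every row."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):          # O(n): touches every line
            if row["customer_id"] == customer_id:
                return row
    return None

match = find_customer("customers.csv", "C-10293")
```

A database would answer the same question by walking an index straight to the matching row.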
2. Sorting and Filtering are Expensive
Sorting in a database is fast because it’s optimized at the storage level. In a CSV, sorting means:
Reading the whole file into memory.
Performing an O(n log n) sort.
Writing the sorted file back to disk.
If the file doesn’t fit in RAM, it has to be broken into chunks, each chunk sorted separately, and the sorted chunks merged back together (an external merge sort). This is painfully slow.
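For the curious, here is a rough sketch of that chunk-sort-merge approach using only the Python standard library. The input file, sort key, output name, and chunk size are all placeholders, and error handling and temp-file cleanup are omitted:

```python
import csv
import heapq
import tempfile

def external_sort(path, key, out_path="sorted.csv", chunk_rows=100_000):
    """Sort a CSV larger than RAM: sort chunks in memory, then stream-merge them."""
    chunk_files = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        while True:
            chunk = [row for _, row in zip(range(chunk_rows), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda r: r[key])        # O(n log n) per chunk, in memory
            tmp = tempfile.NamedTemporaryFile(
                "w", suffix=".csv", newline="", delete=False
            )
            writer = csv.DictWriter(tmp, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(chunk)
            tmp.close()
            chunk_files.append(tmp.name)

    # heapq.merge streams the already-sorted chunks without loading them all at once.
    readers = [csv.DictReader(open(name, newline="")) for name in chunk_files]
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(heapq.merge(*readers, key=lambda r: r[key]))
```

Note that this sorts everything as text; getting numeric or date ordering right means parsing each value first, which is yet more work the CSV format pushes onto you.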
3. JOINs on CSVs are O(m × n)
In a database, JOINs use indexes and query optimizers. In a CSV, a join is just:
Load both files into memory.
Loop through every row in File A and compare it to every row in File B.
That’s a nested-loop join with O(m × n) complexity, which is completely impractical at scale.
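To make that concrete, here is a sketch of the nested-loop join described above, next to the hash-based approach that databases (and tools like pandas.merge) use instead. The file names and the customer_id join key are hypothetical:

```python
import csv

orders = list(csv.DictReader(open("orders.csv", newline="")))
customers = list(csv.DictReader(open("customers.csv", newline="")))

# Naive nested-loop join: every order is compared to every customer, O(m × n).
joined_slow = [
    {**o, **c}
    for o in orders
    for c in customers
    if o["customer_id"] == c["customer_id"]
]

# Hash join (assuming customer_id is unique per customer): index one side in a
# dict first, bringing the join down to O(m + n).
by_id = {c["customer_id"]: c for c in customers}
joined_fast = [
    {**o, **by_id[o["customer_id"]]}
    for o in orders
    if o["customer_id"] in by_id
]
```

Even the faster version still requires reading both files in full and holding at least one of them in memory, which is the part a database would let you skip.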
We need a better way to explore data locally
CSVs make data easy to access, but once they get too big, everything slows down, crashes, or just stops working. Databases solve this, but they’re too much overhead when you just want to look at the data. Preswald gives you a way to explore large datasets locally, without loading everything into memory or dealing with unresponsive tools.