Setting Up a Mac for Data Analysis and Pandas/R Workflows
A data analyst's Mac setup for 2026. Python, R, Jupyter, conda, uv, RStudio, dataset storage, and the workflow that handles 10 GB CSVs without crashing.
Friday afternoon, you opened a 4.2 GB customer transaction CSV in Pandas, ran a groupby that didn’t fit in 16 GB of RAM, watched the M2 MacBook Air swap to disk for 15 minutes, and finally killed the kernel. Then you remembered the parallel R script someone in the Slack channel was running comfortably on their MacBook Pro with 36 GB. Data work is one of the few Mac workflows where RAM is the real bottleneck, not storage or CPU.
Here’s a Mac setup for data analysis that holds up against real-world dataset sizes, with the language toolchains, environment management, and workflow choices that pay off daily.
Hardware: RAM is the answer
For most data analysis (datasets under 5 GB, exploratory work, dashboards):
- MacBook Pro M3 or M4 Pro, 36 GB RAM, 1 TB SSD. The 36 GB is the meaningful upgrade — Pandas, scikit-learn, and R all benefit from headroom.
For ML, big datasets, and serious computation:
- MacBook Pro M4 Max, 48–64 GB RAM, 1–2 TB SSD. The Max chip’s GPU cores and unified memory are great for PyTorch, TensorFlow, and Apple’s MLX on Apple Silicon.
- Mac Studio M4 Max or M4 Ultra, 64–192 GB RAM, 2–4 TB SSD. The right machine if you’re loading 30 GB into memory regularly or training models locally.
For lighter work (SQL, BI tools, dashboards):
- MacBook Air M3 or M4, 24 GB RAM, 512 GB SSD is enough.
The single biggest mistake: 16 GB RAM. Works for tutorials, dies on real datasets. 24 GB is the floor for serious data work; 36+ GB is comfortable.
Python toolchain
The 2026 Python data stack on Mac:
- Python via Homebrew or pyenv, never the system Python.
- uv (the new pip and venv replacement, written in Rust, dramatically faster) — emerging as the standard for package management.
- conda or mamba if you need the conda ecosystem — pre-built scientific binaries, plus the GPU/CUDA tooling that is mostly Linux-only but includes some Mac-relevant packages.
- Jupyter Lab or VS Code with the Jupyter extension — VS Code is winning the market in 2026.
- Cursor if you want AI-assisted analysis directly in the editor.
Install path:
brew install uv
uv venv .venv
source .venv/bin/activate
uv pip install pandas numpy scikit-learn matplotlib seaborn jupyterlab
Per-project virtual environments are non-negotiable. Different projects need different package versions, and global Python becomes a tangled mess in a year.
For Apple Silicon-native ML:
- PyTorch has a Metal Performance Shaders (MPS) backend — it runs on Apple Silicon GPUs, slower than CUDA but real (a quick check follows this list).
- MLX is Apple’s own framework, optimized for unified memory. Surprisingly fast on M3 Max and Ultra chips.
- TensorFlow has Mac builds via tensorflow-metal.
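A quick sanity check that PyTorch actually sees the Apple GPU (a minimal sketch, assuming a recent PyTorch build with MPS support):

import torch

# Fall back to CPU if this build lacks the MPS backend
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(2048, 2048, device=device)
print(device, (x @ x).mean().item())  # run a matmul to confirm the device works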
R toolchain
The 2026 R stack:
- R from CRAN (cran.r-project.org) — the official source, Apple Silicon native build.
- RStudio Desktop — still the standard IDE, free, native Apple Silicon.
- Positron — Posit’s new IDE replacing RStudio, multi-language, beautifully built.
- renv for project-specific package management (R’s equivalent of venv).
- Quarto — the Jupyter/R Markdown successor, multi-language reproducible documents.
Install:
brew install --cask r
brew install --cask rstudio
brew install --cask quarto
R packages worth installing globally:
install.packages(c("tidyverse", "data.table", "arrow", "DBI", "plumber", "shiny", "rmarkdown", "knitr", "renv"))
For high-performance data work in R, data.table and arrow outperform dplyr on big datasets. The tidyverse is more readable; data.table is faster.
Notebooks and IDEs
The data analyst workflow lives in notebooks and IDEs.
Jupyter Lab — the classic. Browser-based, kernel-per-language, highly extensible. Still the standard for exploratory work.
VS Code with Jupyter extension — increasingly popular. Better git integration, extensions for everything, runs notebooks inline with proper variable inspection.
RStudio — the dominant R IDE for years; still excellent.
Positron — Posit’s new cross-language IDE, replacing RStudio. Native R and Python support, built on VS Code’s foundation. Worth trying in 2026.
DataSpell — JetBrains’ data-specific IDE. Powerful but heavy.
Marimo — a reactive Python notebook environment, gaining traction.
For Jupyter: install the jupyterlab-system-monitor extension to see RAM and CPU usage live. Critical for noticing when a cell is about to OOM.
Database and data infrastructure
Most data work touches at least one database.
Local databases:
- DuckDB — the breakout 2024–26 tool. SQL on Parquet/CSV/Pandas DataFrames, dramatically fast. Install via pip install duckdb. Use it for analytical queries on local files; it replaces a lot of Pandas grunt work (see the sketch after this list).
- Postgres via Postgres.app — for development and analysis on real databases.
- SQLite — for everything that fits on disk; the unsung hero.
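To make the DuckDB point concrete, here is a minimal sketch; the file and column names (transactions.csv, customer_id, amount) are stand-ins for your own data:

import duckdb

# DuckDB scans the CSV directly; the full file never materializes in memory
top = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'transactions.csv'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").df()  # .df() converts only the small result set to a Pandas DataFrame
print(top)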
GUI database clients:
- TablePlus — fast, native, $89. Most data folks’ favorite.
- DBeaver — free, Java-based, works with everything.
- DataGrip — JetBrains, expensive but powerful.
Cloud connections:
- Snowflake, BigQuery, and Redshift clients via Python (snowflake-connector-python, google-cloud-bigquery, redshift-connector); a sketch follows this list.
- AWS CLI, gcloud CLI, az CLI — install via Homebrew, configure once.
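The warehouse pattern is the same in each client: push the aggregation to the warehouse, pull back only the result. A minimal BigQuery sketch, assuming credentials and a default project are already configured, with a hypothetical table name:

from google.cloud import bigquery

client = bigquery.Client()
# The GROUP BY runs in BigQuery; only the per-region counts come back
df = client.query(
    "SELECT region, COUNT(*) AS n FROM `my_project.sales.orders` GROUP BY region"
).to_dataframe()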
Storage strategy for datasets
Data work generates a lot of intermediate files. Plan for it.
Structure:
~/Data/
  Projects/
    project-name/
      data/
        raw/        <- never modified
        interim/    <- intermediate processing
        processed/  <- final analysis-ready
      notebooks/
      src/
      reports/
  Datasets/         <- shared, frequently-used
  Archive/          <- old projects
Storage tiers:
- Internal SSD: active project data, small reference datasets.
- External NVMe: large datasets you work with regularly (1–4 TB).
- Cloud storage: S3, GCS, or Azure Blob for the truly large stuff. Connect via boto3, google-cloud-storage, etc.
File formats:
- Parquet for analytical data. Columnar, compressed, schema-aware. Pandas, R, DuckDB, Spark all read it natively.
- CSV only when interoperability demands it. Slow, no schema, no compression.
- JSON / JSONL for nested data and APIs.
- Feather/Arrow IPC for fast cross-language transfer.
Switching from CSV to Parquet often cuts file size by 5–10x and read times by 10–50x. Worth doing on any dataset over 100 MB.
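The conversion itself is two lines (a sketch; events.csv is a stand-in for your own file, and to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

df = pd.read_csv("events.csv")
df.to_parquet("events.parquet")  # read it back later with pd.read_parquet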
Visualization and reporting
The 2026 visualization landscape on Mac:
- matplotlib + seaborn — the Python defaults, still essential.
- plotly — interactive web-based charts, good for dashboards.
- altair — grammar-of-graphics in Python, lovely for analytical work.
- ggplot2 — the R default, still the gold standard for static publication-quality charts.
- Observable — JavaScript notebooks, increasingly used by data teams.
- Streamlit, Dash, Shiny — for quick dashboards from Python or R.
- Quarto — multi-language reproducible reports, beautiful HTML and PDF output.
- Tableau, Power BI, Looker — BI tools; install desktop versions if your team uses them.
For exploratory work: matplotlib and altair in notebooks. For shared dashboards: Streamlit or Shiny. For polished reports: Quarto.
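For a feel of the grammar-of-graphics style, a minimal altair sketch with made-up data:

import altair as alt
import pandas as pd

df = pd.DataFrame({"day": range(30), "revenue": [100 + 3 * d for d in range(30)]})
# Map columns to visual channels; altair infers scales and axes
chart = alt.Chart(df).mark_line().encode(x="day", y="revenue")
chart.save("revenue.html")  # self-contained interactive HTML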
Performance and memory tactics
Data work on Mac runs into the unified memory wall fast. Tactics:
- Lazy evaluation: use Polars or Dask instead of Pandas for datasets that don’t fit in memory. Polars in particular is dramatically faster than Pandas for many operations and supports lazy/streaming evaluation (see the sketch after this list).
- DuckDB for SQL on big files: query a 50 GB Parquet file from DuckDB without loading it into memory. Game-changer for analytical work.
- Sample first, scale second: develop your analysis on a 1% sample, then run on the full dataset overnight.
- Profile memory: memory_profiler for Python, lobstr and bench for R. Find the cell that doubles memory use.
- Apache Arrow as an in-memory format — pyarrow in Python, arrow in R. Faster than Pandas for many operations.
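Here is what the lazy Polars pattern looks like (a sketch with hypothetical file and column names). scan_csv builds a query plan instead of reading the file; collect() runs the optimized plan, and Polars can stream it for out-of-core datasets:

import polars as pl

result = (
    pl.scan_csv("transactions.csv")  # lazy: nothing is read yet
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
    .collect()  # the plan executes here
)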
For ML on Apple Silicon: PyTorch + MPS backend works for training models that fit in unified memory. Beyond that, train on cloud GPUs (vast.ai, Lambda Labs, AWS) and bring weights back.
Maintenance for data Macs
Data Macs accumulate specific kinds of clutter:
- conda environments — ~/miniconda3/envs/ or ~/anaconda3/envs/. Each environment is 1–10 GB. Run conda env list and remove ones you haven’t used in 6 months.
- pip cache — ~/Library/Caches/pip/. Clear with pip cache purge.
- Jupyter kernel registrations — jupyter kernelspec list and remove dead ones.
- R package binaries at ~/Library/R/ — large, mostly fine to leave, but can be reinstalled if corrupt.
- Notebook checkpoints — .ipynb_checkpoints folders accumulate. Add them to .gitignore and clear periodically (a cleanup sketch follows this list).
- Datasets in ~/Downloads/ — every analyst’s Downloads folder has 30 GB of CSVs from 2 years ago.
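A small cleanup sketch for those checkpoint folders, assuming projects live under ~/Data as in the layout above:

from pathlib import Path
import shutil

# Delete every .ipynb_checkpoints folder under ~/Data
for ckpt in (Path.home() / "Data").rglob(".ipynb_checkpoints"):
    shutil.rmtree(ckpt)
    print(f"removed {ckpt}")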
Monthly: clean conda environments, clear pip and Jupyter caches, review project storage. Quarterly: archive old project folders to external storage, audit installed packages. Annually: rebuild Python and R from scratch (catches stale packages and version drift).
Data analysis on Mac is excellent in 2026. The toolchain is mature, the Apple Silicon performance is strong, and the workflows scale to most real-world dataset sizes. Set up the environment carefully once, and the work flows.