Projects

Building a Personal AI Workstation with 4x NVIDIA RTX 6000 Pro Blackwell GPUs

I built a personal workstation with 4x RTX 6000 Pro Blackwell Max-Q GPUs (384GB total VRAM), originally designed as an AI workstation that you can keep under your desk and run from a standard 15-amp, 120V power outlet.

The total bill of materials (BOM) cost for this workstation is ~$39.5K.

GPU_Workstation Marco Mascorro

Jensen Huang (Nvidia founder & CEO) signed the first one I built:

GPU_Workstation Marco Mascorro and Jensen Huang

Parts:

  • 4x NVIDIA RTX 6000 PRO Blackwell Max-Q, 384GB total GDDR7 VRAM (96GB per GPU), all cards running on PCIe 5.0 x16 lanes.
  • 8TB of NVMe PCIe 5.0 storage in RAID 0, with 59.6GB/s aggregate theoretical read throughput. NVIDIA GPUDirect Storage (GDS) lets the GPUs fetch data from the NVMe drives via direct memory access (DMA), bypassing the DDR5 system RAM.
  • AMD Threadripper PRO 7975WX (32 cores, 64 threads)
  • 256GB ECC DDR5 RAM
  • 1650 Watts at peak (runs on a standard 15Amp/120V circuit).
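
As a rough sanity check on the figures in the parts list, here is a minimal Python sketch that totals the VRAM and estimates a best-case time to stream enough data from the RAID 0 array to fill it. The function names are illustrative, and the result is only a lower bound, since it assumes the 59.6GB/s theoretical rate is sustained.

```python
# Figures taken from the parts list above.
NUM_GPUS = 4
VRAM_PER_GPU_GB = 96
RAID_READ_GBPS = 59.6  # aggregate theoretical read throughput in GB/s


def total_vram_gb(num_gpus: int = NUM_GPUS, per_gpu_gb: int = VRAM_PER_GPU_GB) -> int:
    """Total VRAM across all cards, in GB."""
    return num_gpus * per_gpu_gb


def min_fill_seconds(payload_gb: float, read_gbps: float = RAID_READ_GBPS) -> float:
    """Lower bound on the time to stream payload_gb at the theoretical read rate."""
    return payload_gb / read_gbps


if __name__ == "__main__":
    vram = total_vram_gb()
    print(f"Total VRAM: {vram} GB")
    print(f"Best-case time to fill all VRAM from NVMe: {min_fill_seconds(vram):.1f} s")
```

Real-world throughput will be lower than the theoretical aggregate, so treat the time as an optimistic floor.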

The RTX 6000 PRO Blackwell Max-Q delivers 24,064 CUDA cores, 96GB of GDDR7 VRAM, PCIe 5.0 x16, and the next-gen Blackwell architecture in a remarkably efficient 300W card.
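
The power claim can be checked with simple arithmetic. The sketch below uses only the numbers quoted above (15 A, 120 V, 1650 W system peak, 300 W per GPU). One assumption to flag: it treats all four GPUs as drawing their full 300 W at system peak, so the "non-GPU" figure is an implied estimate, not a measurement.

```python
CIRCUIT_VOLTS = 120
CIRCUIT_AMPS = 15
SYSTEM_PEAK_W = 1650   # peak draw quoted in the parts list
GPU_POWER_W = 300      # per-GPU power quoted for the Max-Q card
NUM_GPUS = 4


def power_budget() -> dict:
    """Break the quoted peak draw down against the circuit's nominal capacity."""
    capacity_w = CIRCUIT_VOLTS * CIRCUIT_AMPS   # nominal circuit capacity
    gpu_total_w = GPU_POWER_W * NUM_GPUS        # assumes every GPU is at its limit
    return {
        "circuit_capacity_w": capacity_w,
        "gpu_total_w": gpu_total_w,
        "implied_non_gpu_w": SYSTEM_PEAK_W - gpu_total_w,
        "headroom_w": capacity_w - SYSTEM_PEAK_W,
    }


if __name__ == "__main__":
    for name, watts in power_budget().items():
        print(f"{name}: {watts} W")
```

The 300W Max-Q cards are what make this budget work: the GPUs account for 1200 W of the 1650 W peak, which leaves the whole system under the 1800 W nominal limit of the circuit.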

The liquid-cooled AMD Threadripper PRO 7975WX features 32 cores and 64 threads on 5nm Zen 4 (Storm Peak), with DDR5, 128 PCIe 5.0 lanes, 128 MB L3 cache, and clock speeds up to 5.3 GHz boost / 4.0 GHz base.

The workstation also integrates an AST2600 Baseboard Management Controller (BMC), a dedicated processor for remote out-of-band management that operates independently of the host CPU and OS to handle critical monitoring and control tasks.

This was built for a16z portfolio founders at companies I invest in. The full guide on how to build one can be found here.

GPU_Workstation Marco Mascorro

AI GPU Workstation with 8x 4090/5090 GPUs on PCIe 5.0 x16 lanes

I built a couple of GPU workstations with the RTX GPUs. The RTX 4090 and RTX 5090 are absolute beasts. With 24GB of VRAM and 16,384 CUDA cores on the RTX 4090, and 32GB of VRAM and 21,760 CUDA cores on the RTX 5090, both deliver exceptional FP16/BF16 and tensor performance for their cost.

The RTX 3090s were the last RTX GPUs that had NVLink. Starting with the 4090s, the cards have no NVLink interconnect, which is crucial for high GPU-to-GPU bandwidth when training models. This means that PCIe connectivity, using the latest PCIe version each card supports (4.0 for the 4090, 5.0 for the 5090) with x16 lanes, is key to maximizing bandwidth between the cards in an AI workstation.
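
To put numbers on why the PCIe generation matters, here is a small Python sketch of the theoretical per-direction bandwidth of a PCIe link. It uses the standard per-lane transfer rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0) and the 128b/130b line encoding. Protocol overhead is ignored, so achieved throughput will be somewhat lower. The function name is illustrative.

```python
# Raw per-lane transfer rate in GT/s for each PCIe generation.
GT_PER_S = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}

# PCIe 3.0 and later use 128b/130b encoding: 128 payload bits per 130 bits sent.
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_per_s(gen: str, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    usable_gbit_per_lane = GT_PER_S[gen] * ENCODING_EFFICIENCY
    return usable_gbit_per_lane * lanes / 8  # divide by 8 to convert bits to bytes


if __name__ == "__main__":
    for gen in ("4.0", "5.0"):
        print(f"PCIe {gen} x16: {pcie_bandwidth_gb_per_s(gen):.1f} GB/s per direction")
```

Dropping from x16 to x8 halves these figures, which is why keeping every card on a full x16 link matters when there is no NVLink to fall back on.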

As an experiment and for research purposes, I built two nodes of 8x RTX 4090 GPU AI workstations from scratch, which could be compatible with the new RTX 5090 with PCIe 5.0 running at 16x lanes, for training, deploying, and running AI models locally.

The parts used per workstation:

  • Server model: ASUS ESC8000A-E12P
  • GPUs: 8x NVIDIA RTX 4090
  • CPU: 2x AMD EPYC 9254 Processor (24-core, 2.90GHz, 128MB Cache)
  • RAM: 24x 16GB PC5-38400 4800MHz DDR5 ECC RDIMM (384GB total)
  • Storage: 1.92TB Micron 7450 PRO Series M.2 PCIe 4.0 x4 NVMe SSD (110mm)
  • Operating system: Ubuntu Linux 22.04 LTS Server Edition (64-bit)
  • Networking: 2x 10GbE LAN ports (RJ45, X710-AT2), one utilized at 10Gb. You can replace one of these with (or add) a Mellanox card for faster interconnect speeds.
  • Additional PCIe 5.0 card: ASUS 90SC0M60-M0XBN0

This was built only for fun and educational purposes. The full guide can be found here.

GPU_Workstation

Silver medalist, 2024 Arc Prize (14th place, top 1%)

I worked for ~2.5 weeks on making a submission with a custom set of trained models to the ARC challenge competition on Kaggle to see how far we could push AI models on this benchmark.

I am currently in 14th place out of 1427 teams, which earned a silver medal on Kaggle for placing in the top 1% of all competition participants. I plan to keep working on this (still mostly for fun), but training these models can get a bit pricey.

The prize competition has some requirements: the model(s) must fit on a single P100 GPU (16GB) or on two T4s (16GB each), the maximum running time is 12 hours for all the tasks, and the solution must run completely offline, meaning no GPT-4, Claude, or other API calls. That leaves only a few pre-trained models you could use.
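
To see how tight the 16 GB limit is, here is a rough Python sketch that estimates the memory needed for model weights at different precisions. It is not the method used for the submission. The 8-billion-parameter size and the 20% reserve for activations and KV cache (`overhead_fraction`) are illustrative assumptions.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, dtype: str, gpu_gb: float = 16.0,
         overhead_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving a share of memory for everything else.

    overhead_fraction is an assumed reserve for activations, KV cache, and
    framework overhead; the real figure depends on the workload.
    """
    return weights_gb(num_params, dtype) <= gpu_gb * (1.0 - overhead_fraction)


if __name__ == "__main__":
    params = 8e9  # hypothetical 8B-parameter model
    for dtype in ("fp16", "int8", "int4"):
        print(f"{dtype}: {weights_gb(params, dtype):.1f} GB, fits on 16 GB: {fits(params, dtype)}")
```

Under these assumptions, an 8B model in fp16 does not fit on a single 16 GB card, while quantized variants do. That is the kind of trade-off the memory and 12-hour limits force.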

Some approaches:

  • Inference-time search
  • Extended test-time compute
  • Data augmentation and synthetic data of known puzzles (transduction)
  • CoT traces on multiple solutions (positive and negative) with grids (transduction)
  • CoT traces on multiple solutions (positive and negative) with code synthesis (induction)
  • Long training runs over synthetic data
  • Multiple models for overall solution (induction and transduction)
  • Active inference (train model on the fly)
  • Used Llama3 and Qwen models
ARC AGI Puzzle
ARC AGI
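
As an illustration of the data augmentation item in the list above, here is a minimal Python sketch of one common way to augment ARC-style grid puzzles: apply the same rotation, flip, or color relabeling to both the input and output grid of a pair. This is a generic technique, not necessarily the pipeline used in this submission, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]  # a puzzle grid: rows of integer color codes


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def permute_colors(grid: Grid, mapping: Dict[int, int]) -> Grid:
    """Relabel colors; colors missing from the mapping are left unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


def augment_pair(inp: Grid, out: Grid) -> List[Tuple[Grid, Grid]]:
    """Return the 8 rotation/flip variants of an (input, output) pair.

    The same transform is applied to both grids so the pair stays consistent.
    """
    variants = []
    a, b = inp, out
    for _ in range(4):
        variants.append((a, b))
        variants.append((flip_horizontal(a), flip_horizontal(b)))
        a, b = rotate90(a), rotate90(b)
    return variants


if __name__ == "__main__":
    pairs = augment_pair([[1, 0], [0, 0]], [[0, 0], [0, 1]])
    print(f"{len(pairs)} variants generated from one pair")
```

For the augmented task to stay solvable, the same transform has to be applied to every example pair within that task, not just to one of them.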

JungleGym AI | A playground for testing and developing autonomous agents with AI models (2023)

JungleGym is an open source playground for testing and developing autonomous agents with different agent datasets and benchmarks, like WebArena, Mind2Web and AgentInstruct. The playground has realistic, fully functional, sandboxed web sites, benchmarks and an API, all in a single environment.

JungleGym GitHub Repo

Llama2.ai | The first llama2 chatbot interface (2023)

I built and deployed Llama2.ai, the first AI chatbot interface for the Llama2 models, which became the most popular interface for the Llama2 models when Mark Zuckerberg announced them. It reached over 3000 concurrent requests. The system was built on 100 load-balanced servers and hundreds of GPUs (H100s). (Llama2.ai is now run by Replicate.)

Llama2 Github repo

AnyPod.ai | Semantic Search Engine for YouTube/Podcasts (2022)

A weekend project to search for phrases, ideas or semantic questions on YouTube/Podcasts. Used Whisper, an embedding model (all-mpnet-base-v2) with Sentence Transformers, and FAISS to return the top-k (k = 20) results across multiple YouTube channels simultaneously.
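
The retrieval step can be sketched in a few lines of Python. This is a simplified stand-in and not the AnyPod code: it swaps FAISS for a brute-force cosine-similarity search using only the standard library, and the toy 2-D vectors stand in for real sentence embeddings. All names are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: Sequence[Tuple[str, Sequence[float]]],
          k: int = 20) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs, most similar first."""
    scored = ((cosine(query, vec), text) for text, vec in corpus)
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    # Toy transcript segments with made-up 2-D "embeddings".
    corpus = [
        ("gpu training", [1.0, 0.0]),
        ("cooking pasta", [0.0, 1.0]),
        ("gpu inference", [0.9, 0.1]),
    ]
    for score, text in top_k([1.0, 0.0], corpus, k=2):
        print(f"{score:.3f}  {text}")
```

A vector index such as FAISS does the same job as `top_k` here, but is built to stay fast when the corpus grows to many transcript segments.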

AnyPod Demo

arXivGPT | Twitter bot that posts the most relevant/trending AI papers on arXiv.org (2023)

I built arXivGPT originally for myself while trying to filter through and read the most relevant AI papers on arXiv.org, as there was a lot of noise. ArXivGPT creates summaries of trending AI papers on arXiv (using a set of heuristics), listing the most important points and authors.

YouTranscription | Automated Whisper transcripts from YouTube channels (2022)

A weekend project to automate the transcription of YouTube videos or channels into searchable transcripts.

YouTranscription Demo

LoweBot | An autonomous customer service and inventory management Robot (2016)

LoweBot is a robot that was deployed across multiple retailers, including Lowe's Home Improvement stores. It's an autonomous customer service robot that helps customers find items in stores, understands multiple languages, and scans inventory overnight using computer vision.

SIMPL Demo

SIMPL (2016)

I trained a custom AI model that makes it easy to program industrial robot arms with no coding required, only by drawing bounding boxes over objects in an easy-to-use interface. It uses a custom single-shot CNN trained on thousands of objects and running locally.

SIMPL Demo

AI model for weapon detection (2022)

I trained an AI model to detect weapons in footage from low-resolution security cameras, running in real time (~25fps) on constrained compute (the model is optimized for local hardware), and to send SMS alerts.

Camera detection Demo

Covid-19 Ventilator (2020)

I built an open-source Covid ventilator in response to the shortage of ventilators in 2020, which helped over 200 teams around the world build and deploy ventilators. The first version ran on a Raspberry Pi. Later versions used a custom PCB with an ARM chip.

Ventilator Demo

Covid Contact Tracing by precise proximity with Ultra-wideband (2020)

As many of us were impacted by Covid and wanted to help the community during that difficult time, we created a Covid contact tracing device based on precise distance measurement with Ultra-wideband (UWB). The contact tracer helps people socially distance (via haptic feedback) and keeps track of every contact within 10 feet precisely.
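
For readers unfamiliar with UWB ranging, here is a minimal Python sketch of how a distance and a contact check could be computed. It assumes single-sided two-way ranging, a common UWB scheme, and the device described here may work differently. The function names and timing values are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
FEET_PER_METER = 3.28084


def twr_distance_m(t_round_s: float, t_reply_s: float) -> float:
    """Distance from single-sided two-way ranging.

    t_round_s: time from sending a poll to receiving the response (initiator clock).
    t_reply_s: time the responder took to answer (responder clock).
    The one-way time of flight is half of the difference.
    """
    time_of_flight_s = (t_round_s - t_reply_s) / 2.0
    return time_of_flight_s * SPEED_OF_LIGHT_M_PER_S


def is_contact(distance_m: float, threshold_ft: float = 10.0) -> bool:
    """True if the measured distance is within the contact threshold (10 ft by default)."""
    return distance_m * FEET_PER_METER <= threshold_ft


if __name__ == "__main__":
    # Hypothetical timings: 200 microsecond reply delay plus a 2 m round trip.
    t_reply = 200e-6
    t_round = t_reply + 2 * (2.0 / SPEED_OF_LIGHT_M_PER_S)
    d = twr_distance_m(t_round, t_reply)
    print(f"distance: {d:.2f} m, contact: {is_contact(d)}")
```

Light travels about 30 cm per nanosecond, so this kind of ranging depends on the sub-nanosecond timestamping that UWB radios provide.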

Contact tracing Demo

Vision-based, autonomous wheelchair robot (2011)

I developed an autonomous wheelchair robot (KuruRobo) using VSLAM for disabled people while I was doing Computer Vision research at the Kanazawa Institute of Japan.

KuruRobo Demo

Other random ones in Japan

KuruRobo Demo

© Marco Mascorro. Built using Pelican. Theme by Giulio Fidente on github.