March 12, 2025

How I Broke My 20-Year SPSS Habit To Vibe Code Something in a Python Colab Notebook

By Gene Kim

So, I’m writing a book with Steve Yegge (famous for his 20 years at Amazon and Google) on how developers can use GenAI to do amazing things. The title of the day is something like “Chat-Oriented Programming,” or “Chat or Vibe Engineering For Professionals,” or “No Vibe Coding When I’m On Call” (thank you, Jessie Young!). (The book product page just got listed here.)

The purpose of the book is to share some of the amazing and even life-changing moments we’ve both had using AI for coding. For instance, in December, I yet again had my mind blown by how much GenAI is changing my mind about what coding even is.

Last month, Dr. Andrej Karpathy coined the term “vibe coding” in a widely shared tweet. It’s “where you fully give in to the vibes, embrace exponentials, and forget that the code even exists,” he wrote. “I just talk…, I barely even touch the keyboard. I ask for the dumbest things like ‘decrease the padding on the sidebar by half’ because I’m too lazy to find it. I ‘Accept All’ always, I don’t read the diffs anymore. When I get error messages, I just copy/paste them in with no comment, usually that fixes it.”

Unlike Dr. Karpathy, I didn’t turn my brain off for this particular project. I spent an hour in deep concentration, typing constantly, because each step required doing things I’ve never done before, in a language I’ve barely used.

For the last 20+ years, I’ve used SPSS as my weapon of choice for statistical analysis. SPSS is fascinating because at its core, it’s built on the SPSS Syntax programming language that dates back to 1968. Even SPSS fans acknowledge that we’re using the best technology that the punchcard and teletype eras had to offer. Here’s what Syntax code looks like:

DESCRIPTIVES
VARIABLES=age income
/STATISTICS=MEAN STDDEV MIN MAX.
COMPUTE new_var = age * 2.

You can almost see the punchcards, right?
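For contrast, here’s roughly the same thing in pandas — a sketch, with made-up data standing in for an SPSS dataset:

```python
import pandas as pd

# Toy data standing in for an SPSS dataset
df = pd.DataFrame({"age": [34, 51, 29], "income": [52000, 87000, 41000]})

# Roughly DESCRIPTIVES VARIABLES=age income /STATISTICS=MEAN STDDEV MIN MAX.
stats = df[["age", "income"]].agg(["mean", "std", "min", "max"])

# Roughly COMPUTE new_var = age * 2.
df["new_var"] = df["age"] * 2
print(stats)
```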

If you used SPSS before 1992, you had to program in Syntax directly. In 1992, they shipped a Windows GUI that made routine statistical calculations easy, later converting it to Java Swing in 2007. However, these GUIs are just wrappers around Syntax.

I loved using SPSS, because they’ve spent nearly sixty years optimizing it to make statisticians more productive. One example is how amazing it is at importing and converting data into a usable form for analysis. It deals incredibly well with messy spreadsheets, inconsistent date formats, or datasets riddled with encoding issues, magically doing conversions and employing sensible strategies to preserve as much data as possible. It automatically detects data types, detects categorical variables, handles missing data well…

In comparison, doing statistics in almost anything else seemed to require hours of painstaking and extensive scripting to clean the data. Suddenly, you’re writing tons of regular expressions, writing logic to either manipulate the data or selectively ignore it, just so you can get the data in a format where you can start analyzing it.

(I’m getting agitated just writing this, because I’ve done this so many times in Perl, Ruby, R, Python, etc. In contrast, SPSS does so much of this for you.)

Around the early 2010s, I explored alternatives like R and Python’s NumPy ecosystem. These modern tools offered amazing capabilities, especially for visualization. But I kept running into a frustrating wall: tasks that I could do in seconds with SPSS would take hours of searching documentation and Stack Overflow in Python or R. It felt like learning to walk again, so I stuck with SPSS.

In the early years of the DORA research, I was so delighted that Dr. Nicole Forsgren was also an SPSS fan, and it was the perfect way for us to share our work with each other.

Not to say that I didn’t have genuine envy of the Python statistical ecosystem. In 2018, I watched Dr. Stephen Magill doing some analysis when we were working together on the State of the Software Supply Chain Report. He was using scikit-learn in a Jupyter notebook, working in a nice REPL-like way, showing off the kickbutt scientific visualizations that the Python ecosystem is so famous for.

It was dazzling. But the learning curve looked so high that I never seriously considered using it myself.

In December, I had to solve an urgent problem. I wanted to analyze the tens of thousands of people who attended a conference I chaired over ten years (Enterprise Technology Leadership Summit, formerly DevOps Enterprise Summit). But I couldn’t get an SPSS license key because I couldn’t log into the IBM e-commerce site, because I never got the email to confirm my account reactivation.

I needed to solve this problem that day, and I wondered if I could solve it in a modern Python way, using all the tools that I saw Dr. Magill using years ago.

But here’s the thing: I have probably less than 100 hours of Python experience total, and it’s been a decade since I’ve done any real Python work. The barrier to entry was daunting: just the idea of getting my head around virtual environments (uv or pyenv?), package managers (pip or conda or ??), numpy, pandas, and matplotlib — and I was already scared about having both python2 and python3 on my laptop.

As Allison McMillan said when she was Director of Engineering at GitHub: “It shouldn’t be a core competency to manage two different versions of Python on your machine.” I remember not even being able to get one version running! It was never a good year to tackle this problem.

But then I discovered something incredible: there’s a coding assistant in Google Colab, which is a hosted Python Notebook platform! My first prompt was comically basic, revealing the depth of my complete ignorance of how things worked in these notebooks: “I have a CSV file I’d like to analyze. how do I upload a file?”

(This should convince you that I knew almost NOTHING about python notebooks!)

But this embedded Google Gemini was an infinitely patient teacher, and it generated the code for me to do that. (I like to think that I could have figured it out in half an hour. But if I’m brutally honest, given how alien the environment was, it could easily have taken me hours.)
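The generated code was along these lines. (In Colab, the `google.colab.files` helper opens a file picker; outside Colab, I sketch the read step with an in-memory CSV — the file name and columns here are made up.)

```python
import io
import pandas as pd

# In Colab, the answer looks roughly like:
#     from google.colab import files
#     uploaded = files.upload()  # opens a browser file picker
#     df = pd.read_csv(io.BytesIO(uploaded["attendees.csv"]))
#
# The same read step, with an in-memory CSV standing in for the upload:
csv_text = "Name,Event\nAda,Las Vegas 2024\nGrace,Amsterdam 2023\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```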

(In December, the Gemini inside of Colab seemed quite old — like, literally a pale shadow of what you can get inside of Google AI Studio.)

From there, the chat coding questions flowed naturally:

  • How many rows are in this data
  • List all the columns
  • Generate a bar graph of the attendees by year
  • Rename these columns
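Each of those questions maps to a pandas one-liner — a sketch, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Attendee": ["Ada", "Grace", "Ada"],
    "Year": [2022, 2023, 2023],
})

n_rows = len(df)                                  # how many rows are in this data
columns = list(df.columns)                        # list all the columns
by_year = df["Year"].value_counts().sort_index()  # attendees by year
# by_year.plot(kind="bar")                        # bar graph (needs matplotlib)
df = df.rename(columns={"Attendee": "name", "Year": "year"})  # rename these columns
```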

I gained confidence that the data was imported correctly, and also learned enough about how the Pandas dataframes worked to understand the code Gemini was generating.

Then I got to my real challenge: actually analyzing the attendee data. There are 24 events, coded like “Las Vegas 2024,” “Amsterdam 2023,” “Virtual 2020,” and so on. I needed to extract years into ordinal values and create proper columns for analysis.

I found myself paralyzed thinking about how to approach this. I wasn’t even sure how I’d do it in SPSS. So I asked Gemini: “I need columns that indicate they attended in a given year, so I can express ideas like ‘after 2020 pandemic,’ and express recency and frequency.”

Looking at the code it generated, I was in awe. It would have taken me hours, maybe days, to figure this out on my own. Here are things I do not know:

  • how to do regular expressions in Python
  • how to map “yes/no” values to booleans
  • the rules around numeric conversions, sets, and so on
  • how null values (or do they call them nils?) are handled in dicts
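For the curious, each of those gaps turns out to be a line or two of pandas. A sketch — not the code Gemini actually produced:

```python
import re
import pandas as pd

# Regular expressions in Python: pull the 4-digit year out of an event name
year = int(re.search(r"\d{4}", "Las Vegas 2024").group(0))

# Mapping "yes"/"no" strings to booleans (unmapped values become NaN)
attended = pd.Series(["yes", "no", "yes", None]).map({"yes": True, "no": False})

# Nulls are None in plain Python; pandas surfaces them as NaN/NA,
# and .isna() is how you count them
missing = int(attended.isna().sum())
```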

I’ve used pandas dataframes exactly once, and primarily what I know about them comes from watching Magill work years ago. Heck, I wouldn’t have even gotten this far — I honestly don’t even know which libraries to use for stuff like this!

I show some code below, not for you to study, but for illustrative purposes. Increasingly, I look at code like this, and think, “Umm, sure. Looks good to me?”

import pandas as pd
import re

event_cols = []  # Global list for event columns


def process_event_attendance(df):
    """
    Process event attendance data to calculate various attendance metrics.

    Additional metrics:
    - EventsSincePandemic: Number of events attended in or after 2022
    - EventsSince2019: Number of events attended in or after 2019
    """
    global event_cols  # Declare we're using the global variable

    # Create a copy of the DataFrame to avoid modifying the original
    df = df.copy()

    # Get all event columns (columns starting with a 4-digit year)
    event_cols = [col for col in df.columns
                  if re.search(r'^\d{4}\s', col)
                  and col not in ['FirstYearAttended', 'LastYearAttended']]

    # ... (additional metric calculations elided) ...

    pandemic_cols = [col for col in event_cols if int(re.search(r'^\d{4}', col).group(0)) >= 2022]
    df['EventsSincePandemic'] = df[pandemic_cols].sum(axis=1).astype(int)

    cols_since_2019 = [col for col in event_cols if int(re.search(r'^\d{4}', col).group(0)) >= 2019]
    df['EventsSince2019'] = df[cols_since_2019].sum(axis=1).astype(int)

    return df

My strange aha moment was this: “I don’t actually need to understand this code. I see the results, and the results look correct. I’ve done this for decades, and I have confidence that this function coded the events correctly, just as SPSS would have.”

(And by the way, that code block? ChatGPT generated the HTML and CSS to make it display the way it did. Just looking at it, I would have no idea whether it was correct. I just plugged it in, and viewed the page. “Looks good to me!” 😂)

(Because of the seemingly old version of Gemini, I eventually switched to Claude, where I got much better coding responses.)

Critics might say, “Wait, you CAN’T trust AI-generated code if you don’t fully understand it!” But I’d counter that we all use code we don’t fully understand every day. When’s the last time anyone read through the source code of their SPSS procedures or R libraries? What matters is that the code is testable, follows established patterns, and produces verifiable results.
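By “verifiable results,” I mean spot checks like this — a toy version with made-up column names, comparing a computed column against a row I can tally by hand:

```python
import pandas as pd

# One row per attendee; 1 means they attended that event (names made up)
df = pd.DataFrame({
    "2023 Las Vegas": [1, 0, 1],
    "2019 London":    [0, 1, 1],
})
df["EventsSince2019"] = df[["2023 Las Vegas", "2019 London"]].sum(axis=1)

# Hand-tally row 2: they attended both events, so the count should be 2
assert df.loc[2, "EventsSince2019"] == 2
```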

Besides, this isn’t mission-critical production software. It’s just an analysis I’m doing for myself. It solved my problem, just as if I had used SPSS.

For me, this represents an entirely new way of programming. I was able to do sophisticated data analysis in a language and ecosystem I barely knew, producing results that would have been out of reach otherwise. The LLM served as both teacher and coding partner, helping me understand enough to make informed decisions while handling the complex implementation details.

This isn’t just about writing code faster — it’s about being able to attempt things that would have been completely out of reach before. In SPSS, I knew every feature after decades of use. But with chat coding and Python, I could immediately work at a similar level of sophistication in an entirely new ecosystem. That’s transformative.

Watch Steve Yegge’s talk on “chat and vibe programming” from the February ETLS Connect event two weeks ago here!

About The Author

Gene Kim

Gene Kim has been studying high-performing technology organizations since 1999. He was the founder and CTO of Tripwire, Inc., an enterprise security software company, where he served for 13 years. His books have sold over 1 million copies—he is the WSJ bestselling author of Wiring the Winning Organization, The Unicorn Project, and co-author of The Phoenix Project, The DevOps Handbook, and the Shingo Publication Award-winning Accelerate. Since 2014, he has been the organizer of DevOps Enterprise Summit (now Enterprise Technology Leadership Summit), studying the technology transformations of large, complex organizations.
