ChatGPT Integration with Python: 7-Step How-To Guide

Written by: Doug Camplejohn, CEO & Co-Founder, Coffee

Key Takeaways for Your Python ChatGPT Build

  • The 2026 OpenAI Python SDK moves new projects from the older Chat Completions patterns to the Responses API, and adds native Pydantic structured outputs and workload-identity authentication for production-ready integrations.
  • A seven-step workflow covers SDK installation, secure key handling, single-turn calls, multi-turn memory, streaming, structured JSON, and cost monitoring.
  • Developers can expose ChatGPT functionality through both Flask and FastAPI endpoints while using exponential-backoff retries and token-level usage tracking to control spend.
  • Security best practices include storing keys in environment variables or KMS, rotating credentials regularly, and never exposing API keys in client-side code.
  • Ready to eliminate manual data entry and let AI agents handle CRM tasks automatically? See how Coffee automates CRM logging for your team.

7-Step Quick Start

This sequence produces a working ChatGPT response in under five minutes.

  1. Install the SDK: pip install openai python-dotenv
  2. Create a .env file containing OPENAI_API_KEY=sk-... and add .env to .gitignore.
  3. Instantiate the client: from openai import OpenAI; client = OpenAI()
  4. Make a single-turn call: response = client.responses.create(model="gpt-5.5", input="Hello")
  5. Print the result: print(response.output_text)
  6. Keep a list of messages in your script and pass it as input to enable multi-turn memory.
  7. Wrap the call in try/except and log response.usage.total_tokens for cost tracking.

The sections below expand each step with production-grade detail.

Step 1: Install the OpenAI Python SDK and Secure Authentication

Version 2.41.0 of the OpenAI Python library, released June 3, 2026, requires Python ≥ 3.9 and installs with a single command:

pip install openai python-dotenv

For improved async concurrency, install the optional aiohttp extra with pip install openai[aiohttp]. Conda users can run conda install conda-forge::openai to get the package from conda-forge.

OpenAI strongly recommends storing the API key in an environment variable named OPENAI_API_KEY rather than hard-coding it. Load it at runtime with python-dotenv:

import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

OpenAI recommends using a Key Management Service for production deployments so keys remain encrypted and separate from application code. For Kubernetes, Azure, or GCP workloads, the SDK supports workload-identity authentication via short-lived tokens through providers such as k8s_service_account_token_provider, azure_managed_identity_token_provider, or gcp_id_token_provider. This approach removes the need for long-lived API keys. Using separate keys for development, testing, and production limits the blast radius if any single key is compromised.
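The key-handling advice above can be sketched as a small fail-fast loader. This is an illustration only, not part of the OpenAI SDK: the helper name require_api_key and the per-environment variable names are hypothetical, and a plain dict stands in for the process environment so the example runs without a real key.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key stored under `name`, or fail fast with a clear error.

    `env` defaults to os.environ; passing a plain dict keeps the helper testable.
    """
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env or your secret store.")
    return key


# Hypothetical variable names: one key per environment limits the blast radius.
fake_env = {"OPENAI_API_KEY_DEV": "sk-dev-placeholder"}
dev_key = require_api_key("OPENAI_API_KEY_DEV", env=fake_env)
```

Failing at startup when the variable is missing is usually easier to debug than an authentication error on the first request.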

Step 2: Make a Single-Turn Call with the Responses API

The Responses API is the primary modern call pattern, replacing the older chat.completions.create interface for new projects:

response = client.responses.create(
    model="gpt-5.5",
    instructions="You are a helpful assistant.",
    input="Summarize the benefits of async Python.",
)
print(response.output_text)

The generated text is returned directly in the .output_text attribute. The legacy client.chat.completions.create path remains supported indefinitely and returns text via completion.choices[0].message.content. Existing codebases can continue to use that interface without immediate migration.

Step 3: Maintain Multi-Turn Conversation Memory

The API is stateless, so conversation history must be stored and passed with every request. Build a list of message dicts and append each exchange:

history = [{"role": "developer", "content": "You are a Python tutor."}]

def chat(user_input):
    history.append({"role": "user", "content": user_input})
    response = client.responses.create(model="gpt-5.5", input=history)
    reply = response.output_text
    history.append({"role": "assistant", "content": reply})
    return reply

Sending full conversation history with every message causes input token counts to grow; a 20-message conversation can make the final request cost 10× more than the first. Manage this by summarizing older turns, truncating non-critical history, or applying a sliding window to keep total tokens within the model’s context limit.
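The sliding-window option can be sketched as follows. This is a minimal example under stated assumptions: trim_history and PINNED_ROLES are hypothetical names, the window is counted in messages rather than tokens, and developer or system messages are always kept so the assistant's instructions survive trimming.

```python
from typing import Dict, List

Message = Dict[str, str]
PINNED_ROLES = ("developer", "system")  # always kept, regardless of window size


def trim_history(history: List[Message], max_messages: int) -> List[Message]:
    """Keep pinned messages plus the most recent `max_messages` other messages."""
    if max_messages < 0:
        raise ValueError("max_messages must be non-negative")
    pinned = [m for m in history if m["role"] in PINNED_ROLES]
    rest = [m for m in history if m["role"] not in PINNED_ROLES]
    recent = rest[-max_messages:] if max_messages else []
    return pinned + recent


# Build a three-turn conversation, then keep only the last two turns.
history = [{"role": "developer", "content": "You are a Python tutor."}]
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
```

Using an even max_messages keeps user and assistant messages paired, so the window never starts with an orphaned assistant reply. A token-based budget would be more precise but needs a tokenizer.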

Step 4: Enable Streaming Responses

Streaming is enabled by passing stream=True to responses.create, then iterating over the returned stream to print tokens as they arrive:

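# Synchronous streaming
with client.responses.create(
    model="gpt-5.5",
    input="Explain generators.",
    stream=True,
) as stream:
    for event in stream:
        # Only text-delta events carry a .delta attribute; skip lifecycle events.
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)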

For async web applications, import AsyncOpenAI and use async for:

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(prompt):
    async with await async_client.responses.create(
        model="gpt-5.5",
        input=prompt,
        stream=True,
    ) as stream:
        async for event in stream:
            # Only text-delta events carry a .delta attribute; skip lifecycle events.
            if event.type == "response.output_text.delta":
                print(event.delta, end="", flush=True)

asyncio.run(stream_response("Explain Python decorators."))

Server-Sent Events (SSE) is a practical way to pipe streamed responses from a Python backend to a browser client. Streaming works especially well for interactive web applications where responsiveness matters.
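To make the SSE idea concrete, here is a minimal sketch of the wire format, assuming the text deltas have already been pulled out of the model stream. The helper names sse_frame and sse_stream and the closing "done" event are hypothetical choices for illustration; only the "data:" and "event:" line syntax comes from the SSE format itself.

```python
from typing import Iterable, Iterator


def sse_frame(data: str, event: str = "") -> str:
    """Encode one SSE frame; multi-line data gets one `data:` line per line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def sse_stream(deltas: Iterable[str]) -> Iterator[str]:
    """Yield one frame per text delta, then a closing frame."""
    for delta in deltas:
        yield sse_frame(delta)
    # "done" is a hypothetical event name the browser client can listen for.
    yield sse_frame("end", event="done")


frames = list(sse_stream(["Hello", " world\nline two"]))
```

A generator like sse_stream can be returned from Flask as Response(generator, mimetype="text/event-stream") or from FastAPI as StreamingResponse(generator, media_type="text/event-stream").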

Step 5: Return Structured JSON Outputs with Pydantic

Structured outputs use client.responses.parse with a Pydantic BaseModel passed as text_format. The validated result is accessed via .output_parsed:

from pydantic import BaseModel

class TaskSummary(BaseModel):
    title: str
    priority: str
    next_step: str

response = client.responses.parse(
    model="gpt-5.5",
    instructions="Extract task details from the note.",
    input="Follow up with Acme Corp about the Q3 proposal by Friday.",
    text_format=TaskSummary,
)
task = response.output_parsed
print(task.title, task.priority, task.next_step)

Add a validation checkpoint after parsing. Assert that all required fields are non-empty strings before passing the object downstream. This approach prevents silent failures when the model returns partial data.
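One way to write that checkpoint is sketched below. The function name assert_non_empty_strings is hypothetical, and a dataclass stands in for the TaskSummary model so the example runs without Pydantic installed; the same check works on a parsed Pydantic object because it only reads attributes.

```python
from dataclasses import dataclass
from typing import Iterable


def assert_non_empty_strings(obj: object, fields: Iterable[str]) -> None:
    """Raise ValueError if any named attribute is missing, not a str, or blank."""
    bad = [
        name
        for name in fields
        if not isinstance(getattr(obj, name, None), str)
        or not getattr(obj, name).strip()
    ]
    if bad:
        raise ValueError(f"Empty or invalid fields: {', '.join(bad)}")


@dataclass
class TaskSummaryStandIn:
    """Stand-in with the same fields as TaskSummary (assumption for the demo)."""
    title: str
    priority: str
    next_step: str


REQUIRED = ("title", "priority", "next_step")

good = TaskSummaryStandIn("Follow up with Acme Corp", "high", "Email the Q3 proposal")
assert_non_empty_strings(good, REQUIRED)  # passes silently

partial = TaskSummaryStandIn("Follow up with Acme Corp", "", "   ")
error_message = ""
try:
    assert_non_empty_strings(partial, REQUIRED)
except ValueError as exc:
    error_message = str(exc)
```

Because missing attributes are treated as invalid, the check also fails loudly if the parsed result is None instead of a populated object.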

This kind of structured, agent-driven data extraction powers Coffee’s automatic contact and activity logging. It turns unstructured call notes and emails into clean CRM records without human intervention. Try Coffee’s agent-led automation on your own pipeline data.

Build people lists automatically with Coffee AI CRM Agent

Step 6: Share ChatGPT Features Through Flask and FastAPI

Once you can generate structured outputs locally, the next step is to make that capability available to other applications. Web endpoints expose ChatGPT functionality as an HTTP service that front-end clients, mobile apps, or other backend services can call. A minimal Flask endpoint wraps the Responses API call and returns JSON:

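from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()

@app.route("/chat", methods=["POST"])
def chat():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    payload = request.get_json(silent=True) or {}
    user_input = payload.get("message", "")
    if not user_input:
        return jsonify({"error": "message is required"}), 400
    response = client.responses.create(model="gpt-5.5", input=user_input)
    return jsonify({"reply": response.output_text})

if __name__ == "__main__":
    app.run(debug=False)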

The equivalent FastAPI endpoint uses async natively, which pairs well with AsyncOpenAI for non-blocking I/O under concurrent load:

from fastapi import FastAPI
from pydantic import BaseModel as PydanticModel
from openai import AsyncOpenAI

app = FastAPI()
async_client = AsyncOpenAI()

class ChatRequest(PydanticModel):
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    response = await async_client.responses.create(
        model="gpt-5.5",
        input=req.message,
    )
    return {"reply": response.output_text}

For web apps built with Flask or FastAPI, streaming model output to the frontend rather than waiting for full completion produces a noticeably more responsive user experience.

Step 7: Add Production Error Handling, Retries, and Cost Monitoring

Handle RateLimitError (HTTP 429) with exponential backoff that waits 2^attempt seconds before retrying, with a maximum retry cap to prevent infinite loops:

import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(prompt, max_retries=4):
    for attempt in range(max_retries):
        try:
            response = client.responses.create(model="gpt-5.5", input=prompt)
            tokens = response.usage.total_tokens
            print(f"Tokens used: {tokens}")
            return response.output_text
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # out of attempts; do not sleep before giving up
            wait = 2 ** attempt  # 1, 2, 4 seconds
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded.")

Read response.usage after each call to calculate per-request costs. Input and output tokens are typically billed at different rates, so multiply response.usage.input_tokens and response.usage.output_tokens by their respective published per-token prices; multiplying total_tokens by a single price gives only a rough estimate. Set billing alerts in the OpenAI dashboard at 50%, 75%, and 90% of budget to catch runaway spend before it compounds. Retry mechanisms without proper backoff can generate redundant requests during outages and inflate bills, so a retry cap is non-negotiable.
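The cost arithmetic can be sketched as below. The prices are placeholder values, not real rates, and Pricing, estimate_cost, and budget_alerts are hypothetical names; in a real integration the token counts would come from the usage object on each response.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (placeholder values, not published rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Price input and output tokens separately, then convert to USD."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def budget_alerts(
    spent: float, budget: float, thresholds: Sequence[float] = (0.5, 0.75, 0.9)
) -> List[float]:
    """Return every threshold fraction that current spend has reached."""
    return [t for t in thresholds if spent >= budget * t]


pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
cost = estimate_cost(input_tokens=1200, output_tokens=300, pricing=pricing)
crossed = budget_alerts(spent=80.0, budget=100.0)
```

With these placeholder prices, 1,200 input tokens and 300 output tokens cost the same amount each, which shows why a short reply can still dominate the bill when output tokens are priced higher.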

Security and Cost Controls for Stable Deployments

Never deploy an API key in client-side environments such as browsers or mobile apps, and route all requests through a backend server to keep keys in one controlled place. Because backend repositories can still leak credentials, commit a .env.example template containing only placeholder variable names so contributors know which variables are required without exposing credentials.

Rotate API keys on a regular schedule such as every 90 days and scan repositories with tools like Gitleaks or GitHub secret scanning to catch accidental commits early. OpenAI supports IP allowlisting, which rejects requests from unauthorized addresses even when a valid key is presented. Together, these practices create a layered defense around your credentials.

Caching identical or semantically similar queries can reduce costs for applications with repetitive queries. Use a lower temperature value such as 0.2 for factual tasks to produce more consistent responses. Temperature does not shorten output on its own, so pair it with an explicit cap such as max_output_tokens to limit token consumption and keep bills predictable.
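A minimal sketch of exact-match caching follows. PromptCache and fake_model are hypothetical names, and the stand-in model function replaces the real API call so the example runs offline. Matching semantically similar queries would need an embedding-based lookup, which this sketch does not attempt.

```python
import hashlib
import json
from typing import Callable, Dict


class PromptCache:
    """In-memory cache for identical (model, prompt) pairs."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash a canonical JSON payload so the key has a fixed length.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = self._call_model(model, prompt)
        self._store[key] = reply
        return reply


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call (assumption for the demo)."""
    return f"[{model}] reply to: {prompt}"


cache = PromptCache(fake_model)
first = cache.get("gpt-5.5", "What is a generator?")
second = cache.get("gpt-5.5", "What is a generator?")  # served from the cache
```

Swapping the dict for Redis or another shared store would let several workers reuse the same cached replies.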

Ready to skip the manual wiring entirely? Let Coffee’s AI agent handle data capture, enrichment, and logging for you.

Building a company list with Coffee AI

Validation and Success Criteria Before Launch

Before promoting an integration to production, verify the following checkpoints. First, confirm that response.output_text is a non-empty string on at least 100 consecutive test calls. Second, for structured outputs, assert that every required Pydantic field is populated and that no ValidationError is raised.

Third, replay a five-turn conversation and confirm that the assistant references context from turn one in turn five, which validates memory retention. Fourth, inspect cost logs to confirm that token counts per request stay within the expected range for your prompt template. Fifth, trigger a deliberate RateLimitError by exceeding your tier’s requests-per-minute limit and confirm that the retry handler backs off correctly without exceeding the retry cap.

Variations and Scaling Considerations for High Traffic

For high-concurrency traffic, instantiate AsyncOpenAI with http_client=DefaultAioHttpClient() to take advantage of aiohttp’s connection pooling. Background workers such as Celery tasks, ARQ jobs, or asyncio task queues are appropriate for non-interactive workloads such as batch summarization or nightly enrichment runs.

Use request queuing to smooth traffic spikes and reduce the frequency of hitting OpenAI rate limits on requests-per-minute and tokens-per-minute thresholds. For cost efficiency, prefer a smaller model for high-volume or simple tasks and run A/B tests comparing it against larger models to identify the lowest-cost option that still meets quality requirements.
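One simple form of request queuing is a concurrency cap, sketched here with asyncio.Semaphore. The names run_limited and fake_worker are hypothetical, and the worker sleeps briefly instead of calling the API; in practice it would await an AsyncOpenAI request.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def run_limited(
    prompts: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrent: int,
) -> List[str]:
    """Run `worker` over all prompts with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt: str) -> str:
        async with semaphore:  # waits here while the cap is reached
            return await worker(prompt)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for an API call; records how many calls overlap."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return prompt.upper()


results = asyncio.run(
    run_limited(["a", "b", "c", "d", "e"], fake_worker, max_concurrent=2)
)
```

A semaphore bounds concurrency rather than requests per minute, so it smooths spikes but does not replace backoff on rate-limit errors.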

Frequently Asked Questions

How do I use the ChatGPT API in Python for free?

OpenAI does not offer a permanently free tier for API access, but new accounts receive a small credit that expires after a set period. To minimize spend during development, use a smaller model such as GPT-4o-mini, set a low max_output_tokens limit on every call (max_tokens in the Chat Completions API), and cache repeated queries locally so identical prompts never hit the API twice. Set a hard spending limit in the OpenAI dashboard so a runaway loop cannot generate unexpected charges. Once your credit is exhausted, you must add a payment method to continue making API calls.

How do I run ChatGPT Python code locally?

Install the SDK with pip install openai python-dotenv, create a .env file in your project root containing OPENAI_API_KEY=your-key-here, and add .env to .gitignore. Call load_dotenv() at the top of your script before instantiating the client. The SDK reads the key automatically from the environment variable. All API calls are made over HTTPS to OpenAI’s servers, and there is no local model inference unless you separately run an open-source model with a compatible API interface.

How do I keep conversation history across multiple turns?

As explained in Step 3, the API is stateless. Your application must maintain a list of message dictionaries and pass the full list with every request. Append each user message before the call and each assistant reply after it. To prevent token costs from growing unbounded, apply a sliding window that keeps only the most recent N turns, or periodically summarize older turns into a single condensed message and replace the raw history with that summary. Store the history list in a database or cache if it needs to persist across server restarts or multiple user sessions.

What is the difference between the Responses API and the Chat Completions API?

The Responses API is the current recommended interface introduced in 2025 and used throughout this article. It accepts input, which can be a string or list of messages, plus an optional instructions string for system-level guidance, and it returns the response text directly. The Chat Completions API, accessed via client.chat.completions.create, accepts a messages list and returns the response content from the completion. Both are supported indefinitely, so existing code using Chat Completions does not need to be rewritten, but new projects should use the Responses API for access to the latest features including native structured output parsing.

How do I prevent unexpected API costs in production?

As covered in Step 7, configure billing alerts at multiple thresholds to catch cost overruns early. Always set max_output_tokens on every Responses API call (max_tokens in the Chat Completions API) to cap response length. Implement per-user rate limiting at the application level so no single user can exhaust your quota. Cache frequent or repeated queries in Redis or a similar store and serve cached results without making an API call. Log token usage on every request and attribute costs to specific features or endpoints so you can identify which parts of your application are most expensive and reduce their usage first.
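Per-user rate limiting can be sketched with a sliding window, as below. PerUserRateLimiter is a hypothetical in-memory helper, and the injectable clock exists only so the demo can simulate time passing; a multi-process deployment would need a shared store instead of a local dict.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerUserRateLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# A fake clock makes the demo deterministic.
fake_now = [0.0]
limiter = PerUserRateLimiter(limit=2, window_seconds=60.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("alice") for _ in range(3)]
other_user = limiter.allow("bob")
fake_now[0] = 61.0
after_window = limiter.allow("alice")
```

Calling allow before each API request and returning HTTP 429 when it is False keeps one heavy user from consuming the shared quota.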

Conclusion

The 2026 OpenAI Python SDK provides a clean, production-ready path from a single-turn CLI call to a fully streamed, structured-output FastAPI service. Workload-identity auth, Pydantic validation, exponential-backoff retries, and token-level cost monitoring are all available out of the box. Each pattern in this guide reflects the same agent-led automation philosophy that Coffee applies to CRM: capture structured data reliably, eliminate manual entry, and surface accurate insights automatically.

If your team spends hours stitching AI calls into scripts while sales reps still log calls by hand, the underlying problem is the same. Put Coffee’s AI agent to work on your pipeline today.