Gradio

Gradio is a great tool to quickly showcase AI projects with a nice interface.

However, without token streaming, users have to wait for the entire response to be generated before they can get any output.

With token streaming, users can view the tokens being generated in real-time, just like in ChatGPT.

Here’s an example of a streaming vs. a non-streaming response:

Layout

First, let’s create the layout for our demo.

with gr.Blocks() as demo:
  gr.Markdown("""
  <br>
  
  # How to do real-time token streaming like ChatGPT in Gradio
  ### By @clusteredbytes
  <br>
  """)
  with gr.Row():
    with gr.Column():
      question = gr.Textbox(lines=5, label="Ask the AI anything you want:")
      gr.Examples(examples=["Write a poem in cockney accent telling why West Ham are massive.", "Write a poem about love."], inputs=question)
      btn = gr.Button(value="Get Response")
    with gr.Column():
      answer = gr.Textbox(lines=15, label="Response from AI:")

  • We’re using Gradio Blocks for this tutorial.
  • The layout has two columns.
  • The input goes on the left (the question textbox), and the generated output on the right (the answer textbox).
  • We create a Button that triggers generating the response.

Button Click Handler

Now let’s create a click handler for that button:

btn.click(fn=get_ai_response, inputs=question, outputs=answer)

When the button is clicked:

  • the function get_ai_response is called
  • the string in the question textbox is passed to it as input
  • its output is sent to the answer textbox

Function to get tokens

Let’s see what the get_ai_response function does:

def get_ai_response(input: str):
  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    stream=True,  # Enable token streaming
    temperature=0.2,
    messages=[
      {"role": "system", "content": "You're an AI assistant. Do what you're told to do by the user."},
      {"role": "user", "content": input},
    ],
  )

  partial_response = ""
  for stream_response in response:
    token = stream_response["choices"][0]["delta"].get("content", "")
    partial_response += token
    yield partial_response

  • First it creates an OpenAI chat completion with the necessary parameters.
  • To enable streaming, we add the parameter stream=True.
  • With stream=True, the chat completion returns a generator object instead of a complete response.
  • Rather than getting the complete response all at once, we can use the generator to receive it token by token, on the fly.
  • Hence the user doesn’t have to wait for the entire response, but can watch it get populated incrementally, token by token.
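To see why the code reads the token with .get("content", "") instead of indexing directly, here’s a sketch of the chunk shapes the legacy (pre-1.0) openai streaming API yields, simulated with plain dicts (no API call is made, and the token contents are made up):

```python
# Simulated stream chunks in the shape the legacy openai API yields.
# The first chunk's delta carries only the role, and the final chunk's
# delta is empty -- so .get("content", "") avoids a KeyError on those.
simulated_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},        # no "content" key
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},  # empty delta
]

partial_response = ""
for stream_response in simulated_stream:
    token = stream_response["choices"][0]["delta"].get("content", "")
    partial_response += token

print(partial_response)  # -> Hello world
```

The same accumulation loop from get_ai_response runs unchanged over these fake chunks.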

Now, using the generator we got as the response:

  • we get the tokens one by one
  • we append each token to the partial_response variable
  • then we yield that partial_response
  • Gradio shows that partial_response as intermediate output
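The yield pattern can be checked without Gradio or an API key. In this minimal sketch (fake_ai_response and its canned token list are made up for illustration), each yielded value is the full text so far, which is why Gradio can simply replace the textbox contents on every update:

```python
# Minimal sketch of the streaming yield pattern, with canned tokens
# standing in for the API stream. Each yield is the cumulative text.
def fake_ai_response(tokens):
    partial_response = ""
    for token in tokens:
        partial_response += token
        yield partial_response

updates = list(fake_ai_response(["Once", " upon", " a", " time"]))
print(updates)
# -> ['Once', 'Once upon', 'Once upon a', 'Once upon a time']
```

Gradio would render each of these intermediate strings in the answer textbox in turn, ending with the complete response.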

Launch Gradio

Finally, we launch the Gradio Blocks app using demo.launch()

demo.queue()
demo.launch(share=True, debug=True)

  • we set share=True to create a public link
  • we set debug=True to see logs and errors

Notice that we use demo.queue(), which is required for streaming intermediate outputs in Gradio.

Source Code

Here’s the full source code used in this tutorial:

import gradio as gr
import openai  # note: this tutorial uses the legacy (pre-1.0) openai API

def get_ai_response(input: str):
  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    stream=True,  # Enable token streaming
    temperature=0.2,
    messages=[
      {"role": "system", "content": "You're an AI assistant. Do what you're told to do by the user."},
      {"role": "user", "content": input},
    ],
  )

  partial_response = ""
  for stream_response in response:
    token = stream_response["choices"][0]["delta"].get("content", "")
    partial_response += token
    yield partial_response


with gr.Blocks() as demo:
  gr.Markdown("""
  <br>
  
  # How to do real-time token streaming like ChatGPT in Gradio
  ### By @clusteredbytes
  <br>
  """)
  with gr.Row():
    with gr.Column():
      question = gr.Textbox(lines=5, label="Ask the AI anything you want:")
      gr.Examples(examples=["Write a poem in cockney accent telling why West Ham are massive.", "Write a poem about love."], inputs=question)
      btn = gr.Button(value="Get Response")
    with gr.Column():
      answer = gr.Textbox(lines=15, label="Response from AI:")

  btn.click(fn=get_ai_response, inputs=question, outputs=answer)

demo.queue()
demo.launch(share=True, debug=True)

I regularly tweet about these topics and anything else I’m exploring. Follow me on Twitter.