Polyphemus (Development)

Polyphemus is the model-serving service used for adversarial generation workloads. It proxies generation requests to Vertex AI and exposes authenticated REST endpoints.

Runtime and deployment notes

  • Runtime baseline: Python >=3.12
  • Router module: apps/polyphemus/src/rhesis/polyphemus/routers/services.py
  • Request schemas: apps/polyphemus/src/rhesis/polyphemus/schemas/schemas.py

API endpoints

Polyphemus exposes two primary generation endpoints:

| Endpoint | Purpose | Auth |
|---|---|---|
| POST /generate | Single generation request | Bearer token required |
| POST /generate_batch | Batch generation for multiple requests | Bearer token required |

/generate_batch accepts up to 50 items per call (MAX_BATCH_SIZE).
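
A client with more than 50 items has to split them across several calls. The following is a minimal sketch of that, not the service's own client code. It assumes the token is sent in a standard `Authorization: Bearer` header; `base_url`, `chunked`, and `build_batch_request` are hypothetical names used only for illustration. The request is built but not sent.

```python
import json
import urllib.request
from collections.abc import Iterator

MAX_BATCH_SIZE = 50  # per-call item limit stated in this document


def chunked(items: list[dict], size: int = MAX_BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_request(
    base_url: str, token: str, items: list[dict]
) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST /generate_batch request.

    Assumption: the bearer token travels in the Authorization header.
    """
    if len(items) > MAX_BATCH_SIZE:
        raise ValueError(f"batch has {len(items)} items; limit is {MAX_BATCH_SIZE}")
    body = json.dumps({"requests": items}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate_batch",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Each request returned by `build_batch_request` can be passed to `urllib.request.urlopen` or translated to whichever HTTP client the caller already uses.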

Environment configuration

Polyphemus reads Vertex AI target configuration from environment variables:

| Variable | Required | Description |
|---|---|---|
| POLYPHEMUS_ENDPOINT_ID | Yes | Vertex AI endpoint identifier |
| POLYPHEMUS_PROJECT_ID | Yes | GCP project ID for endpoint invocation |
| POLYPHEMUS_LOCATION | No | Vertex AI region (defaults to us-central1) |

If required variables are missing, the service returns HTTP 400 with configuration error details.
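
The lookup and validation described above might look roughly like the sketch below. It is an illustration only and does not reproduce the Polyphemus source: `VertexTarget`, `ConfigurationError`, and `load_vertex_target` are hypothetical names, and treating an empty value as missing is an assumption. How the error is mapped to an HTTP 400 response is left to the caller.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

DEFAULT_LOCATION = "us-central1"  # default region stated in this document
REQUIRED_VARS = ("POLYPHEMUS_ENDPOINT_ID", "POLYPHEMUS_PROJECT_ID")


class ConfigurationError(Exception):
    """Hypothetical error type; a handler could translate it to HTTP 400."""


@dataclass(frozen=True)
class VertexTarget:
    endpoint_id: str
    project_id: str
    location: str


def load_vertex_target(env: Mapping[str, str] = os.environ) -> VertexTarget:
    """Read the Vertex AI target from `env`, failing if required values are absent."""
    # Assumption: an empty string counts as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigurationError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return VertexTarget(
        endpoint_id=env["POLYPHEMUS_ENDPOINT_ID"],
        project_id=env["POLYPHEMUS_PROJECT_ID"],
        location=env.get("POLYPHEMUS_LOCATION") or DEFAULT_LOCATION,
    )
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary instead of the real process environment.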

Batch request and response format

generate_batch_request.json
{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "Summarize this policy document."
        }
      ],
      "temperature": 0.7,
      "max_tokens": 1024
    },
    {
      "messages": [
        {
          "role": "user",
          "content": "Extract key risks from this response."
        }
      ],
      "temperature": 0.2
    }
  ]
}

generate_batch_response.json
{
  "responses": [
    {
      "choices": [
        {
          "message": {
            "content": "..."
          }
        }
      ],
      "model": "vertex_ai/model",
      "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 85
      }
    },
    {
      "error": "Generation timeout"
    }
  ]
}
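
As the sample response shows, a batch can succeed for some items and fail for others, so callers should inspect each entry rather than rely on the HTTP status alone. The sketch below separates the two cases based only on the fields visible in the sample. `split_batch_results` is a hypothetical helper, and the indices it returns are positions in the `responses` array; that these match the order of the submitted requests is an assumption.

```python
def split_batch_results(
    payload: dict,
) -> tuple[list[tuple[int, str]], list[tuple[int, str]]]:
    """Return (successes, failures) as lists of (index, text) pairs.

    An entry with an "error" key is treated as a failure; any other entry
    is expected to carry choices[0].message.content, as in the sample.
    """
    successes: list[tuple[int, str]] = []
    failures: list[tuple[int, str]] = []
    for index, item in enumerate(payload.get("responses", [])):
        if "error" in item:
            failures.append((index, item["error"]))
        else:
            content = item["choices"][0]["message"]["content"]
            successes.append((index, content))
    return successes, failures
```

Failed indices can then be collected into a follow-up batch if the caller chooses to retry them.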

Rate limiting is applied through check_rate_limit. For batch calls, one HTTP request counts as one rate-limit unit regardless of item count.
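
Because a batch call counts once however many items it carries, the rate-limit cost of a workload sent through /generate_batch is the number of HTTP calls, not the number of items. A small sketch of that arithmetic, assuming the caller fills each batch up to the 50-item limit (`rate_limit_units` is a hypothetical helper, not part of the service):

```python
import math

MAX_BATCH_SIZE = 50  # per-call item limit stated in this document


def rate_limit_units(item_count: int, batch_size: int = MAX_BATCH_SIZE) -> int:
    """Number of /generate_batch calls (one rate-limit unit each) for `item_count` items."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    return math.ceil(item_count / batch_size)
```

For example, 120 items sent in full batches need three calls and therefore consume three units.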