
Polyphemus (Development)

Polyphemus is the model-serving service used for adversarial generation workloads. It proxies generation requests to Vertex AI and exposes authenticated REST endpoints.

Runtime and deployment notes

  • Runtime baseline: Python >=3.12
  • Router module: apps/polyphemus/src/rhesis/polyphemus/routers/services.py
  • Request schemas: apps/polyphemus/src/rhesis/polyphemus/schemas/schemas.py
  • Docker image: API-only service image. PyTorch is not bundled in the Polyphemus container; model weights and serving runtime live behind Vertex AI.

API endpoints

Polyphemus exposes two primary generation endpoints:

| Endpoint | Purpose | Auth |
|---|---|---|
| POST /generate | Single generation request | Bearer token required |
| POST /generate_batch | Batch generation for multiple requests | Bearer token required |

/generate_batch accepts up to 50 items per call (MAX_BATCH_SIZE).
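A client that has more than 50 items simply needs to split its work across multiple /generate_batch calls. The sketch below is illustrative client-side code, not part of Polyphemus; only the 50-item cap (MAX_BATCH_SIZE) comes from the docs.

```python
# Client-side helper (hypothetical): split work into /generate_batch-sized payloads.
# MAX_BATCH_SIZE mirrors the documented server-side cap of 50 items per call.
MAX_BATCH_SIZE = 50

def chunk_requests(requests, max_batch_size=MAX_BATCH_SIZE):
    """Split a list of generation requests into payloads the batch endpoint accepts."""
    return [
        {"requests": requests[i:i + max_batch_size]}
        for i in range(0, len(requests), max_batch_size)
    ]
```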

Environment configuration

Polyphemus reads Vertex AI target configuration from environment variables:

| Variable | Required | Description |
|---|---|---|
| POLYPHEMUS_ENDPOINT_ID | Yes | Vertex AI endpoint identifier |
| POLYPHEMUS_PROJECT_ID | Yes | GCP project ID for endpoint invocation |
| POLYPHEMUS_LOCATION | No | Vertex AI region (defaults to us-central1) |
| VLLM_LOGGING_LEVEL | No | vLLM container log verbosity for Vertex serving (for example, DEBUG, INFO) |

If required variables are missing, the service returns HTTP 400 with configuration error details.
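The validation behavior can be sketched as follows. This is an assumption about the shape of the check, not the actual service code; the variable names and the 400-on-missing-config behavior are from the docs, while the function name and return shape are illustrative.

```python
import os

# Required Vertex AI configuration, per the table above.
REQUIRED_VARS = ("POLYPHEMUS_ENDPOINT_ID", "POLYPHEMUS_PROJECT_ID")

def check_vertex_config(env=None):
    """Sketch of the config check: (ok, detail), with a 400-style error when
    required variables are missing and the documented us-central1 default."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return False, {"status": 400, "detail": f"Missing configuration: {', '.join(missing)}"}
    return True, {
        "endpoint_id": env["POLYPHEMUS_ENDPOINT_ID"],
        "project_id": env["POLYPHEMUS_PROJECT_ID"],
        "location": env.get("POLYPHEMUS_LOCATION", "us-central1"),
    }
```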

Deployment region variable mapping (v0.2.8+)

Region configuration uses two separate variables depending on context:

| Context | Variable | Source | Where it is consumed |
|---|---|---|---|
| GitHub Actions CI/CD workflow | REGION | secrets.REGION (falls back to secrets.TF_VAR_REGION, then us-central1) | .github/workflows/polyphemus.yml |
| Running Polyphemus service | POLYPHEMUS_LOCATION | Set to $REGION by the CI workflow | apps/polyphemus/src/rhesis/polyphemus/routers/services.py |
| Vertex model deployment script | GCP_REGION | Set directly in the local environment (not mapped from REGION) | apps/polyphemus/model_deployment/config.py |

The workflow maps REGION → POLYPHEMUS_LOCATION automatically for service deployments. The model deployment script reads GCP_REGION independently; when running it locally you must export GCP_REGION yourself (see apps/polyphemus/model_deployment/.env.example).
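The CI-side fallback chain for REGION can be expressed as a tiny sketch. The precedence (secrets.REGION, then secrets.TF_VAR_REGION, then us-central1) is from the table above; the function name and dict-based secrets lookup are illustrative.

```python
def resolve_region(secrets):
    """Illustrative resolution of the workflow's REGION value:
    secrets.REGION -> secrets.TF_VAR_REGION -> the us-central1 default."""
    return secrets.get("REGION") or secrets.get("TF_VAR_REGION") or "us-central1"
```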

vLLM logging level (v0.2.9+)

When deploying Polyphemus to Vertex AI, you can control serving container verbosity with VLLM_LOGGING_LEVEL.

deploy-polyphemus.sh
export VLLM_LOGGING_LEVEL=DEBUG
python apps/polyphemus/model_deployment/deploy.py --skip-existing

If set, deployment injects VLLM_LOGGING_LEVEL into the serving container environment.

Polyphemus deployment separates the lightweight API container from the Vertex AI serving container. Configure vLLM logging on the Vertex deployment, not by installing PyTorch or model runtime dependencies into the Polyphemus API image.
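The conditional injection described above might look like the following in the deployment script. This is a sketch under the assumption that the script builds the serving container environment as a dict; the function name is hypothetical, while VLLM_LOGGING_LEVEL and its only-if-set behavior are from the docs.

```python
import os

def serving_container_env(base_env=None):
    """Build the Vertex serving container env, injecting VLLM_LOGGING_LEVEL
    only when it is set in the deploying shell (illustrative sketch)."""
    env = dict(base_env or {})
    level = os.environ.get("VLLM_LOGGING_LEVEL")
    if level:
        env["VLLM_LOGGING_LEVEL"] = level
    return env
```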

Batch request and response format

generate_batch_request.json
{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "Summarize this policy document."
        }
      ],
      "temperature": 0.7,
      "max_tokens": 1024
    },
    {
      "messages": [
        {
          "role": "user",
          "content": "Extract key risks from this response."
        }
      ],
      "temperature": 0.2
    }
  ]
}
generate_batch_response.json
{
  "responses": [
    {
      "choices": [
        {
          "message": {
            "content": "..."
          }
        }
      ],
      "model": "vertex_ai/model",
      "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 85
      }
    },
    {
      "error": "Generation timeout"
    }
  ]
}
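As the sample response shows, items succeed or fail independently, so callers should handle per-item errors rather than treating the whole batch as failed. A minimal client-side sketch, assuming the response shape shown above (the function name is illustrative):

```python
def split_batch_results(payload):
    """Separate successful generation texts from per-item errors in a
    /generate_batch response shaped like the example above."""
    texts, errors = [], []
    for item in payload.get("responses", []):
        if "error" in item:
            errors.append(item["error"])
        else:
            texts.append(item["choices"][0]["message"]["content"])
    return texts, errors
```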

Rate limiting is applied through check_rate_limit. For batch calls, one HTTP request counts as one rate-limit unit regardless of item count.
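The one-unit-per-HTTP-request semantics can be illustrated with a simple fixed-window counter. This is a stand-in sketch, not the actual check_rate_limit implementation; only the batch-counts-as-one behavior is from the docs.

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window limiter: each HTTP request (single or batch)
    consumes exactly one unit, regardless of how many items the batch carries."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_start = time.monotonic()

    def check_rate_limit(self):
        """Return True and consume one unit, or False if the window is exhausted."""
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True
```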