Building an End-to-End LLM Pipeline with AWS SageMaker

I've been putting together a pipeline on AWS that takes raw data, trains a large language model, and serves it for inference — all without managing a single server. The setup spans a few services, but each piece has a clear job. Let me walk you through how it all fits together.

The Pipeline at a Glance

The data flows like this:

  1. AWS Glue extracts raw data and writes it as Parquet to S3
  2. Amazon Athena queries and validates that data on demand
  3. SageMaker Training Jobs consume the data and train the LLM
  4. SageMaker Batch Transform runs batch inference for evaluation
  5. SageMaker Endpoints serve the trained model in production
  6. Downstream services call those endpoints for real-time inference

Each layer does one thing well. Let's go through them.

Extracting Data with AWS Glue

Glue is where the pipeline starts. A Glue Job pulls raw data from the source, applies transformations, and writes it out as Parquet files to an S3 bucket. Parquet is the right format here — it's columnar, compressed, and works seamlessly with everything downstream.

import sys

from awsglue.job import Job
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark import SparkContext

# Resolve the job name passed in by Glue and initialize the job run
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw table from the Glue Data Catalog
raw_data = glueContext.create_dynamic_frame_from_catalog(
    database="my_database",
    table_name="raw_events"
)

# Write it out to S3 as date-partitioned Parquet
glueContext.write_dynamic_frame_from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/training-data/",
        "partitionKeys": ["date"]
    },
    format="parquet"
)

job.commit()

The job runs on a schedule, so the S3 bucket stays fresh without any manual intervention.
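One way to wire up that schedule is a Glue scheduled trigger. Here's a minimal sketch of the arguments you'd pass to `glue.create_trigger`; the job name, trigger name, and cron expression are all illustrative, not from the original pipeline:

```python
def build_trigger_config(job_name: str, schedule: str) -> dict:
    """Arguments for glue.create_trigger: run the named job on a cron schedule."""
    return {
        "Name": f"{job_name}-nightly",
        "Type": "SCHEDULED",
        "Schedule": schedule,  # Glue cron syntax, evaluated in UTC
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# Hypothetical job name; runs every day at 02:00 UTC
config = build_trigger_config("llm-data-extract", "cron(0 2 * * ? *)")

# With credentials in place, the actual call would be:
# import boto3
# boto3.client("glue").create_trigger(**config)
```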

Querying with Amazon Athena

Once the Parquet files are in S3, Athena lets me query them with plain SQL — no cluster to spin up, no provisioning. I use it for two things: validating that the Glue transformation produced the right data, and as a lightweight source for feeding into training.

SELECT COUNT(*), MIN(date), MAX(date)
FROM training_data
WHERE date >= '2025-06-01';

Athena query results can be exported back to S3, which keeps the handoff to SageMaker clean and decoupled.
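That export can also be driven programmatically with `athena.start_query_execution`, which writes its results to an S3 location you choose. A sketch, with the database and output bucket assumed rather than taken from the real setup:

```python
def start_validation_query(database: str, output_s3: str) -> dict:
    """Arguments for athena.start_query_execution; results land at output_s3."""
    return {
        "QueryString": (
            "SELECT COUNT(*) AS row_count, MIN(date) AS first_day, MAX(date) AS last_day "
            "FROM training_data WHERE date >= '2025-06-01'"
        ),
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = start_validation_query("my_database", "s3://my-bucket/athena-results/")

# With credentials in place:
# import boto3
# query_id = boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
```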

Training the LLM with SageMaker

This is the core of it. SageMaker Training Jobs handle the actual model training — provisioning GPU instances, running distributed training, and saving checkpoints. I use a custom container image that has the training script and dependencies baked in, and point it at the Parquet data in S3.

import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = sagemaker.estimator.Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:latest",
    role=role,
    instance_count=2,  # distributed across two GPU nodes
    instance_type="ml.p3.16xlarge",
    output_path="s3://my-bucket/model-output/",
    sagemaker_session=session
)

# Channel names map to /opt/ml/input/data/<channel> inside the container
estimator.fit({
    "train": "s3://my-bucket/training-data/",
    "validation": "s3://my-bucket/validation-data/"
})

When the job finishes, the trained model artifacts end up in output_path on S3. SageMaker takes care of the rest — scaling, monitoring, and logging.

Evaluating with Batch Transform

Before anything goes to production, I run the model against a held-out evaluation set. Batch Transform is perfect for this — it runs inference over an entire dataset in one pass, no persistent endpoint needed.

transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://my-bucket/batch-output/"
)

transformer.transform(
    data="s3://my-bucket/evaluation-data/",
    content_type="application/json",
    split_type="Line"
)
transformer.wait()

The predictions land in S3. I pull them into a notebook or Athena to compute metrics and decide whether the model is ready to deploy.
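The metric computation itself can be a small script once the predictions are downloaded. This is a toy sketch: it assumes each output line is a JSON object with `prediction` and `label` keys, but the actual schema depends entirely on what your serving container emits.

```python
import json

def exact_match_rate(predictions_jsonl: str) -> float:
    """Toy exact-match metric over Batch Transform output.

    Assumes one JSON object per line with "prediction" and "label" keys;
    adjust the keys to match your container's output format.
    """
    records = [json.loads(line) for line in predictions_jsonl.strip().splitlines() if line]
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["prediction"] == r["label"])
    return hits / len(records)

sample = '{"prediction": "a", "label": "a"}\n{"prediction": "b", "label": "c"}'
print(exact_match_rate(sample))  # → 0.5
```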

Deploying with SageMaker Endpoints

Once the model passes evaluation, I deploy it as a real-time endpoint. This gives other services a stable, managed URL to call for inference.

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

print(predictor.endpoint_name)
# e.g. llm-trainer-2026-02-04-12-30-00-000

SageMaker handles health checks and traffic management under the hood, and you can attach an auto-scaling policy so the instance count tracks load. Once deployed, the endpoint just stays up and ready.
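Endpoint auto-scaling is configured through Application Auto Scaling rather than being on by default. Here's a sketch of the two calls involved; the capacity bounds and the invocations-per-instance target are illustrative numbers, not values from the original setup:

```python
def scaling_policy_args(endpoint_name: str, variant: str = "AllTraffic") -> tuple:
    """Arguments for register_scalable_target and put_scaling_policy
    on a SageMaker endpoint variant. Capacity and target are examples."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 1,
        "MaxCapacity": 4,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 100.0,  # invocations per instance, per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

target, policy = scaling_policy_args("llm-trainer-2026-02-04-12-30-00-000")

# With credentials in place:
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.register_scalable_target(**target)
# autoscaling.put_scaling_policy(**policy)
```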

Consuming the Endpoint

Other services in the stack call the endpoint directly over HTTPS. From their perspective, it's just an API — they don't need to know anything about the training setup or the underlying infrastructure.

import boto3

client = boto3.client("sagemaker-runtime")

response = client.invoke_endpoint(
    EndpointName="llm-trainer-2026-02-04-12-30-00-000",
    ContentType="application/json",
    Body=b'{"prompt": "Summarize the following text: ..."}'
)

result = response["Body"].read().decode("utf-8")

That's it. The consuming service gets back the model's response and can do whatever it needs with it.
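In practice the consuming service will want to decode that raw body into something usable. A minimal sketch, assuming the serving container returns a JSON object with a `generated_text` key — the real key depends on your inference code:

```python
import json

def parse_llm_response(body_bytes: bytes) -> str:
    """Decode an invoke_endpoint response body.

    Assumes the container returns {"generated_text": ...};
    adjust the key to match your serving code.
    """
    payload = json.loads(body_bytes.decode("utf-8"))
    return payload["generated_text"]

print(parse_llm_response(b'{"generated_text": "A short summary."}'))
# → A short summary.
```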

Putting It All Together

The full pipeline runs with minimal babysitting:

  • Glue keeps the training data fresh in S3
  • Athena validates and explores it
  • SageMaker trains new model versions as the data evolves
  • Batch Transform gates what goes to production
  • Endpoints serve the live model
  • Downstream services consume it like any other API

It's a clean way to go from raw data to a production LLM on AWS. Each layer is independently testable, and the whole thing scales without you touching a single server. Definitely worth exploring if you're looking to get ML workloads running on AWS.