Build FAQ Generator From PDFs

A guide to creating an automated FAQ generator that processes PDF documents and extracts structured Q&A pairs using Unbody’s Enhancement Pipeline.

Pipeline Setup

Since FAQ generation only targets PDFs, we create the pipeline with a type filter. The `if` condition limits processing to PDF files within the TextDocument collection, so no resources are spent on other document types:

// Initialize pipeline for PDF document processing
const faqPipeline = new Enhancement.Pipeline(
  "generate_faqs",      // Unique pipeline identifier
  "TextDocument",       // Target collection
  {
    // Process only PDF files using MIME type check
    if: (ctx) => ctx.record.mimeType === "application/pdf",
  }
);
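The filter condition is an ordinary predicate, so it can be checked in isolation before it is wired into a pipeline. The sketch below is illustrative only: `RecordContext` is a hypothetical, minimal stand-in for the context object (the real one likely carries more fields), and `isPdf` is a name introduced here for the example.

```typescript
// Hypothetical minimal shape of the context passed to the `if` predicate.
// This is an assumption for illustration, not the SDK's actual type.
interface RecordContext {
  record: { mimeType?: string; title?: string };
}

// Same check as the pipeline's `if` option, extracted so it can be tested alone.
const isPdf = (ctx: RecordContext): boolean =>
  ctx.record.mimeType === "application/pdf";

// Sample records: one PDF, one non-PDF, one with no MIME type at all.
const samples: RecordContext[] = [
  { record: { mimeType: "application/pdf", title: "Handbook" } },
  { record: { mimeType: "text/markdown", title: "README" } },
  { record: { title: "No MIME type" } },
];

// Only records that pass the predicate would reach the pipeline's steps.
const accepted = samples.filter(isPdf).map((ctx) => ctx.record.title);
console.log(accepted); // only the "Handbook" record passes
```

Because the comparison is strict, records with a missing or differently cased MIME type are skipped rather than processed.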

FAQ Generation Configuration

Define the pipeline step that extracts the FAQs. Using the StructuredGenerator action with a Zod schema, the step turns the document content into structured Q&A pairs and stores them in the xFAQs field as a JSON string:

// Create FAQ generation step with structured output
const generateFAQs = new Enhancement.Step(
  "extract_faqs",    // Step identifier
  new Enhancement.Action.StructuredGenerator({
    model: "openai-gpt-4o",
    // Dynamic prompt generation using document context
    prompt: (ctx) => `
      Generate a list of FAQs based on this document:
      Title: ${ctx.record.title || "Untitled"}
      Content: ${ctx.record.text}
      
      Extract key questions and their answers from the content.
    `,
    // Define structured output schema for FAQs
    schema: (ctx, { z }) =>
      z.object({
        faqs: z.array(
          z.object({
            question: z.string(),
            answer: z.string(),
          })
        ).describe("List of FAQs extracted from the document"),
      }),
  }),
  {
    // Transform and store results
    output: {
      xFAQs: (ctx) => JSON.stringify(ctx.result.json.faqs),
    },
  }
);
 
// Add generation step to pipeline
faqPipeline.add(generateFAQs);
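To make the data flow of this step concrete, here is a minimal sketch of the two callbacks as plain functions, with no SDK involved. `DocRecord` and `StepResult` are hypothetical shapes inferred from the fields used above (`ctx.record.title`, `ctx.record.text`, `ctx.result.json.faqs`), and `buildPrompt` and `toXFAQs` are names introduced for this example.

```typescript
interface Faq {
  question: string;
  answer: string;
}

// Hypothetical shapes inferred from the fields the step reads; assumptions only.
interface DocRecord {
  title?: string;
  text: string;
}
interface StepResult {
  json: { faqs: Faq[] };
}

// Mirrors the step's prompt callback, including the "Untitled" fallback.
const buildPrompt = (record: DocRecord): string =>
  `
Generate a list of FAQs based on this document:
Title: ${record.title || "Untitled"}
Content: ${record.text}

Extract key questions and their answers from the content.
`.trim();

// Mirrors the step's output mapping: the FAQ array is stored as a JSON string.
const toXFAQs = (result: StepResult): string =>
  JSON.stringify(result.json.faqs);

const prompt = buildPrompt({ text: "Refunds are issued within 14 days." });
const stored = toXFAQs({
  json: {
    faqs: [{ question: "How long do refunds take?", answer: "Up to 14 days." }],
  },
});

console.log(prompt);
console.log(stored);
```

Since xFAQs is declared as a text field, the array has to be serialized on write and parsed again on read.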

Project Setup

Configure the project settings with the required AI capabilities and a custom schema for storing FAQs:

  • Configure vectorizers and AI models for processing
  • Add a custom schema that extends the TextDocument collection with an FAQ storage field

// Initialize project settings with required capabilities
const projectSettings = new ProjectSettings();
projectSettings
  .set(new TextVectorizer(TextVectorizer.OpenAI.TextEmbedding3Small))
  .set(new Generative(Generative.OpenAI.GPT4o))
  .set(new AutoSummary(AutoSummary.OpenAI.GPT4o))
  .set(new AutoVision(AutoVision.OpenAI.GPT4o))
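  // `enhancement` is not defined in this guide; it is assumed to be the
  // Enhancement settings object that registers faqPipeline
  .set(enhancement)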
  .set(
    // Add custom schema for FAQ storage
    new CustomSchema().extend(
      new CustomSchema.Collection("TextDocument").add(
        new CustomSchema.Field.Text(
          "xFAQs",
          "FAQs generated from the document"
        )
      )
    )
  );
 
// Create and save project with configured settings
const project = admin.projects.ref({
  name: "PDF FAQ Extraction Project",
  settings: projectSettings,
});
 
await project.save();

Data Retrieval

Use TypeScript interfaces to ensure type safety when querying enhanced documents. The query targets PDF documents and retrieves their generated FAQs along with metadata:

// Define interface for enhanced document type
interface ExtendedTextDocument extends ITextDocument {
  xFAQs: StringField;
}
 
// Query documents with generated FAQs
const {
  data: { payload },
} = await unbody.get
  .collection<ExtendedTextDocument>("TextDocument")
  .where({
    mimeType: "application/pdf"
  })
  .select("title", "originalName", "xFAQs")
  .exec();
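Because the step stores the FAQs with `JSON.stringify`, the xFAQs value comes back as a string and has to be parsed before use. The following is a minimal sketch of a defensive parser; `parseFaqs` and `isFaq` are hypothetical helpers, not part of the Unbody SDK, and the sample payload shape is an assumption for illustration.

```typescript
interface Faq {
  question: string;
  answer: string;
}

// Type guard: accepts only objects with string `question` and `answer` fields.
const isFaq = (value: unknown): value is Faq =>
  typeof value === "object" &&
  value !== null &&
  typeof (value as Record<string, unknown>).question === "string" &&
  typeof (value as Record<string, unknown>).answer === "string";

// Parses a stored xFAQs value; returns an empty list for missing or malformed data.
function parseFaqs(raw: unknown): Faq[] {
  if (typeof raw !== "string" || raw.trim() === "") return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    // Drop entries that do not match the expected shape instead of failing.
    return Array.isArray(parsed) ? parsed.filter(isFaq) : [];
  } catch {
    return [];
  }
}

// Hypothetical payload entries shaped like the selected fields above.
const samplePayload = [
  {
    title: "Handbook",
    xFAQs: '[{"question":"What is covered?","answer":"All plans."},{"question":42}]',
  },
  { title: "Draft", xFAQs: "not json" },
];

const faqsByTitle = samplePayload.map((doc) => ({
  title: doc.title,
  faqs: parseFaqs(doc.xFAQs),
}));
console.log(faqsByTitle);
```

Returning an empty list on bad input keeps rendering code simple; a stricter variant could throw or log instead if silent failures are undesirable.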

©2024 Unbody