# README — Book Loader for `rag_content` Table: `load_books.py`

This loader ingests book paragraph chunks extracted **directly from a Project Gutenberg HTML file**, generates embeddings, computes metadata, and upserts rows into the `rag_content` table defined in `db/schema.sql`.  
It is designed to fully align with the schema requirements for Deliverable #2 of the assignment.

---

## 1. What the Loader Does

The loader script:

1. **Parses Project Gutenberg HTML content (not JSON)**  
   The script uses `BeautifulSoup` to extract:
   - Book title and author from `<meta>` tags or `<title>`
   - Book ID from `<meta name="dcterms.identifier">` or file name
   - Chapters from `<div class="chapter">` blocks (if present)
   - Paragraphs from `<p>` tags, including previous/next context

2. **Connects to Postgres (via Supabase)**  
   The script creates a Supabase client from the `SUPABASE_URL` and `SUPABASE_KEY` environment variables, then writes rows with:
   ```python
   supabase.table("rag_content").upsert(...)
   ```

3. **Ensures data conforms to the `rag_content` schema**  
   It populates all required and relevant fields, including:

   * `content`
   * `embedding` (vector of dimension 1536)
   * `embedding_model` (`text-embedding-3-small`, configurable)
   * `embedding_dim` (validated automatically)
   * `title`, `author`, `book_id`
   * `chapter` (chapter title)
   * `chapter_num` (sequential chapter number, in reading order)
   * `language`, `genre`, `tags`
   * `chunk_id`, `chunk_index`, `chunk_token_count`
   * `checksum`
   * `user_id`, `document_type`, etc.

4. **Creates paragraph-level chunks with optional surrounding context**  
   Each paragraph becomes one chunk.  
   The loader constructs a unified `content` string:

   ```
   <previous paragraph>

   <current paragraph>

   <next paragraph>
   ```

   For each chunk, it also calculates:

   * A stable ID (`{book_id}-chX-pY`)
   * The paragraph index
   * A token count (approximated by word count)

5. **Generates embeddings**  
   The loader calls OpenAI's embedding API with the configured model, then verifies the length of the returned vector against `embedding_dim`.

6. **Enforces idempotency**  
   Each row includes a deterministic `checksum` computed from:

   ```
   md5(content + book_id)
   ```

   Rows are written via `upsert`, so rerunning the loader on the same file updates existing rows instead of duplicating them.

7. **Handles metadata consistently**  
   The loader constructs a `meta_doc` containing:
   - Book title, author
   - Chapter title, chapter number
   - Paragraph number
   - Book ID

   This metadata is written to first-class table columns, not just a JSON blob.
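
The chunking and checksum steps above can be sketched in a few lines. This is a minimal illustration, not the code in `load_books.py`: the `Chunk` dataclass and `build_chunks` function are hypothetical names, and the 1-based, per-chapter paragraph numbering is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str     # stable ID, e.g. "frankenstein-ch1-p2"
    chunk_index: int  # paragraph number (assumed 1-based, per chapter)
    content: str      # previous + current + next paragraph
    token_count: int  # whitespace word count, a rough token estimate
    checksum: str     # md5(content + book_id)


def build_chunks(book_id: str, chapter_num: int, paragraphs: list[str]) -> list[Chunk]:
    """Turn one chapter's paragraphs into context-padded chunks."""
    chunks = []
    for i in range(len(paragraphs)):
        # Neighbours are optional: the first and last paragraph have only one.
        window = paragraphs[max(i - 1, 0): i + 2]
        content = "\n\n".join(window)
        digest = hashlib.md5((content + book_id).encode("utf-8")).hexdigest()
        chunks.append(
            Chunk(
                chunk_id=f"{book_id}-ch{chapter_num}-p{i + 1}",
                chunk_index=i + 1,
                content=content,
                token_count=len(content.split()),
                checksum=digest,
            )
        )
    return chunks
```

Because every field is derived only from the book ID, chapter number, and paragraph text, running the function twice on the same input yields identical IDs and checksums, which is what makes a repeated `upsert` safe.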

---

## 2. How the Loader Maps to the `rag_content` Columns

| `rag_content` Column | Populated By Loader                             | Notes                             |
| -------------------- | ----------------------------------------------- | --------------------------------- |
| `id`                 | Paragraph ID (`{book_id}-chX-pY`)               | Stable per paragraph              |
| `content`            | Combined context + paragraph text               | Required by schema                |
| `embedding`          | OpenAI embedding                                | `vector(1536)`                    |
| `embedding_model`    | `text-embedding-3-small` by default             | Required; configurable            |
| `embedding_dim`      | `len(embedding)`                                | Checked by schema constraint      |
| `checksum`           | `md5(content + book_id)`                        | Enables idempotent upsert         |
| `title`              | Extracted from HTML meta/title                  | Required metadata                 |
| `author`             | Extracted from HTML meta                        | Required metadata                 |
| `book_id`            | Meta identifier or HTML file stem               | Recommended                       |
| `chapter`            | Chapter title extracted from chapter `<h2>`     | Required metadata                 |
| `chapter_num`        | Sequential chapter index                        | Supports ordering                 |
| `chunk_id`           | Same as `id`                                    | Schema requirement                |
| `chunk_index`        | Paragraph number                                | Ordering                          |
| `chunk_token_count`  | Word count of `content`                         | Required metadata                 |
| `language`           | `"en"`                                          | Default                           |
| `genre`              | `"fiction"` or `None`                           | Optional                          |
| `tags`               | `["project-gutenberg", "public-domain"]`        | Supports GIN index                |
| `source_path`        | HTML file path                                  | Helpful for tracing source        |
| `user_id`            | From loader constant                            | Preserved                         |
| `document_type`      | `"book_paragraph"`                              | Preserved                         |
| `document_id`        | Optional                                        | Preserved                         |
| `username`           | Optional                                        | Preserved                         |
| `created_at`         | DB default                                      | No code needed                    |
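
To make the mapping concrete, here is a hedged sketch of how one row could be assembled from the table above. The function names, the shape of the `book`, `chapter`, and `chunk` dictionaries, and the `USER_ID` value are illustrative assumptions. `embed_fn` stands in for whatever wraps the OpenAI embeddings call, so the sketch runs without network access.

```python
from typing import Callable, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"  # default named in this README
EMBEDDING_DIM = 1536                        # matches vector(1536)
USER_ID = "loader"                          # hypothetical placeholder constant


def embed_checked(text: str, embed_fn: Callable[[str], Sequence[float]]) -> list[float]:
    """Embed text and fail fast if the vector has the wrong length."""
    vector = list(embed_fn(text))
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return vector


def build_row(book: dict, chapter: dict, chunk: dict,
              embedding: list[float], source_path: str) -> dict:
    """Map parsed data onto rag_content columns (created_at is left to the DB)."""
    return {
        "id": chunk["chunk_id"],
        "chunk_id": chunk["chunk_id"],  # same value as id
        "chunk_index": chunk["chunk_index"],
        "chunk_token_count": chunk["token_count"],
        "content": chunk["content"],
        "checksum": chunk["checksum"],
        "embedding": embedding,
        "embedding_model": EMBEDDING_MODEL,
        "embedding_dim": len(embedding),
        "title": book["title"],
        "author": book["author"],
        "book_id": book["book_id"],
        "chapter": chapter["title"],
        "chapter_num": chapter["num"],
        "language": "en",
        "genre": book.get("genre"),  # None when unknown
        "tags": ["project-gutenberg", "public-domain"],
        "source_path": source_path,
        "user_id": USER_ID,
        "document_type": "book_paragraph",
    }
```

A list of such rows could then be passed to `supabase.table("rag_content").upsert(rows)`, assuming the `supabase` Python client is used as described in section 1.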

---

## 3. How to Run the Loader

### **1. Install dependencies**

```bash
uv pip install -r requirements.txt
```

or, with plain `pip`:

```bash
pip install -r requirements.txt
```

### **2. Set required environment variables**

```bash
export SUPABASE_URL="<your-url>"
export SUPABASE_KEY="<your-supabase-key>"
export OPENAI_API_KEY="<your-openai-key>"
```
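
A missing variable tends to surface late, as an authentication error partway through a run. One way to fail fast is a small check before any client is created. The sketch below is illustrative only: `read_settings` and `REQUIRED_VARS` are hypothetical names, not taken from `load_books.py`.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY", "OPENAI_API_KEY")


def read_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary.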

### **3. Run the loader**

Example:

```bash
python scripts/load_books.py --path data/frankenstein.html
```

The loader will:

* Parse chapter/paragraph structure from HTML  
* Normalize text (smart quotes → ASCII)  
* Generate embeddings  
* Insert/upsert rows into `rag_content`  
* Print progress by chapter  
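
The normalization step can be illustrated with a short sketch. The README only states that smart quotes are mapped to ASCII; the NFC normalization and the whitespace collapsing below are assumptions added for the example, and `normalize_text` is a hypothetical name.

```python
import unicodedata

# Curly quotes mapped to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_text(text: str) -> str:
    """Normalize quotes and whitespace so identical text embeds identically."""
    text = unicodedata.normalize("NFC", text)  # assumption: compose accents
    text = text.translate(_QUOTE_MAP)
    return " ".join(text.split())              # assumption: collapse whitespace
```

Normalizing before computing the checksum matters for idempotency: two copies of the same paragraph that differ only in quote style would otherwise hash to different values.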

---

## 4. Sample Input Format (HTML)

The loader expects a **Project Gutenberg–style HTML file**, for example:

```html
<html>
<head>
  <meta name="dcterms.title" content="Frankenstein">
  <meta name="dcterms.creator" content="Mary Shelley">
  <meta name="dcterms.identifier" content="frankenstein">
</head>
<body>
  <div class="chapter">
    <h2>Letter 1</h2>
    <p>You will rejoice to hear...</p>
    <p>More text here...</p>
  </div>
  <div class="chapter">
    <h2>Letter 2</h2>
    <p>Additional paragraph...</p>
  </div>
</body>
</html>
```

If the HTML does not contain `<div class="chapter">`, the loader treats the entire document as a single chapter.
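
The parsing rules above, including the single-chapter fallback, can be sketched as follows. Note the swap: the loader itself uses `BeautifulSoup`, while this example uses the standard library's `html.parser` so that it runs without extra dependencies. `GutenbergParser` and `parse_book` are hypothetical names, and the `<title>` and file-name fallbacks are omitted for brevity.

```python
from html.parser import HTMLParser


class GutenbergParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}            # dcterms.* meta tags by name
        self.chapters = []        # [{"title": ..., "paragraphs": [...]}]
        self.all_paragraphs = []  # every <p>, used for the fallback
        self._div_depth = 0       # > 0 while inside <div class="chapter">
        self._capture = None      # "h2" or "p" while collecting text
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if tag == "meta" and name.startswith("dcterms."):
            self.meta[name] = attrs.get("content") or ""
        elif tag == "div":
            if self._div_depth:
                self._div_depth += 1  # nested div inside a chapter
            elif "chapter" in (attrs.get("class") or "").split():
                self._div_depth = 1
                self.chapters.append({"title": None, "paragraphs": []})
        elif tag in ("h2", "p"):
            self._capture, self._buffer = tag, []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = " ".join("".join(self._buffer).split())
            if tag == "p" and text:
                self.all_paragraphs.append(text)
                if self._div_depth:
                    self.chapters[-1]["paragraphs"].append(text)
            elif tag == "h2" and self._div_depth and self.chapters[-1]["title"] is None:
                self.chapters[-1]["title"] = text
            self._capture = None
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1


def parse_book(html: str) -> dict:
    parser = GutenbergParser()
    parser.feed(html)
    parser.close()
    # Fallback: no chapter divs means the whole document is one chapter.
    chapters = parser.chapters or [{"title": None, "paragraphs": parser.all_paragraphs}]
    return {
        "title": parser.meta.get("dcterms.title"),
        "author": parser.meta.get("dcterms.creator"),
        "book_id": parser.meta.get("dcterms.identifier"),
        "chapters": chapters,
    }
```

Run against the sample file above, `parse_book` would return two chapters titled "Letter 1" and "Letter 2"; a file with only bare `<p>` tags would come back as one untitled chapter.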

---

## 5. Why This Loader Meets the Assignment Requirements

✔ HTML parsing supports real-world RAG ingestion  
✔ Metadata extraction aligns with `rag_content` schema fields  
✔ Embedding generation is documented and uses one configured model with a validated dimension  
✔ Paragraph-level chunking (with surrounding context) is explained  
✔ Idempotent inserts using checksums + `upsert`  
✔ Handles Unicode normalization for consistent embeddings  
✔ Emits all required schema fields  
✔ Compatible with pgvector and the FastAPI search layer  
✔ Ready for Deliverable #2 submission  

---
