Storing Gmail Newsletters in Google Sheets for AI Training
I wanted to start the AI Literary Cartographer Project by building a daily automation that extracts BookBar newsletter content from Gmail, transforms the messages, and stores the results in Google Sheets. This matters because I need to capture rich context from the newsletters I read regularly in order to give GPT meaningful background for my personalised book recommendations.
Workflow Overview
The system is built in Pipedream and runs on a daily schedule. Here’s the complete pipeline:
1. Trigger (Daily at 3:10 PM)
User story
As a system, I must automatically initiate the email scraping process daily, so that I can keep the database up to date with the latest BookBar newsletter content.
Steps
- Trigger: Pipedream Cron Scheduler
- Schedule: Daily at 3:10 PM (Europe/London)
No code needed—just configure the time in the UI.
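If you prefer a custom cron expression (which Pipedream's Schedule trigger also supports), daily at 3:10 PM in that timezone looks like this:

# minute hour day-of-month month day-of-week
10 15 * * *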
2. Search Emails from BookBar
User story
As a system, I must search for emails from BookBar in Gmail, so that I can identify which messages need to be processed.
Steps
- App: Gmail (Search Email)
- Query:
from:bookbar
- Limit: 100
- Output: Email message objects
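Because the workflow runs daily, one optional refinement is to narrow the query with Gmail's newer_than search operator, so each run only fetches recent messages instead of re-reading (and re-appending) the whole history:

from:bookbar newer_than:1d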
3. Extract Email Content
User story
As a system, I must extract full MIME content from each email, so that I can access the subject, plain text, HTML, and timestamp.
Steps
- App: Gmail (Process Gmail Messages)
- Input:
{{steps.search_emails_from_bookbar.$return_value}}
- Output: MIME content per message
Code
import gmail from "@pipedream/gmail"
export default defineComponent({
name: "Process Gmail Messages Array",
description: "Process array of Gmail messages from search results and extract raw MIME data and content from each email including nested MIME parts, attachments, HTML content, plain text, headers, and metadata with proper error handling",
type: "action",
props: {
gmail,
q: {
propDefinition: [
gmail,
"q",
],
description: "Gmail search query to find messages. Use Gmail's standard search operators (e.g., 'from:example@gmail.com', 'has:attachment', 'subject:important')",
},
maxResults: {
type: "integer",
label: "Max Results",
description: "Maximum number of messages to process (default: 10, max: 500)",
optional: true,
default: 10,
min: 1,
max: 500,
},
includeAttachmentData: {
type: "boolean",
label: "Include Attachment Data",
description: "Whether to download and include the actual attachment data (base64 encoded). Warning: This may significantly increase processing time and memory usage.",
optional: true,
default: false,
},
},
methods: {
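    // Recursively walk a message's MIME tree: decode base64 text bodies and,
    // if enabled, download attachment data for each part.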
async processMimeParts(parts, messageId) {
if (!parts || !Array.isArray(parts)) return [];
const processedParts = [];
for (const part of parts) {
const processedPart = {
partId: part.partId,
mimeType: part.mimeType,
filename: part.filename || null,
headers: part.headers || [],
body: {
size: part.body?.size || 0,
attachmentId: part.body?.attachmentId || null,
},
};
if (part.body?.data) {
try {
processedPart.body.data = Buffer.from(part.body.data, 'base64').toString('utf-8');
} catch (error) {
processedPart.body.dataError = `Failed to decode body data: ${error.message}`;
}
}
if (part.body?.attachmentId && this.includeAttachmentData) {
try {
const attachment = await this.gmail.getAttachment({
messageId,
attachmentId: part.body.attachmentId,
});
processedPart.body.attachmentData = attachment.data;
processedPart.body.attachmentSize = attachment.size;
} catch (error) {
processedPart.body.attachmentError = `Failed to fetch attachment: ${error.message}`;
}
}
if (part.parts && part.parts.length > 0) {
processedPart.parts = await this.processMimeParts(part.parts, messageId);
}
processedParts.push(processedPart);
}
return processedParts;
},
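    // Collect every decoded body of a given MIME type (e.g. text/plain),
    // recursing into nested multipart sections.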
extractContentByType(parts, mimeType) {
if (!parts) return [];
const content = [];
for (const part of parts) {
if (part.mimeType === mimeType && part.body?.data) {
try {
content.push({
partId: part.partId,
content: Buffer.from(part.body.data, 'base64').toString('utf-8'),
size: part.body.size,
});
} catch (error) {
content.push({
partId: part.partId,
error: `Failed to decode ${mimeType} content: ${error.message}`,
});
}
}
if (part.parts) {
content.push(...this.extractContentByType(part.parts, mimeType));
}
}
return content;
},
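    // Gather attachment metadata (filename, MIME type, size, attachmentId)
    // without downloading the files themselves.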
extractAttachments(parts) {
if (!parts) return [];
const attachments = [];
for (const part of parts) {
if (part.filename && part.body?.attachmentId) {
attachments.push({
partId: part.partId,
filename: part.filename,
mimeType: part.mimeType,
attachmentId: part.body.attachmentId,
size: part.body.size,
});
}
if (part.parts) {
attachments.push(...this.extractAttachments(part.parts));
}
}
return attachments;
},
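    // Flatten the message's header list into a lowercase-keyed object for easy lookup.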
extractHeaders(message) {
const headers = {};
const headersList = message.payload?.headers || [];
for (const header of headersList) {
const key = header.name.toLowerCase();
headers[key] = header.value;
}
return headers;
},
},
async run({ $ }) {
try {
$.export("$summary", "Searching for Gmail messages...");
const searchResults = await this.gmail.listMessages({
q: this.q,
maxResults: this.maxResults,
});
if (!searchResults.messages || searchResults.messages.length === 0) {
$.export("$summary", "No messages found matching the search criteria");
return {
totalMessages: 0,
messages: [],
};
}
$.export("$summary", `Processing ${searchResults.messages.length} messages...`);
const processedMessages = [];
let successCount = 0;
let errorCount = 0;
for (let i = 0; i < searchResults.messages.length; i++) {
const messageRef = searchResults.messages[i];
try {
const [fullMessage, rawMessage] = await Promise.all([
this.gmail.getMessage({ id: messageRef.id }),
this.gmail.getMessage({ id: messageRef.id, format: 'raw' }).catch(() => null),
]);
const headers = this.extractHeaders(fullMessage);
const mimeParts = await this.processMimeParts(fullMessage.payload.parts, messageRef.id);
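        // Note: this assumes multipart messages; a single-part email keeps its body
        // on payload.body rather than payload.parts and would yield empty content here.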
const textContentRaw = this.extractContentByType(fullMessage.payload.parts, 'text/plain');
const htmlContentRaw = this.extractContentByType(fullMessage.payload.parts, 'text/html');
const textContent = textContentRaw.map(p => p.content).join('\n') || '';
const htmlContent = htmlContentRaw.map(p => p.content).join('\n') || '';
const attachments = this.extractAttachments(fullMessage.payload.parts);
const processedMessage = {
id: fullMessage.id,
threadId: fullMessage.threadId,
labelIds: fullMessage.labelIds || [],
snippet: fullMessage.snippet,
historyId: fullMessage.historyId,
internalDate: fullMessage.internalDate,
sizeEstimate: fullMessage.sizeEstimate,
headers,
subject: headers.subject || '',
from: headers.from || '',
to: headers.to || '',
date: headers.date || '',
rawMimeData: rawMessage?.raw || null,
payload: {
partId: fullMessage.payload.partId,
mimeType: fullMessage.payload.mimeType,
filename: fullMessage.payload.filename,
headers: fullMessage.payload.headers,
parts: mimeParts,
},
content: {
html: htmlContent,
text: textContent,
},
attachments,
processedAt: new Date().toISOString(),
processingIndex: i + 1,
};
processedMessages.push(processedMessage);
successCount++;
} catch (error) {
errorCount++;
processedMessages.push({
id: messageRef.id,
error: error.message,
errorType: error.constructor.name,
processedAt: new Date().toISOString(),
processingIndex: i + 1,
});
}
}
const summary = `Successfully processed ${successCount} messages, ${errorCount} errors`;
$.export("$summary", summary);
return {
searchQuery: this.q,
totalMessages: searchResults.messages.length,
successCount,
errorCount,
includeAttachmentData: this.includeAttachmentData,
processedAt: new Date().toISOString(),
messages: processedMessages,
};
} catch (error) {
throw new Error(`Failed to process Gmail messages: ${error.message}`);
}
},
});
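For reference, here's an abridged sketch of a single element of $return_value.messages, trimmed to the fields the downstream steps rely on (values are placeholders):

{
  id: "…",               // Gmail message ID
  internalDate: "…",     // epoch milliseconds, as a string
  subject: "…",          // convenience copy of the Subject header
  content: {
    text: "…",           // decoded text/plain parts, joined with newlines
    html: "…",           // decoded text/html parts, joined with newlines
  },
  attachments: [],       // metadata only, unless Include Attachment Data is enabled
}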
4. Flatten Email Data (Node.js)
User story
As a system, I must flatten the nested structure of Gmail MIME messages, so that I can work with a clean array of objects.
Steps
- Language: Node.js 20.x
- Step Name:
flatten_email_data
- Input:
{{steps.extract_email_content.$return_value.messages}}
Code
export default defineComponent({
  props: {
    emailMessages: {
      type: "any",
      label: "Email Messages",
    },
  },
  async run({ $ }) {
    // The previous step already base64-decoded every MIME part and exposes
    // convenience fields, so read those directly instead of decoding again
    // (the part data is no longer base64 at this point).
    return this.emailMessages
      .filter(msg => !msg.error) // skip any message that failed upstream
      .map(msg => ({
        timestamp: msg.internalDate,
        subject: msg.subject || "(No Subject)",
        text: msg.content?.text || "",
        html: msg.content?.html || "",
      }));
  }
});
5. Transform for Sheets (Node.js)
User story
As a system, I must transform the email objects into row arrays, so that I can write them to Google Sheets.
Steps
- Language: Node.js 20.x
- Step Name:
transform_for_sheets
- Input:
{{steps.flatten_email_data.$return_value}}
Code
export default defineComponent({
props: {
flattenedEmails: {
type: "any",
label: "Flattened Email Data"
}
},
async run({ $ }) {
return this.flattenedEmails.map(email => [
email.timestamp,
email.subject,
email.text,
email.html
]);
}
});
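One optional refinement: internalDate is an epoch-milliseconds value stored as a string, which isn't very readable in a spreadsheet. A hypothetical variant of the map above (not what I deployed) converts it to ISO 8601 first:

return this.flattenedEmails.map(email => [
  // Gmail's internalDate is epoch milliseconds as a string; make it human-readable
  new Date(Number(email.timestamp)).toISOString(),
  email.subject,
  email.text,
  email.html
]);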
6. Add Rows to Google Sheet
User story
As a system, I must store each email as a new row in Google Sheets, so that the data can be used later by the AI Literary Cartographer.
Steps
- App: Google Sheets
- Action: Add Multiple Rows
- Target:
ai_lc_newsletter_bookbar
- Input:
{{steps.transform_for_sheets.$return_value}}
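Add Multiple Rows expects an array of row arrays, one inner array per sheet row, which is exactly the shape transform_for_sheets returns:

[
  [timestamp, subject, text, html],
  [timestamp, subject, text, html],
]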
💬 Reflections
This was a surprisingly complex workflow 🥵.
What I learned (the hard way):
- MIME decoding issues were the biggest blockers, for sure. I originally tried libraries like quoted-printable and base64-arraybuffer, but these caused syntax errors in Pipedream's Node.js runtime. I ended up using native Buffer.from(..., "base64") decoding instead.
- Step configuration syntax matters. I kept hitting "expected X to be an array" errors caused by incorrect return-path references, like {{steps.extract_email_content}} instead of {{steps.extract_email_content.$return_value.messages}}.
- Visual debugging was critical. Screenshots and raw console logs helped confirm when payloads were successfully decoded.
- Clear separation of logic across steps prevented me from getting overwhelmed (kind of): each Node.js component has a single purpose.
- The final pipeline is highly reusable and can be cloned and adapted for any newsletter source: Daunt Books, Fitzcarraldo, etc.
What’s next
- Add support for additional newsletters
- Add a filter to detect and skip duplicate emails (see the sketch after this list)
- Feed structured data into the GPT agent (via embeddings or prompt context)
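For the duplicate filter, here's a minimal sketch of one possible approach, assuming Pipedream's built-in $.service.db key-value store and keying on Gmail message IDs. This is a hypothetical step, not yet part of the workflow:

export default defineComponent({
  props: {
    db: "$.service.db", // Pipedream's workflow-scoped key-value store
    emailMessages: {
      type: "any",
      label: "Email Messages",
    },
  },
  async run({ $ }) {
    // Keep a running list of Gmail message IDs already written to the sheet,
    // and drop any message seen on a previous run.
    const seen = (await this.db.get("processedIds")) || [];
    const fresh = this.emailMessages.filter(msg => !seen.includes(msg.id));
    await this.db.set("processedIds", [...seen, ...fresh.map(msg => msg.id)]);
    $.export("$summary", `${fresh.length} new, ${this.emailMessages.length - fresh.length} duplicates skipped`);
    return fresh;
  },
});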