Skip to content

Latest commit

 

History

History
548 lines (432 loc) · 19 KB

File metadata and controls

548 lines (432 loc) · 19 KB
title File Processors Guide
description Comprehensive file processing system supporting 20+ file types with intelligent content extraction, security sanitization, and provider-agnostic formatting
keywords
file-processors
multimodal
content-extraction
file-types
processor-registry
document-processing
security-sanitization

File Processors Guide

NeuroLink includes a comprehensive file processing system that supports 20+ file types with intelligent content extraction, security sanitization, and provider-agnostic formatting. This system enables seamless multimodal AI interactions across all 13 supported providers.

Overview

The file processor system is organized into a modular architecture:

src/lib/processors/
├── base/           # BaseFileProcessor abstract class and types
├── registry/       # ProcessorRegistry singleton for processor selection
├── config/         # MIME types, extensions, language maps, size limits
├── errors/         # FileErrorCode enum and error helpers
├── document/       # Excel, Word, RTF, OpenDocument processors
├── media/          # Video and Audio processors (metadata extraction)
├── archive/        # ZIP, TAR, GZ archive processors (file listing + content extraction)
├── markup/         # SVG, HTML, Markdown, Text processors
├── code/           # SourceCode, Config processors
├── data/           # JSON, YAML, XML processors
├── integration/    # FileProcessorIntegration for registry usage
└── cli/            # CLI helpers for file processing

Quick Start

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink();

const result = await neurolink.generate({
  input: {
    text: "Summarize this document",
    files: ["./report.pdf"],
  },
});

console.log(result.content);

Supported File Types

Documents

Type Extensions Processor Features
Excel .xlsx, .xls ExcelProcessor Multi-sheet extraction, cell formatting, data tables
Word .docx, .doc WordProcessor Text extraction, paragraph preservation
RTF .rtf RtfProcessor Rich text to plain text conversion
OpenDocument .odt, .ods, .odp OpenDocumentProcessor LibreOffice/OpenOffice format support

Data Files

Type Extensions Processor Features
JSON .json JsonProcessor Validation, pretty-printing, syntax highlighting
YAML .yaml, .yml YamlProcessor Validation, formatting, multi-document support
XML .xml XmlProcessor Parsing, validation, entity handling

Markup Files

Type Extensions Processor Features
HTML .html, .htm HtmlProcessor OWASP-compliant sanitization, text extraction
SVG .svg SvgProcessor XSS prevention, text injection (not binary)
Markdown .md, .markdown MarkdownProcessor Formatting preservation, metadata extraction
Text .txt TextProcessor Plain text handling, encoding detection

Source Code

Type Extensions Processor Features
TypeScript .ts, .tsx SourceCodeProcessor Language detection, syntax metadata
JavaScript .js, .jsx, .mjs SourceCodeProcessor Module detection
Python .py SourceCodeProcessor Docstring preservation
Java .java SourceCodeProcessor Package detection
Go .go SourceCodeProcessor Module awareness
Rust .rs SourceCodeProcessor Crate detection
C/C++ .c, .cpp, .h, .hpp SourceCodeProcessor Header handling
C# .cs SourceCodeProcessor Namespace detection
Ruby .rb SourceCodeProcessor Gem awareness
PHP .php SourceCodeProcessor Tag handling
Swift .swift SourceCodeProcessor Framework detection
Kotlin .kt, .kts SourceCodeProcessor Android/JVM awareness
Scala .scala SourceCodeProcessor SBT integration
Shell .sh, .bash, .zsh SourceCodeProcessor Shebang detection
SQL .sql SourceCodeProcessor Dialect hints
And 35+ more... Various SourceCodeProcessor Automatic language detection

Configuration Files

Type Extensions Processor Features
Environment .env, .env.* ConfigProcessor Secret masking option
INI .ini, .cfg ConfigProcessor Section parsing
TOML .toml ConfigProcessor Cargo.toml, pyproject.toml support
Properties .properties ConfigProcessor Java properties format

Media Files

Type Extensions Processor Features
Video .mp4, .mkv, .webm, .avi, .mov, .m4v VideoProcessor Duration, resolution, codec, frame rate, bitrate extraction via music-metadata
Audio .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma AudioProcessor Codec, bitrate, sample rate, channels, duration extraction via music-metadata

Video and audio files are not sent as binary to the AI provider. Instead, the processors extract structured metadata and return it as formatted text, keeping token usage minimal (~50-200 tokens per file).

Example video output:

Video File: presentation.mp4
Duration: 13s | Resolution: 640x360 | Video Codec: h264
Frame Rate: 29.97 fps | Bitrate: 345 kbps
Audio: aac, 48000 Hz, 2 channels

Example audio output:

Audio File: recording.mp3
Codec: MPEG 1 Layer 3 | Bitrate: 128 kbps
Sample Rate: 44100 Hz | Channels: 2 (Stereo) | Duration: 1:46

Archives

Type Extensions Processor Features
ZIP .zip ArchiveProcessor File listing with sizes, nested content extraction, ZIP bomb detection
TAR .tar ArchiveProcessor File listing with sizes
GZ .gz, .tar.gz, .tgz ArchiveProcessor Gzip decompression, tar content listing

Archive files return a structured listing of their contents with file sizes and optionally extract text from contained files (routing through existing processors).

Example archive output:

Archive: project.tar.gz
Total entries: 6

Files:
- code/sample.json (60 B)
- code/sample.py (195 B)
- document/sample.txt (607 B)

Security: Archive processing includes ZIP bomb detection (compression ratio limits), path traversal prevention, symlink blocking, entry count limits, and aggregate decompression size limits.

Usage

SDK Usage

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink();

// Process multiple file types in a single request
const result = await neurolink.generate({
  input: {
    text: "Analyze these files and summarize the key information",
    files: [
      "./data/report.xlsx", // Excel spreadsheet
      "./config/settings.yaml", // YAML configuration
      "./src/main.ts", // TypeScript source
      "./docs/architecture.svg", // SVG diagram (injected as text)
      "./api/schema.json", // JSON schema
    ],
  },
  provider: "vertex",
});

console.log(result.content);

CLI Usage

# Single file
neurolink generate "Analyze this spreadsheet" --file ./data.xlsx

# Multiple files
neurolink generate "Compare these configs" \
  --file ./config.yaml \
  --file ./settings.json \
  --file ./app.toml

# Mixed with images and PDFs
neurolink generate "Explain this codebase" \
  --file ./src/main.ts \
  --file ./docs/diagram.svg \
  --pdf ./docs/spec.pdf \
  --image ./screenshot.png

Stream Mode

// Streaming with file processing
const result = await neurolink.stream({
  input: {
    text: "Walk me through this code step by step",
    files: ["./src/algorithm.py"],
  },
});

for await (const chunk of result.stream) {
  if ("content" in chunk) {
    process.stdout.write(chunk.content);
  }
}

Architecture

ProcessorRegistry

The ProcessorRegistry is a singleton that manages all file processors with priority-based selection:

import { ProcessorRegistry } from "@juspay/neurolink";

// Get the singleton instance
const registry = ProcessorRegistry.getInstance();

// Register a custom processor (lower priority = higher precedence)
registry.register(new MyCustomProcessor(), 50);

// Find processor for a file
const processor = registry.findProcessor({
  filename: "data.xlsx",
  mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  size: 1024,
});

// Process a file
const result = await processor.process(fileInfo, fileContent);

BaseFileProcessor

All processors extend the abstract BaseFileProcessor class:

import { BaseFileProcessor, FileInfo, ProcessedFile } from "@juspay/neurolink";

export class MyProcessor extends BaseFileProcessor {
  readonly name = "my-processor";
  readonly supportedMimeTypes = ["application/x-my-format"];
  readonly supportedExtensions = [".myf"];

  canProcess(file: FileInfo): boolean {
    return this.supportedExtensions.includes(file.extension);
  }

  async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
    const text = this.extractText(content);
    return {
      type: "text",
      content: text,
      metadata: {
        processor: this.name,
        originalFilename: file.filename,
      },
    };
  }

  getInfo(): ProcessorInfo {
    return {
      name: this.name,
      description: "Processes MY format files",
      supportedMimeTypes: this.supportedMimeTypes,
      supportedExtensions: this.supportedExtensions,
    };
  }
}

FileDetector

The FileDetector utility automatically identifies file types:

import { FileDetector } from "@juspay/neurolink";

const detector = new FileDetector();

// Detect by extension
const type1 = detector.detect("report.xlsx");
// Returns: { type: "xlsx", mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" }

// Detect by content (magic bytes)
const type2 = detector.detectFromContent(buffer);

// SVG special handling - returns "svg" type, not "image"
const type3 = detector.detect("diagram.svg");
// Returns: { type: "svg", mimeType: "image/svg+xml" }

Security Features

OWASP-Compliant Sanitization

The markup processors include security sanitization to prevent XSS and injection attacks:

HTML Sanitization

// HtmlProcessor automatically sanitizes HTML content
// - Removes <script> tags
// - Strips event handlers (onclick, onerror, etc.)
// - Removes javascript: URLs
// - Sanitizes style attributes
// - Blocks dangerous protocols

const result = await neurolink.generate({
  input: {
    text: "Summarize this HTML content",
    files: ["./untrusted-content.html"], // Automatically sanitized
  },
});

SVG Sanitization

// SvgProcessor sanitizes SVG before injection
// - Removes embedded scripts
// - Strips foreignObject elements
// - Sanitizes use/href attributes
// - Blocks external entity references

// SVG is injected as TEXT, not as binary image
// This prevents image-based attacks while preserving vector content

File Size Limits

Default size limits prevent denial-of-service attacks:

Category Default Limit Configurable
Documents 50 MB Yes
Data files 10 MB Yes
Code files 5 MB Yes
Config files 1 MB Yes
Images 20 MB Yes
import { ProcessorConfig } from "@juspay/neurolink";

// Configure size limits
ProcessorConfig.setLimits({
  maxDocumentSize: 100 * 1024 * 1024, // 100 MB
  maxCodeSize: 10 * 1024 * 1024, // 10 MB
});

Error Handling

FileErrorCode Enum

import { FileErrorCode } from "@juspay/neurolink";

try {
  const result = await neurolink.generate({
    input: { files: ["./corrupted.xlsx"] },
  });
} catch (error) {
  if (error && typeof error === "object" && "code" in error) {
    switch (error.code) {
      case FileErrorCode.UNSUPPORTED_TYPE:
        console.log("File type not supported");
        break;
      case FileErrorCode.FILE_TOO_LARGE:
        console.log("File too large");
        break;
      case FileErrorCode.CORRUPTED_FILE:
        console.log("File is corrupted");
        break;
      case FileErrorCode.DOWNLOAD_AUTH_FAILED:
        console.log("Cannot read file");
        break;
    }
  }
}

Provider Compatibility

All file processors work across all 13 AI providers. The processed content is formatted as text that any provider can understand:

Provider Documents Data Markup Code Config
OpenAI
Anthropic
Google AI Studio
Google Vertex
AWS Bedrock
Azure OpenAI
Mistral
LiteLLM
Ollama
Hugging Face
SageMaker
OpenAI Compatible
OpenRouter

Note: For binary files like images and PDFs, provider-specific adapters handle the formatting. See PDF Support and Multimodal Chat.

Best Practices

1. Use Appropriate File Types

// Good: Use structured data formats for data
files: ["./data.json", "./config.yaml"];

// Avoid: Using unstructured text for structured data
files: ["./data.txt"]; // Harder for AI to parse

2. Combine Related Files

// Good: Group related files together
const result = await neurolink.generate({
  input: {
    text: "Review this module for best practices",
    files: [
      "./src/module.ts", // Implementation
      "./src/module.test.ts", // Tests
      "./src/module.types.ts", // Types
    ],
  },
});

3. Be Mindful of Token Limits

// For large files, consider chunking or summarization
import { ProcessorConfig } from "@juspay/neurolink";

// Enable automatic truncation for very large files
ProcessorConfig.setTruncation({
  enabled: true,
  maxTokens: 50000,
  strategy: "head-tail", // Keep beginning and end
});

4. Use Specific Prompts

// Good: Be specific about what to analyze
const result = await neurolink.generate({
  input: {
    text: "Find security vulnerabilities in this code, focusing on SQL injection and XSS",
    files: ["./src/api.ts"],
  },
});

// Less effective: Vague prompt
const result = await neurolink.generate({
  input: {
    text: "Look at this",
    files: ["./src/api.ts"],
  },
});

Extending the System

Creating a Custom Processor

import {
  BaseFileProcessor,
  FileInfo,
  ProcessedFile,
  ProcessorRegistry,
} from "@juspay/neurolink";

class ProtobufProcessor extends BaseFileProcessor {
  readonly name = "protobuf-processor";
  readonly supportedMimeTypes = ["application/x-protobuf"];
  readonly supportedExtensions = [".proto"];

  canProcess(file: FileInfo): boolean {
    return file.extension === ".proto";
  }

  async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
    const protoText = content.toString("utf-8");

    // Add syntax highlighting hints
    const formatted = `\`\`\`protobuf\n${protoText}\n\`\`\``;

    return {
      type: "text",
      content: formatted,
      metadata: {
        processor: this.name,
        language: "protobuf",
        filename: file.filename,
      },
    };
  }

  getInfo() {
    return {
      name: this.name,
      description: "Processes Protocol Buffer definition files",
      supportedMimeTypes: this.supportedMimeTypes,
      supportedExtensions: this.supportedExtensions,
    };
  }
}

// Register with priority 50 (lower = higher precedence)
ProcessorRegistry.getInstance().register(new ProtobufProcessor(), 50);

Related Documentation