Binary File Detection

Mikhail Deynekin edited this page Dec 23, 2025 · 2 revisions

Automatically detect and skip binary files to prevent corruption of output and maintain data integrity.

Overview

Multi-layer detection prevents accidental modification of binary files such as images, archives, and executables. sr-search-replace identifies binary content with four checks, applied in order:

  1. File Extension Analysis - Quick check using known binary extensions
  2. Magic Number Detection - Identifies file types by byte signatures
  3. Character Encoding Detection - Analyzes byte patterns for text encoding
  4. Statistical Analysis - Examines null bytes and character distribution

Detection Algorithm Flowchart

The binary detection process follows this comprehensive algorithm:

┌─────────────────────────────────────┐
│  FILE RECEIVED FOR PROCESSING       │
└──────────────┬──────────────────────┘
               │
               ▼
        ┌──────────────────┐
        │  Check if file   │
        │  is readable     │
        │  and accessible  │
        └──────┬───────────┘
               │
        ┌──────▼───────────┐
        │   SUCCESS?       │
        └─┬────────────┬───┘
          │ NO         │ YES
          │            │
          ▼            ▼
      ┌─────────┐   ┌──────────────────────┐
      │  ERROR  │   │ STEP 1: EXTENSION    │
      │  SKIP   │   │ Check file extension │
      └─────────┘   │ against binary list  │
                    └──────┬───────────────┘
                           │
                    ┌──────▼──────────┐
                    │ BINARY EXT?     │
                    └─┬───────────┬───┘
                      │ YES       │ NO
                      │           │
                      ▼           ▼
                  ┌─────────┐   ┌──────────────────────┐
                  │ SKIP    │   │ STEP 2: MAGIC NUMBER │
                  │ BINARY  │   │ Read first 512 bytes │
                  └─────────┘   │ Check signatures     │
                                └──────┬───────────────┘
                                       │
                                ┌──────▼──────────┐
                                │ MAGIC MATCH?    │
                                └─┬───────────┬───┘
                                  │ YES       │ NO
                                  │           │
                                  ▼           ▼
                              ┌─────────┐   ┌──────────────────────┐
                              │ SKIP    │   │ STEP 3: ENCODING     │
                              │ BINARY  │   │ Check byte patterns  │
                              └─────────┘   │ for UTF-8/ASCII      │
                                            └──────┬───────────────┘
                                                   │
                                            ┌──────▼──────────┐
                                            │ NULL BYTES?     │
                                            └─┬───────────┬───┘
                                              │ YES       │ NO
                                              │           │
                                              ▼           ▼
                                          ┌─────────┐   ┌──────────────────────┐
                                          │ SKIP    │   │ STEP 4: STATISTICS   │
                                          │ BINARY  │   │ Analyze byte distrib.│
                                          └─────────┘   │ Count control chars  │
                                                        └──────┬───────────────┘
                                                               │
                                                        ┌──────▼──────────┐
                                                        │ HIGH CONTROL    │
                                                        │ CHAR COUNT?     │
                                                        └─┬───────────┬───┘
                                                          │ YES       │ NO
                                                          │           │
                                                          ▼           ▼
                                                      ┌─────────┐   ┌──────────────┐
                                                      │ SKIP    │   │ TREAT AS     │
                                                      │ BINARY  │   │ TEXT/PROCESS │
                                                      └─────────┘   └──────────────┘
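
The flowchart can be read as a chain of early-exit checks. The following is a minimal Python sketch of that order, not the tool's actual code: the function name `classify`, the shortened extension and signature lists, and the choice to ignore tab, LF, and CR when counting control characters are all assumptions made for illustration. For brevity it reads the whole file into memory, which a real implementation would likely avoid.

```python
import os
from pathlib import Path

# Illustrative values only; the real lists live in the tool's configuration.
BINARY_EXTENSIONS = {".png", ".jpg", ".zip", ".exe", ".pdf"}
MAGIC_NUMBERS = (b"\x89PNG", b"\xff\xd8\xff", b"PK\x03\x04", b"\x7fELF", b"%PDF")
CONTROL_THRESHOLD_PERCENT = 5
TEXT_CONTROL_BYTES = {0x09, 0x0A, 0x0D}  # assumption: tab, LF, CR count as text


def classify(path):
    """Return 'error', 'binary', or 'text', following the flowchart order."""
    if not os.path.isfile(path) or not os.access(path, os.R_OK):
        return "error"

    # Step 1: extension check, no file I/O needed.
    if Path(path).suffix.lower() in BINARY_EXTENSIONS:
        return "binary"

    with open(path, "rb") as handle:
        data = handle.read()

    # Step 2: magic number within the first 512 bytes.
    if data[:512].startswith(MAGIC_NUMBERS):
        return "binary"

    # Step 3: any null byte marks the file as binary.
    if b"\x00" in data:
        return "binary"

    # Step 4: share of control characters compared with the threshold.
    if data:
        control = sum(1 for b in data if b < 0x20 and b not in TEXT_CONTROL_BYTES)
        if control * 100 / len(data) > CONTROL_THRESHOLD_PERCENT:
            return "binary"

    return "text"
```

Each step returns as soon as it has an answer, so the cheaper checks shield the more expensive ones.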

Enable Binary Detection

Use the --skip-binary flag to enable multi-layer binary file detection:

sr --find "pattern" --replace "text" --skip-binary --recursive .

Or with file pattern:

sr --find "pattern" --replace "text" --skip-binary *.txt

Detection Methods

sr-search-replace employs four complementary detection methods:

1. Null Byte Scanning

The most reliable and fastest method - scans for null bytes (0x00) which indicate binary content:

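# Any null byte indicates a binary file.
# $'\x00' expands to an empty string in bash, so use a PCRE escape instead (GNU grep).
if LC_ALL=C grep -qaP '\x00' file.bin; then echo "BINARY"; fi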

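Effectiveness: 99%+
Speed: Instant (stops at the first null byte found)
False Positives: Extremely rare for ASCII and UTF-8 text; UTF-16 and UTF-32 text contains null bytes and will be flagged as binary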
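
As an illustration, a null byte scan with early exit might look like this in Python. The function name `contains_null_byte` and the 64 KiB chunk size are assumptions chosen for the sketch, not values taken from sr-search-replace.

```python
def contains_null_byte(path, chunk_size=64 * 1024):
    """Return True as soon as a 0x00 byte is found, reading the file in chunks."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False  # reached end of file without finding a null byte
            if b"\x00" in chunk:
                return True  # early exit: the rest of the file is never read
```

Reading in chunks keeps memory use flat for large files, and the early return is what makes the check fast on typical binaries, where a null byte usually appears within the first few hundred bytes.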

2. Magic Number Analysis

Identifies file types by reading file signatures (magic bytes) from file headers:

# Examples of magic numbers:
# PNG:  89 50 4E 47 (hex) = \x89PNG
# JPEG: FF D8 FF (hex)
# ZIP:  50 4B 03 04 (hex) = PK\x03\x04
# ELF:  7F 45 4C 46 (hex) = \x7fELF

Common Binary Signatures:

  • Images: PNG, JPG, GIF, BMP
  • Archives: ZIP, RAR, TAR, GZIP
  • Executables: ELF, Mach-O, PE (Windows)
  • Documents: PDF, Office formats
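
A prefix comparison against a table of signatures is enough to sketch the idea. In the Python example below, the function name `sniff_magic` is hypothetical, and the table is a subset of well-known signatures rather than the tool's full list.

```python
# Signature -> format name. These byte sequences are standard file headers.
MAGIC_NUMBERS = {
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF89a": "gif",
    b"BM": "bmp",
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"\x7fELF": "elf",
    b"MZ": "pe",
}


def sniff_magic(header):
    """Return the format name whose signature prefixes `header`, or None."""
    for signature, name in MAGIC_NUMBERS.items():
        if header.startswith(signature):
            return name
    return None


def sniff_file(path, length=512):
    """Read the first `length` bytes of a file and look for a known signature."""
    with open(path, "rb") as handle:
        return sniff_magic(handle.read(length))
```

Two-byte signatures such as `BM` and `MZ` can also match ordinary text that happens to start with those letters, so a short match is weaker evidence than a four-byte one.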

3. Character Encoding Detection

Analyzes byte sequences to detect if file is valid UTF-8 or ASCII:

# Flag files that `file` does not report as UTF-8 or ASCII text
if file -i file.txt | grep -q "charset=binary\|charset=iso-8859-1\|charset=unknown"; then
    echo "POTENTIALLY BINARY"
fi

Indicators of Binary:

  • Invalid UTF-8 byte sequences
  • High proportion of control characters (0x00-0x1F)
  • Mixed or unknown encoding
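
One way to test for invalid UTF-8 sequences without shelling out to `file` is to run the bytes through a strict incremental decoder. This is a sketch under that assumption; the function name `is_valid_utf8` and the chunk size are hypothetical.

```python
import codecs


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8 (ASCII is a subset)."""
    # The incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as handle:
        try:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    # final=True raises if the file ends mid-character.
                    decoder.decode(b"", final=True)
                    return True
                decoder.decode(chunk)
        except UnicodeDecodeError:
            return False
```

A null byte is itself valid UTF-8, so this check complements the null byte scan rather than replacing it.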

4. Statistical Analysis

Examines overall byte distribution and control character frequency:

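# Percentage of control characters (0x00-0x1F); tab, LF, and CR are
# excluded here because they are normal in text
total=$(wc -c < file.txt)
control=$(LC_ALL=C tr -d '\t\n\r' < file.txt | LC_ALL=C tr -cd '\000-\037' | wc -c)
threshold=5  # percent
if [ "$total" -gt 0 ] && [ $((control * 100 / total)) -gt "$threshold" ]; then
    echo "BINARY"
fi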

Thresholds:

  • < 1% control chars: Likely text
  • 1-5% control chars: Borderline (use other methods)
  • > 5% control chars: Likely binary
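
The three bands can be expressed as a small classification function. The Python sketch below uses the 1% and 5% boundaries listed above; the function names, and the decision to leave tab, LF, and CR out of the count, are assumptions for illustration.

```python
# Assumption: these three control bytes are treated as ordinary text.
TEXT_CONTROL_BYTES = frozenset({0x09, 0x0A, 0x0D})


def control_char_percent(data):
    """Percentage of bytes in 0x00-0x1F, ignoring tab, LF, and CR."""
    if not data:
        return 0.0
    control = sum(
        1 for byte in data if byte < 0x20 and byte not in TEXT_CONTROL_BYTES
    )
    return control * 100.0 / len(data)


def classify_by_statistics(data):
    """Map the control-character percentage onto the three bands."""
    percent = control_char_percent(data)
    if percent < 1:
        return "likely text"
    if percent <= 5:
        return "borderline"
    return "likely binary"
```

A "borderline" result is the case where the other methods should decide.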

Configuration

Environment Variables

# Enable or disable binary detection
export SR_SKIP_BINARY="true"

# Custom binary file extensions (colon-separated)
export SR_BINARY_EXTENSIONS=".bin:.exe:.dll:.so:.dylib"

# Binary detection threshold (percentage of control chars)
export SR_BINARY_THRESHOLD="5"

# Timeout for binary detection (milliseconds)
export SR_BINARY_TIMEOUT="1000"

# Enable detailed binary detection logging
export SR_DEBUG_BINARY="true"
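
For illustration, here is how a Python program could read these variables into typed settings. The variable names come from the list above; the function name `load_binary_settings`, the returned dictionary keys, and the fallback defaults are assumptions, not the tool's documented behavior.

```python
import os


def _as_bool(value):
    """Interpret the string 'true' (any case) as True, anything else as False."""
    return value.strip().lower() == "true"


def load_binary_settings(environ=None):
    """Parse the SR_* binary detection variables from an environment mapping."""
    env = os.environ if environ is None else environ
    raw_extensions = env.get("SR_BINARY_EXTENSIONS", "")
    return {
        "skip_binary": _as_bool(env.get("SR_SKIP_BINARY", "false")),
        # Colon-separated list, e.g. ".bin:.exe"; empty entries are dropped.
        "extensions": [
            ext.strip().lower() for ext in raw_extensions.split(":") if ext.strip()
        ],
        "threshold_percent": int(env.get("SR_BINARY_THRESHOLD", "5")),
        "timeout_ms": int(env.get("SR_BINARY_TIMEOUT", "1000")),
        "debug": _as_bool(env.get("SR_DEBUG_BINARY", "false")),
    }
```

Passing a plain dictionary as `environ` makes the parsing easy to exercise without touching the real environment.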

Configuration File

{
  "skipBinary": true,
  "binaryExtensions": [
    ".bin", ".exe", ".dll", ".so", ".dylib",
    ".jpg", ".jpeg", ".png", ".gif", ".bmp",
    ".zip", ".rar", ".tar", ".gz", ".7z",
    ".pdf", ".doc", ".docx", ".xls", ".xlsx"
  ],
  "binaryThreshold": 5,
  "binaryTimeout": 1000,
  "debugBinary": false
}
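
A loader for a file of this shape could merge user values over defaults and reject obvious mistakes. The sketch below assumes the key names shown above; the function name `parse_config`, the default values, and the validation rules are hypothetical, and the location of the configuration file is not specified here, so the function takes the JSON text directly.

```python
import json

# Assumed defaults, mirroring the example values above.
DEFAULTS = {
    "skipBinary": True,
    "binaryExtensions": [],
    "binaryThreshold": 5,
    "binaryTimeout": 1000,
    "debugBinary": False,
}


def parse_config(text):
    """Parse JSON config text, fill in defaults, and validate the threshold."""
    loaded = json.loads(text)
    unknown = set(loaded) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    config = {**DEFAULTS, **loaded}
    if not 0 <= config["binaryThreshold"] <= 100:
        raise ValueError("binaryThreshold must be a percentage between 0 and 100")
    return config
```

Rejecting unknown keys catches typos such as `binaryTreshold` that would otherwise be silently ignored.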

Direct Code Configuration

In the Python implementation, modify these constants at the top of sr.py:

# Binary detection configuration
BINARY_THRESHOLD = 5  # Percentage of control chars (0x00-0x1F)
BINARY_TIMEOUT = 1000  # Milliseconds
SKIP_BINARY = True  # Enable by default

# Known binary extensions
BINARY_EXTENSIONS = {
    # Images
    '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.ico', '.tiff', '.webp',
    # Archives
    '.zip', '.rar', '.7z', '.tar', '.gz', '.bz2', '.xz',
    # Executables
    '.exe', '.dll', '.so', '.dylib', '.o', '.a', '.lib',
    # Documents
    '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
    # Media
    '.mp3', '.mp4', '.avi', '.mov', '.flv', '.mkv', '.wav'
}

# Magic number signatures for file identification
MAGIC_NUMBERS = {
    b'\x89PNG': 'png',
    b'\xff\xd8\xff': 'jpeg',
    b'GIF87a': 'gif',
    b'GIF89a': 'gif',
    b'BM': 'bmp',
    b'%PDF': 'pdf',
    b'PK\x03\x04': 'zip',
    b'Rar!': 'rar',
    b'\x7fELF': 'elf',
    b'MZ': 'pe',
}
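
As a usage illustration, an extension lookup against a set like `BINARY_EXTENSIONS` could be written as follows. The helper name `has_binary_extension` is hypothetical and the set is shortened; whether sr.py lower-cases the suffix before the lookup is an assumption.

```python
from pathlib import Path

# Shortened copy of the extension set, for a self-contained example.
BINARY_EXTENSIONS = {".jpg", ".png", ".zip", ".gz", ".exe", ".pdf"}


def has_binary_extension(filename):
    """Case-insensitive check of the final suffix against the known set."""
    # Path.suffix returns only the last suffix: "archive.tar.gz" -> ".gz".
    return Path(filename).suffix.lower() in BINARY_EXTENSIONS
```

Only the last suffix is examined, and a file with no suffix (for example `Makefile`) never matches, so it falls through to the content-based checks.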

In the Bash implementation, modify sr.sh:

# Binary detection configuration
readonly BINARY_THRESHOLD=5
readonly BINARY_TIMEOUT=1000
readonly SKIP_BINARY=true

# Known binary extensions
readonly BINARY_EXTENSIONS="bin exe dll so dylib jpg jpeg png gif bmp zip rar tar gz pdf"

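# Function to check if file is binary
is_binary() {
    local file="$1"
    local ext="${file##*.}"

    # Check extension: match a whole list entry, case-insensitively.
    # (A regex test with =~ would also match fragments such as "o" or "ar".)
    # ${ext,,} requires bash 4 or later.
    if [[ " $BINARY_EXTENSIONS " == *" ${ext,,} "* ]]; then
        return 0  # Binary
    fi

    # Check for null bytes. $'\x00' expands to an empty string in bash,
    # so a PCRE escape is used instead (GNU grep).
    if LC_ALL=C grep -qaP '\x00' "$file" 2>/dev/null; then
        return 0  # Binary
    fi

    return 1  # Text
}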

Supported Formats

Automatically Detected:

Images: PNG, JPG, GIF, BMP, ICO, TIFF, WebP (SVG is XML and is treated as text)

Archives: ZIP, TAR, GZ, BZ2, 7Z, RAR, XZ

Executables: ELF, PE (Windows), Mach-O (macOS), Cygwin executables

Documents: PDF, Office (DOC, DOCX, XLS, XLSX, PPT, PPTX), OpenOffice

Media: MP3, MP4, AVI, MOV, FLV, MKV, WAV, FLAC

Databases: SQLite, MySQL dump binary, PostgreSQL binary

Not Detected (Treated as Text):

  • Text files with binary-like extensions (may require --force-text)
  • XML, JSON, YAML, TOML files
  • Source code files (.c, .cpp, .js, .py, .rb, .go, .rs, etc.)
  • Configuration files (.conf, .ini, .cfg)
  • Markup files (.md, .rst, .asciidoc, .html)
  • Log files (.log, .txt)

Implementation Details

Detection Priority

The algorithm checks methods in this order:

  1. File Extension (Fastest) - O(1) operation
    • Pre-computed set of known binary extensions
    • Useful for common formats

  2. Magic Number (Fast) - O(1) operation on first 512 bytes
    • Identifies file type regardless of extension
    • Reliable for standard formats

  3. Null Byte Scanning (Very Fast) - O(n) but stops at first match
    • Most reliable single indicator
    • Finds binary status almost instantly

  4. Encoding Detection (Medium) - O(n) on full file read
    • Secondary confirmation
    • Analyzes character sequences

  5. Statistical Analysis (Slowest) - O(n) on full file
    • Last resort check
    • Uses byte distribution patterns

Performance Optimization

Two-Stage Detection:

Stage 1: Quick checks (extension, magic, null bytes) - ~1-10ms
Stage 2: Deep analysis (encoding, statistics) - Only if Stage 1 inconclusive

Caching:

  • Binary detection results cached per file
  • Cache invalidated on file modification
  • Significant speedup for repeated operations
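
A per-file cache with invalidation on modification can be sketched by storing each result next to a fingerprint of the file. The class name `DetectionCache` and the use of modification time plus size as the fingerprint are assumptions; the actual cache key used by sr-search-replace is not documented here.

```python
import os


class DetectionCache:
    """Memoize a detector's result per path until the file changes on disk."""

    def __init__(self, detector):
        self._detector = detector  # callable: path -> bool
        self._entries = {}         # path -> (fingerprint, result)

    def is_binary(self, path):
        stat = os.stat(path)
        # A change in mtime or size is taken to mean the file was modified.
        fingerprint = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]
        result = self._detector(path)
        self._entries[path] = (fingerprint, result)
        return result
```

The cache costs one `stat` call per lookup, which is far cheaper than re-reading file contents on repeated runs over a large tree.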

Early Exit:

  • Detection stops as soon as a file is identified as binary
  • No need to read entire file
  • Typical file analyzed in < 1ms

Override Detection

Force processing of files detected as binary:

# Force treat as text (skip binary detection)
sr --find "pattern" --replace "text" --force-text file.bin

# Process with confirmation
sr --find "pattern" --replace "text" --skip-binary --confirm *.bin

Use Cases for Override:

  1. Text-like Binary Formats: Some binary formats contain mostly text (e.g., SVG, XML in archives)

  2. Custom Text Encodings: Custom character sets not detected by standard methods

  3. Intentional Binary Modification: Rare case where binary editing is desired

  4. Testing: Developers testing sr-search-replace functionality

Performance Considerations

Detection Speed

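| Method          | Speed     | Files/sec | Best For         |
|-----------------|-----------|-----------|------------------|
| Extension       | Instant   | 1000+     | Quick filtering  |
| Magic Number    | < 1ms     | 100+      | Standard formats |
| Null Byte Scan  | 1-5ms     | 50+       | Most cases       |
| Encoding Check  | 5-20ms    | 10-20     | Uncertain files  |
| Full Statistics | 20-100ms  | 1-10      | Edge cases       |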

Memory Usage

  • Binary detection: ~1-5 MB per file (buffer size dependent)
  • Cache storage: ~1 KB per detected file
  • No significant memory overhead

Recommendations

  • Use --skip-binary by default for safety
  • Combine with --exclude for faster processing: sr --skip-binary --exclude "*.bin" --exclude "*.exe" ...
  • Enable caching for large directory trees
  • Use --dry-run before processing binary-heavy directories
