Binary File Detection
Automatically detect and skip binary files to prevent corruption of output and maintain data integrity.
- Overview
- Detection Algorithm Flowchart
- Enable Binary Detection
- Detection Methods
- Configuration
- Supported Formats
- Implementation Details
- Override Detection
- Performance Considerations
- See Also
Multi-layer detection prevents accidental modification of binary files such as images, archives, and executables. sr-search-replace uses a multi-layer approach to identify binary content:
- File Extension Analysis - Quick check using known binary extensions
- Magic Number Detection - Identifies file types by byte signatures
- Character Encoding Detection - Analyzes byte patterns for text encoding
- Statistical Analysis - Examines null bytes and character distribution
The binary detection process follows this comprehensive algorithm:
┌─────────────────────────────────────┐
│    FILE RECEIVED FOR PROCESSING     │
└──────────────┬──────────────────────┘
               │
               ▼
        ┌──────────────────┐
        │ Check if file    │
        │ is readable      │
        │ and accessible   │
        └──────┬───────────┘
               │
        ┌──────▼───────────┐
        │ SUCCESS?         │
        └─┬────────────┬───┘
          │ NO         │ YES
          │            │
          ▼            ▼
     ┌─────────┐ ┌──────────────────────┐
     │ ERROR   │ │ STEP 1: EXTENSION    │
     │ SKIP    │ │ Check file extension │
     └─────────┘ │ against binary list  │
                 └──────┬───────────────┘
                        │
                 ┌──────▼──────────┐
                 │ BINARY EXT?     │
                 └─┬───────────┬───┘
                   │ YES       │ NO
                   │           │
                   ▼           ▼
              ┌─────────┐ ┌──────────────────────┐
              │ SKIP    │ │ STEP 2: MAGIC NUMBER │
              │ BINARY  │ │ Read first 512 bytes │
              └─────────┘ │ Check signatures     │
                          └──────┬───────────────┘
                                 │
                          ┌──────▼──────────┐
                          │ MAGIC MATCH?    │
                          └─┬───────────┬───┘
                            │ YES       │ NO
                            │           │
                            ▼           ▼
                       ┌─────────┐ ┌──────────────────────┐
                       │ SKIP    │ │ STEP 3: ENCODING     │
                       │ BINARY  │ │ Check byte patterns  │
                       └─────────┘ │ for UTF-8/ASCII      │
                                   └──────┬───────────────┘
                                          │
                                   ┌──────▼──────────┐
                                   │ NULL BYTES?     │
                                   └─┬───────────┬───┘
                                     │ YES       │ NO
                                     │           │
                                     ▼           ▼
                                ┌─────────┐ ┌──────────────────────┐
                                │ SKIP    │ │ STEP 4: STATISTICS   │
                                │ BINARY  │ │ Analyze byte distrib.│
                                └─────────┘ │ Count control chars  │
                                            └──────┬───────────────┘
                                                   │
                                            ┌──────▼──────────┐
                                            │ HIGH CONTROL    │
                                            │ CHAR COUNT?     │
                                            └─┬───────────┬───┘
                                              │ YES       │ NO
                                              │           │
                                              ▼           ▼
                                         ┌─────────┐ ┌──────────────┐
                                         │ SKIP    │ │ TREAT AS     │
                                         │ BINARY  │ │ TEXT/PROCESS │
                                         └─────────┘ └──────────────┘
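Read top to bottom, the flowchart is a chain of early-return checks. A minimal Python sketch of that chain follows. The function name `classify_file`, the shortened extension and signature tables, and the exclusion of tab, LF, and CR from the control-character count are assumptions made for illustration, not code taken from sr-search-replace.

```python
import os

# Illustrative subsets only; the real tables are longer.
BINARY_EXTENSIONS = {".png", ".jpg", ".zip", ".exe", ".pdf"}
MAGIC_PREFIXES = (b"\x89PNG", b"\xff\xd8\xff", b"PK\x03\x04", b"\x7fELF", b"%PDF")
TEXT_CONTROL = {0x09, 0x0A, 0x0D}  # assumption: tab, LF, CR count as normal text


def classify_file(path, threshold_pct=5.0):
    """Return 'error', 'binary', or 'text', checking in flowchart order."""
    # Readability check: unreadable or missing files are reported and skipped.
    if not (os.path.isfile(path) and os.access(path, os.R_OK)):
        return "error"
    # Step 1: extension lookup, no file I/O needed.
    if os.path.splitext(path)[1].lower() in BINARY_EXTENSIONS:
        return "binary"
    with open(path, "rb") as fh:
        # Step 2: magic number in the first 512 bytes.
        head = fh.read(512)
        if head.startswith(MAGIC_PREFIXES):
            return "binary"
        data = head + fh.read()
    # Step 3: any null byte marks the file as binary.
    if b"\x00" in data:
        return "binary"
    # Step 4: share of control characters in the whole file.
    if data:
        control = sum(1 for b in data if b < 0x20 and b not in TEXT_CONTROL)
        if control * 100.0 / len(data) > threshold_pct:
            return "binary"
    return "text"
```

Because each check returns as soon as it fires, a file with a listed extension is never opened.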
Use the --skip-binary flag to enable multi-layer binary file detection:
sr --find "pattern" --replace "text" --skip-binary --recursive .
Or with a file pattern:
sr --find "pattern" --replace "text" --skip-binary *.txt
sr-search-replace employs four complementary detection methods:
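The most reliable and fastest method: scan for null bytes (0x00), which indicate binary content:
# Any null byte indicates a binary file.
# grep -q $'\x00' does not work here: Bash drops the NUL, leaving an empty
# pattern that matches every file. Keep only NUL bytes with tr and stop at the first.
if [ "$(LC_ALL=C tr -cd '\000' < file.bin | head -c 1 | wc -c)" -gt 0 ]; then echo "BINARY"; fi
Effectiveness: 99%+
Speed: Instant (first null byte found)
False Positives: Extremely rare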
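The same null-byte check can be written in Python without loading the whole file. This is a sketch only: `has_null_byte` and its `chunk_size` parameter are hypothetical names, not part of sr-search-replace.

```python
def has_null_byte(path, chunk_size=64 * 1024):
    """Return True as soon as a 0x00 byte is found, reading the file in chunks."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False  # reached end of file without seeing a null byte
            if b"\x00" in chunk:
                return True  # early exit: the rest of the file is never read
```

One known limitation of any null-byte test: UTF-16 encoded text contains 0x00 bytes for most Latin characters, so a check like this classifies it as binary.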
Identifies file types by reading file signatures (magic bytes) from file headers:
# Examples of magic numbers:
# PNG: 89 50 4E 47 (hex) = \x89PNG
# JPEG: FF D8 FF (hex)
# ZIP: 50 4B 03 04 (hex) = PK\x03\x04
# ELF: 7F 45 4C 46 (hex) = \x7FELF
Common Binary Signatures:
- Images: PNG, JPG, GIF, BMP
- Archives: ZIP, RAR, TAR, GZIP
- Executables: ELF, Mach-O, PE (Windows)
- Documents: PDF, Office formats
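A signature lookup of this kind amounts to a prefix comparison on the first bytes of the file. The sketch below uses the four signatures listed above; the names `sniff_magic` and `detect_format` and the 512-byte default are illustrative assumptions rather than the tool's actual API.

```python
# Signature -> format name, taken from the examples above.
MAGIC_NUMBERS = {
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",
    b"\x7fELF": "elf",
}


def sniff_magic(header):
    """Return the format whose signature is a prefix of `header`, else None."""
    for signature, name in MAGIC_NUMBERS.items():
        if header.startswith(signature):
            return name
    return None


def detect_format(path, header_size=512):
    """Read only the file header and look it up in MAGIC_NUMBERS."""
    with open(path, "rb") as fh:
        return sniff_magic(fh.read(header_size))
```

Short signatures match more text by accident than long ones, which is one reason a magic-number hit is usually combined with the other checks.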
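Analyzes byte sequences to determine whether the file is valid UTF-8 or ASCII:
# Flag files whose detected charset is binary, unknown, or a legacy 8-bit encoding
# (--mime prints the MIME type and charset; it is the long form of -i on Linux)
if file --mime file.txt | grep -Eq "charset=(binary|unknown|iso-8859-1)"; then
  echo "POTENTIALLY BINARY"
fi
Indicators of Binary:
- Invalid UTF-8 sequences
- High proportion of control characters (0x00-0x1F, other than tab, line feed, and carriage return)
- Invalid character sequences
- Mixed or unknown encoding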
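The first indicator, invalid UTF-8 sequences, can be tested directly with Python's incremental decoder. This is a sketch under assumptions: `is_valid_utf8` is a hypothetical helper, and strict UTF-8 validation is only one of several possible encoding checks.

```python
import codecs


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8 (ASCII is a subset)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                # The decoder keeps incomplete multi-byte sequences between
                # calls, so a character split across chunks is not an error.
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails this check is not necessarily binary: it may be text in another encoding such as ISO-8859-1, which is why the result is best treated as one signal among several.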
Examines overall byte distribution and control character frequency:
# Percentage of control characters (0x00-0x1F), not counting tab, LF, and CR
total=$(wc -c < file.txt)
control=$(LC_ALL=C tr -d '\011\012\015\040-\377' < file.txt | wc -c)
threshold=5 # percent
if [ "$total" -gt 0 ] && [ $((control * 100 / total)) -gt "$threshold" ]; then
  echo "BINARY"
fi
Thresholds:
- < 1% control chars: Likely text
- 1-5% control chars: Borderline (use other methods)
- > 5% control chars: Likely binary
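The three threshold bands map naturally onto a small classifier. In this sketch the function names are hypothetical, and leaving tab, LF, and CR out of the count is an assumption: counting them would push ordinary text with short lines over the 5% mark.

```python
TEXT_CONTROL = frozenset({0x09, 0x0A, 0x0D})  # tab, line feed, carriage return


def control_char_percent(data):
    """Percentage of bytes in 0x00-0x1F, excluding tab, LF, and CR."""
    if not data:
        return 0.0
    count = sum(1 for b in data if b < 0x20 and b not in TEXT_CONTROL)
    return 100.0 * count / len(data)


def classify_by_statistics(data):
    """Map the control-character percentage onto the three bands above."""
    pct = control_char_percent(data)
    if pct < 1.0:
        return "likely text"
    if pct <= 5.0:
        return "borderline"
    return "likely binary"
```

A caller would typically act only on the two outer bands and fall back to the other detection methods for "borderline".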
# Enable or disable binary detection
export SR_SKIP_BINARY="true"
# Custom binary file extensions (colon-separated)
export SR_BINARY_EXTENSIONS=".bin:.exe:.dll:.so:.dylib"
# Binary detection threshold (percentage of control chars)
export SR_BINARY_THRESHOLD="5"
# Timeout for binary detection (milliseconds)
export SR_BINARY_TIMEOUT="1000"
# Enable detailed binary detection logging
export SR_DEBUG_BINARY="true"
JSON configuration:
{
"skipBinary": true,
"binaryExtensions": [
".bin", ".exe", ".dll", ".so", ".dylib",
".jpg", ".jpeg", ".png", ".gif", ".bmp",
".zip", ".rar", ".tar", ".gz", ".7z",
".pdf", ".doc", ".docx", ".xls", ".xlsx"
],
"binaryThreshold": 5,
"binaryTimeout": 1000,
"debugBinary": false
}
In the Python implementation, modify these constants at the top of sr.py:
# Binary detection configuration
BINARY_THRESHOLD = 5 # Percentage of control chars (0x00-0x1F)
BINARY_TIMEOUT = 1000 # Milliseconds
SKIP_BINARY = True # Enable by default
# Known binary extensions
BINARY_EXTENSIONS = {
# Images
'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.ico', '.tiff', '.webp',
# Archives
'.zip', '.rar', '.7z', '.tar', '.gz', '.bz2', '.xz',
# Executables
'.exe', '.dll', '.so', '.dylib', '.o', '.a', '.lib',
# Documents
'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
# Media
'.mp3', '.mp4', '.avi', '.mov', '.flv', '.mkv', '.wav'
}
# Magic number signatures for file identification
MAGIC_NUMBERS = {
b'\x89PNG': 'png',
b'\xff\xd8\xff': 'jpeg',
b'GIF87a': 'gif',
b'GIF89a': 'gif',
b'BM': 'bmp',
b'%PDF': 'pdf',
b'PK\x03\x04': 'zip',
b'Rar!': 'rar',
b'\x7fELF': 'elf',
b'MZ': 'pe',
}
In the Bash implementation, modify sr.sh:
# Binary detection configuration
readonly BINARY_THRESHOLD=5
readonly BINARY_TIMEOUT=1000
readonly SKIP_BINARY=true
# Known binary extensions
readonly BINARY_EXTENSIONS="bin exe dll so dylib jpg jpeg png gif bmp zip rar tar gz pdf"
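# Function to check if file is binary
is_binary() {
  local file="$1"
  local ext="${file##*.}"
  # Check extension as a whole word, so that "o" does not match "so"
  if [[ " $BINARY_EXTENSIONS " == *" $ext "* ]]; then
    return 0 # Binary
  fi
  # Check for null bytes. grep -q $'\x00' cannot be used: Bash drops the NUL,
  # leaving an empty pattern that matches every file.
  if [[ $(LC_ALL=C tr -cd '\000' 2>/dev/null < "$file" | head -c 1 | wc -c) -gt 0 ]]; then
    return 0 # Binary
  fi
  return 1 # Text
}
Images: PNG, JPG, GIF, BMP, ICO, TIFF, WebP, SVG (SVG treated as text)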
Archives: ZIP, TAR, GZ (Gzip), BZ2 (Bzip2), 7Z, RAR, XZ
Executables: ELF, PE (Windows), Mach-O (macOS), Cygwin executables
Documents: PDF, Office (DOC, DOCX, XLS, XLSX, PPT, PPTX), OpenOffice
Media: MP3, MP4, AVI, MOV, FLV, MKV, WAV, FLAC
Databases: SQLite, MySQL dump binary, PostgreSQL binary
Text formats (processed normally):
- Text files with binary-like extensions (may require --force-text)
- XML, JSON, YAML, TOML files
- Source code files (.c, .cpp, .js, .py, .rb, .go, .rs, etc.)
- Configuration files (.conf, .ini, .cfg)
- Markup files (.md, .rst, .asciidoc, .html, .xml)
- Log files (.log, .txt)
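An extension test over lists like these is a set lookup on the lowercased suffix. The sketch below is illustrative: `has_binary_extension` is a hypothetical helper, the set is a small subset of the formats above, and the `force_text` parameter only mimics what a `--force-text` override might do internally.

```python
import os

# Small illustrative subset of the binary formats listed above.
BINARY_EXTENSIONS = frozenset({".png", ".jpg", ".zip", ".gz", ".exe", ".bin", ".pdf"})


def has_binary_extension(path, force_text=False):
    """Return True if the final extension of `path` is a known binary one."""
    if force_text:
        return False  # caller explicitly asked to treat the file as text
    # splitext keeps only the last suffix, so "a.tar.gz" yields ".gz".
    ext = os.path.splitext(path)[1].lower()
    return ext in BINARY_EXTENSIONS
```

Lowercasing the suffix means `PHOTO.JPG` and `photo.jpg` are treated alike, and dotfiles such as `.bashrc` have no extension at all under `os.path.splitext`.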
The algorithm checks methods in this order:
1. File Extension (Fastest) - O(1) operation
   - Pre-computed set of known binary extensions
   - Useful for common formats
2. Magic Number (Fast) - O(1) operation on first 512 bytes
   - Identifies file type regardless of extension
   - Reliable for standard formats
3. Null Byte Scanning (Very Fast) - O(n) but stops at first match
   - Most reliable single indicator
   - Finds binary status almost instantly
4. Encoding Detection (Medium) - O(n) on full file read
   - Secondary confirmation
   - Analyzes character sequences
5. Statistical Analysis (Slowest) - O(n) on full file
   - Last resort check
   - Uses byte distribution patterns
Two-Stage Detection:
Stage 1: Quick checks (extension, magic, null bytes) - ~1-10ms
Stage 2: Deep analysis (encoding, statistics) - Only if Stage 1 inconclusive
Caching:
- Binary detection results cached per file
- Cache invalidated on file modification
- Significant speedup for repeated operations
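One common way to get this behavior is to key each cached result on the file's modification time and size, so any change to the file makes the old entry stale. The sketch below assumes an in-memory dictionary and a hypothetical `cached_is_binary` wrapper; the source does not say how sr-search-replace actually stores its cache.

```python
import os

# path -> ((mtime_ns, size), is_binary); in-memory only, for illustration
_cache = {}


def cached_is_binary(path, detect):
    """Return detect(path), reusing the previous result while the file is unchanged."""
    st = os.stat(path)
    fingerprint = (st.st_mtime_ns, st.st_size)
    entry = _cache.get(path)
    if entry is not None and entry[0] == fingerprint:
        return entry[1]  # cache hit: file not modified since last detection
    result = detect(path)  # cache miss or stale entry: run detection again
    _cache[path] = (fingerprint, result)
    return result
```

Here `detect` is any callable that takes a path and returns a boolean, so the wrapper can sit in front of whichever detection routine is in use.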
Early Exit:
- Detection stops as soon as a file is identified as binary
- No need to read entire file
- Typical file analyzed in < 1ms
Force processing of files detected as binary:
# Force treat as text (skip binary detection)
sr --find "pattern" --replace "text" --force-text file.bin
# Process with confirmation
sr --find "pattern" --replace "text" --skip-binary --confirm *.bin
Cases where an override may be needed:
- Text-like Binary Formats: Some binary formats contain mostly text (e.g., SVG, XML in archives)
- Custom Text Encodings: Custom character sets not detected by standard methods
- Intentional Binary Modification: Rare case where binary editing is desired
- Testing: Developers testing sr-search-replace functionality
| Method | Speed | Files/sec | Best For |
|---|---|---|---|
| Extension | Instant | 1000+ | Quick filtering |
| Magic Number | < 1ms | 100+ | Standard formats |
| Null Byte Scan | 1-5ms | 50+ | Most cases |
| Encoding Check | 5-20ms | 10-20 | Uncertain files |
| Full Statistics | 20-100ms | 1-10 | Edge cases |
- Binary detection: ~1-5 MB per file (buffer size dependent)
- Cache storage: ~1 KB per detected file
- No significant memory overhead
- Use --skip-binary by default for safety
- Combine with --exclude for faster processing: sr --skip-binary --exclude "*.bin" --exclude "*.exe" ...
- Enable caching for large directory trees
- Use --dry-run before processing binary-heavy directories
- Architecture Overview - System design
- Backup & Rollback - Recovery mechanisms
- Performance Tuning - Optimization tips
- Command Reference - All available options