Uniquely identify files before uploading with the HTML5 file API

When building a file uploader with the HTML5 File API, we often want to make sure the same file is not uploaded twice, where "same" means identical content rather than an identical name. Catching duplicates before the upload starts saves both storage space and bandwidth.

Hashing every file in full on the client (with MD5, for example) is reliable but slow for large files, because the browser has to read every byte before the upload can begin. No shortcut is both cheap and exact, so the methods below trade accuracy against speed.

Method 1: Basic File Properties Check

The simplest approach is to compare basic file properties like name, size, and last modified date:

<input type="file" id="fileInput" multiple>
<div id="output"></div>

<script>
// Signatures persist across selections, so a file picked again later is still caught
const fileSignatures = new Set();

document.getElementById('fileInput').addEventListener('change', function(event) {
    const files = Array.from(event.target.files);
    const duplicates = [];
    
    files.forEach(file => {
        const signature = `${file.name}-${file.size}-${file.lastModified}`;
        
        if (fileSignatures.has(signature)) {
            duplicates.push(file.name);
        } else {
            fileSignatures.add(signature);
        }
    });
    
    // textContent keeps file names from being interpreted as HTML
    const output = document.getElementById('output');
    if (duplicates.length > 0) {
        output.textContent = `Duplicate files detected: ${duplicates.join(', ')}`;
    } else {
        output.textContent = `No duplicates found. ${files.length} unique files selected.`;
    }
});
</script>
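Because this signature is built from metadata alone, it can both miss real duplicates and report false ones. The short sketch below illustrates the two cases. The helper name `metadataSignature` and the sample values are invented for illustration, and plain objects stand in for real `File` objects.

```javascript
// Hypothetical helper: builds the same signature format as the handler above.
function metadataSignature(file) {
    return `${file.name}-${file.size}-${file.lastModified}`;
}

// Plain objects stand in for File objects; all values are made up.
const original = { name: 'report.pdf', size: 1024, lastModified: 1700000000000 };
const renamedCopy = { name: 'report (1).pdf', size: 1024, lastModified: 1700000000000 };
const sameMetadata = { name: 'report.pdf', size: 1024, lastModified: 1700000000000 };

// A renamed copy may hold identical bytes, yet its signature differs (missed duplicate).
const missed = metadataSignature(original) !== metadataSignature(renamedCopy);

// Two files that merely share name, size, and timestamp get the same signature
// (possible false positive), because the content is never inspected.
const falsePositive = metadataSignature(original) === metadataSignature(sameMetadata);

console.log({ missed, falsePositive });
```

This is why the method is best treated as a quick first screen rather than a final answer.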

Method 2: Content-Based Hash (Partial)

For more accurate duplicate detection, we can hash part of each file's content. The example below reads a fixed window (the first 8 KB) and combines a simple hash of those bytes with the file size:

<input type="file" id="fileInput2" multiple>
<div id="output2"></div>

<script>
async function createFileHash(file) {
    const chunkSize = 8192; // Read first 8KB
    const chunk = file.slice(0, Math.min(chunkSize, file.size));
    const arrayBuffer = await chunk.arrayBuffer();
    
    // Simple 32-bit hash (for demonstration only, not collision-resistant)
    let hash = 0;
    const bytes = new Uint8Array(arrayBuffer);
    for (let i = 0; i < bytes.length; i++) {
        hash = ((hash << 5) - hash + bytes[i]) | 0;
    }
    // >>> 0 makes the value unsigned; the separator keeps hash and size unambiguous
    return `${(hash >>> 0).toString(36)}-${file.size}`;
}

// Hashes persist across selections, so a file picked again later is still caught
const fileHashes = new Map();

document.getElementById('fileInput2').addEventListener('change', async function(event) {
    const files = Array.from(event.target.files);
    const duplicates = [];
    
    for (const file of files) {
        const hash = await createFileHash(file);
        
        if (fileHashes.has(hash)) {
            duplicates.push(`${file.name} (duplicate of ${fileHashes.get(hash)})`);
        } else {
            fileHashes.set(hash, file.name);
        }
    }
    
    // textContent keeps file names from being interpreted as HTML
    const output = document.getElementById('output2');
    if (duplicates.length > 0) {
        output.textContent = `Content-based duplicates: ${duplicates.join(', ')}`;
    } else {
        output.textContent = `No content duplicates found in ${files.length} files.`;
    }
});
</script>

Method 3: Complete File Reading

To identify duplicates reliably, we have to read the complete content of each file and compare a cryptographic hash of it, such as SHA-256:

async function getFullFileHash(file) {
    const arrayBuffer = await file.arrayBuffer();
    const hashBuffer = await crypto.subtle.digest('SHA-256', arrayBuffer);
    const hashArray = Array.from(new Uint8Array(hashBuffer));
    return hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
}

// Note: file.arrayBuffer() loads the whole file into memory, so this approach
// is slow and memory-hungry for large files; use it carefully in production

Comparison

Method            | Accuracy | Performance | Best For
------------------|----------|-------------|------------------------
Basic Properties  | Low      | Fast        | Quick screening
Partial Hash      | Medium   | Good        | Balanced approach
Full Content Hash | High     | Slow        | Critical accuracy needs

Conclusion

For most web applications, partial file hashing offers a good balance between accuracy and performance. It can still report false matches, so choose a method based on how much accuracy you need and how much processing time you can afford.

Updated on: 2026-03-15T23:18:59+05:30
