Uniquely identify files before uploading with the HTML5 file API
While building a file uploader with the HTML5 File API, we often want to ensure that no duplicate files are uploaded, based on their actual content rather than just their names. This prevents wasting storage space and bandwidth on identical files.
Computing a full hash (with MD5 or similar) over every file is not efficient, because all of that work happens on the client side and reading large files is time-consuming. There is no perfect shortcut for this task; each approach below trades accuracy against speed.
Method 1: Basic File Properties Check
The simplest approach is to compare basic file properties like name, size, and last modified date:
<input type="file" id="fileInput" multiple>
<div id="output"></div>
<script>
document.getElementById('fileInput').addEventListener('change', function(event) {
    const files = Array.from(event.target.files);
    const fileSignatures = new Set();
    const duplicates = [];
    files.forEach(file => {
        const signature = `${file.name}-${file.size}-${file.lastModified}`;
        if (fileSignatures.has(signature)) {
            duplicates.push(file.name);
        } else {
            fileSignatures.add(signature);
        }
    });
    const output = document.getElementById('output');
    if (duplicates.length > 0) {
        output.innerHTML = `<p>Duplicate files detected: ${duplicates.join(', ')}</p>`;
    } else {
        output.innerHTML = `<p>No duplicates found. ${files.length} unique files selected.</p>`;
    }
});
</script>
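The weakness of this method is that it never looks at file content. A minimal sketch of the same signature logic, run on plain objects standing in for File entries (the comparison touches only name, size, and lastModified, so nothing else is needed; findPropertyDuplicates is an illustrative name), shows that an exact copy is flagged but a renamed copy of the same bytes slips through:

```javascript
// Property-based signature, as in Method 1
function propertySignature(file) {
    return `${file.name}-${file.size}-${file.lastModified}`;
}

// Collect names of files whose signature was already seen
function findPropertyDuplicates(files) {
    const seen = new Set();
    const duplicates = [];
    for (const file of files) {
        const sig = propertySignature(file);
        if (seen.has(sig)) {
            duplicates.push(file.name);
        } else {
            seen.add(sig);
        }
    }
    return duplicates;
}
```

A renamed copy produces a different signature, so identical content under two names is reported as two unique files; this is exactly the gap the content-based methods below close.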
Method 2: Content-Based Hash (Partial)
For more accurate duplicate detection, we can hash the first few kilobytes of each file and combine that with the file size:
<input type="file" id="fileInput2" multiple>
<div id="output2"></div>
<script>
async function createFileHash(file) {
    const chunkSize = 8192; // Read the first 8 KB only
    const chunk = file.slice(0, Math.min(chunkSize, file.size));
    const arrayBuffer = await chunk.arrayBuffer();
    // Simple 32-bit rolling hash (for demonstration)
    let hash = 0;
    const bytes = new Uint8Array(arrayBuffer);
    for (let i = 0; i < bytes.length; i++) {
        hash = ((hash << 5) - hash + bytes[i]) & 0xffffffff;
    }
    // Convert to unsigned and add a separator so that different
    // hash/size combinations cannot produce the same key
    return (hash >>> 0).toString(36) + '-' + file.size;
}
document.getElementById('fileInput2').addEventListener('change', async function(event) {
    const files = Array.from(event.target.files);
    const fileHashes = new Map();
    const duplicates = [];
    for (const file of files) {
        const hash = await createFileHash(file);
        if (fileHashes.has(hash)) {
            duplicates.push(`${file.name} (duplicate of ${fileHashes.get(hash)})`);
        } else {
            fileHashes.set(hash, file.name);
        }
    }
    const output = document.getElementById('output2');
    if (duplicates.length > 0) {
        output.innerHTML = `<p>Content-based duplicates: ${duplicates.join(', ')}</p>`;
    } else {
        output.innerHTML = `<p>No content duplicates found in ${files.length} files.</p>`;
    }
});
</script>
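The hashing step can be isolated and checked on raw bytes, independent of any File object. A sketch of the same rolling hash over the first chunkSize bytes of an in-memory Uint8Array (partialHashKey is a hypothetical helper name; the unsigned conversion and the `-` separator are choices made here for stable keys):

```javascript
// Rolling 32-bit hash over at most the first `chunkSize` bytes,
// with the total size appended, mirroring the partial-hash method
function partialHashKey(bytes, totalSize, chunkSize = 8192) {
    let hash = 0;
    const n = Math.min(chunkSize, bytes.length);
    for (let i = 0; i < n; i++) {
        hash = ((hash << 5) - hash + bytes[i]) & 0xffffffff;
    }
    return (hash >>> 0).toString(36) + '-' + totalSize;
}
```

Identical bytes always produce identical keys, and appending the size separates files that merely share a common prefix. Two different files with the same size and the same first 8 KB would still collide, which is the accuracy cost of not reading everything.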
Method 3: Complete File Reading
If we need to identify duplicate files with complete certainty, we have to read and hash the entire content of each file:
async function getFullFileHash(file) {
    // crypto.subtle is only available in secure contexts (HTTPS or localhost)
    const arrayBuffer = await file.arrayBuffer();
    const hashBuffer = await crypto.subtle.digest('SHA-256', arrayBuffer);
    const hashArray = Array.from(new Uint8Array(hashBuffer));
    return hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
}
// Note: this approach is computationally expensive for large files
// and should be used carefully in production environments
Comparison
| Method | Accuracy | Performance | Best For |
|---|---|---|---|
| Basic Properties | Low | Fast | Quick screening |
| Partial Hash | Medium | Good | Balanced approach |
| Full Content Hash | High | Slow | Critical accuracy needs |
Conclusion
For most web applications, using partial file hashing provides a good balance between accuracy and performance. Choose the method based on your specific requirements for duplicate detection accuracy versus processing speed.
