HASH-BASED ARCHIVE FILE SYSTEM FOR OPTIMIZED SMALL FILE HANDLING IN HADOOP
ORDINARY APPLICATION
Published
Filed on 10 November 2024
Abstract
The present disclosure introduces a hash-based archive file system for optimized small file handling in HDFS 100 that enhances data management and retrieval efficiency. This system consolidates small files 102 into merged part files 104, reducing metadata load on the Name Node. Initial metadata is temporarily held in temporary master index file 118, with file names stored in master name file 120 for reference and appending. Metadata is dynamically distributed to slave index files 110 using splittable hash function 106 and scalable hash function 108, ensuring balanced distribution. Hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 sorts metadata into sorted slave index files 114, which is then transferred to final index files 116 for organized, partial loading. The system incorporates threshold monitoring and management 122 for scalability and parallel processing module 124 for rapid file creation. File retrieval and access mechanism 126 enables real-time, selective metadata loading for efficient access. (Reference: Fig. 1)
Patent Information
Field | Value |
---|---|
Application ID | 202411086583 |
Invention Field | COMPUTER SCIENCE |
Date of Application | 10/11/2024 |
Publication Number | 47/2024 |
Inventors
Name | Address | Country | Nationality |
---|---|---|---|
Dr. Vijay Shankar Sharma | Assistant Professor (Sr. Scale), Department of Computer and Communication Engineering, Manipal University Jaipur, Dehmi Kalan, Near GVK Toll Plaza, Jaipur-Ajmer Expressway, Jaipur, Rajasthan 303007 | India | India |
Dr. N.C. Barwar | Professor, Department of Computer Science and Engineering, MBM University, Jodhpur, Rajasthan, India | India | India |
Applicants
Name | Address | Country | Nationality |
---|---|---|---|
Manipal University Jaipur | Jaipur-Ajmer Express Highway, Dehmi Kalan, Near GVK Toll Plaza, Jaipur, Rajasthan, India, 303007 | India | India |
Specification
Description: DETAILED DESCRIPTION
[00023] The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognise that other embodiments for carrying out or practising the present disclosure are also possible.
[00024] The description set forth below in connection with the appended drawings is intended as a description of certain embodiments of the hash-based archive file system for optimised small file handling in Hadoop and is not intended to represent the only forms that may be developed or utilised. The description sets forth the various structures and/or functions in connection with the illustrated embodiments; however, it is to be understood that the disclosed embodiments are merely exemplary of the disclosure that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimised to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
[00025] While the disclosure is susceptible to various modifications and alternative forms, specific embodiment thereof has been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[00026] The terms "comprises", "comprising", "include(s)", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, or system that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or system. In other words, one or more elements in a system or apparatus preceded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.
[00027] In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings and which are shown by way of illustration-specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.
[00028] The present disclosure will be described herein below with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.
[00029] Referring to Fig. 1, the hash-based archive file system for optimised small file handling in Hadoop 100 is disclosed, in accordance with one embodiment of the present invention. It comprises small files 102, merged part files 104, splittable hash function 106, scalable hash function 108, slave index files 110, hollow trie monotone minimal perfect hash function (HT-MMPHF) 112, sorted slave index files 114, final index files 116, temporary master index file 118, master name file 120, threshold monitoring and management 122, parallel processing module 124, and file retrieval and access mechanism 126.
[00030] Referring to Fig. 1, the Hash-Based Archive File (HBAF) system for optimized small file handling in Hadoop 100 is designed to tackle the inefficiencies associated with managing vast quantities of small files in the Hadoop Distributed File System (HDFS). Traditional HDFS implementations struggle with memory overload and reduced performance due to the high volume of metadata generated by small files. The HBAF system addresses this by employing a dual-hashing approach, combining Scalable-Splittable Hash Function (SSHF) and Hollow Trie Monotone Minimal Perfect Hash Function (HT-MMPHF), which distributes and organizes metadata efficiently, reducing memory consumption and improving retrieval speeds. This approach also supports real-time file appending and selective partial loading of metadata. Key components are small files 102, merged part files 104, and slave index files 110. The system uses scalable hash function 108 and splittable hash function 106 for dynamic metadata distribution, while hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 preserves metadata order. Final index files 116 and sorted slave index files 114 allow selective loading for faster access. Additional elements, such as temporary master index file 118 and parallel processing module 124, enhance scalability and real-time access for high-volume small file applications.
[00031] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with small files 102, which represent the collection of numerous small files that need to be managed in the Hadoop Distributed File System (HDFS). These small files 102 are initially loaded into the system, where they are consolidated into merged part files 104 to reduce the overhead on HDFS's Name Node. This component 102 is fundamental in initiating the process and reducing the excessive metadata load that typically burdens the name node when handling vast quantities of small files.
[00032] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with merged part files 104, which are generated by merging multiple small files 102 into larger, consolidated files. This merging minimizes the number of metadata entries in the HDFS, thereby reducing memory usage. Merged part files 104 play a critical role in improving system efficiency and work in tandem with slave index files 110 to ensure that metadata remains manageable.
[00033] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with splittable hash function 106, which is a specialized function used to divide and allocate metadata dynamically across different slave index files 110. This function 106 enhances scalability by allowing the metadata to be spread across multiple files rather than being concentrated in a single location, making it easier for HDFS to handle high volumes of small files.
[00034] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with scalable hash function 108, which works in conjunction with splittable hash function 106 to distribute metadata across slave index files 110. This function ensures that as more small files are added, the metadata distribution remains balanced, preventing any single slave index file 110 from becoming overloaded. This dynamic hashing approach supports the system's scalability.
[00035] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with slave index files 110, which act as intermediate metadata storage files. Metadata from small files 102 is initially distributed to these slave index files 110 using splittable hash function 106 and scalable hash function 108. This arrangement ensures that metadata is organized and accessible, reducing the load on the main indexing structure of the HDFS.
[00036] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with hollow trie monotone minimal perfect hash function (HT-MMPHF) 112, which is used to maintain the lexicographic order of metadata in slave index files 110 and final index files 116. This order-preserving function 112 allows for efficient metadata retrieval by enabling selective loading, where only the necessary portions of metadata are accessed, thus enhancing retrieval speed.
[00037] Referring to Fig. 1, hash-based archive file system for optimized small file handling in Hadoop 100 is provided with sorted slave index files 114, which are slave index files 110 that have been organized in lexicographical order through HT-MMPHF 112. This sorting process prepares the metadata for transfer to final index files 116, ensuring that data remains orderly and readily accessible.
[00038] Referring to Fig. 1, hash-based archive file system for optimized small file handling in Hadoop 100 is provided with final index files 116, which store the finalized and organized metadata that has been sorted through sorted slave index files 114. These files allow for fast access to specific small file metadata by supporting partial loading, which significantly improves retrieval efficiency.
[00039] The hash-based archive file system for optimized small file handling in Hadoop 100 is provided with temporary master index file 118 (not shown in the diagram), which serves as an intermediate storage location for metadata before it is distributed to slave index files 110. This file temporarily holds metadata, ensuring that it can be backed up and organized in a controlled manner before further processing.
[00040] The hash-based archive file system for optimized small file handling in Hadoop 100 is provided with master name file 120 (not shown in the diagram), which permanently stores the names of all small files within the archive. This component works closely with merged part files 104 to track the files that have been consolidated and added to the archive, enabling efficient file appending and reference management.
[00041] The hash-based archive file system for optimized small file handling in Hadoop 100 is provided with threshold monitoring and management 122 (not shown in the diagram), which tracks the capacity of both merged part files 104 and slave index files 110. When capacity limits are reached, this component 122 triggers the creation of new files, maintaining system performance and preventing memory overload.
[00042] The hash-based archive file system for optimized small file handling in Hadoop 100 is provided with parallel processing module 124 (not shown in the diagram), which enables the concurrent creation of merged part files 104 and index files 116. This module accelerates the merging and indexing processes, enhancing system performance and allowing for faster handling of large volumes of small files.
[00043] Referring to Fig. 1, the hash-based archive file system for optimized small file handling in Hadoop 100 is provided with file retrieval and access mechanism 126, which facilitates efficient, real-time access to metadata stored in final index files 116. By allowing selective partial loading of metadata, this mechanism 126 improves retrieval speed and ensures that the system can quickly respond to file access requests.
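As a hedged sketch of this selective partial loading (the on-disk layout, the JSON format, and names such as `bucket_id` and `lookup` are assumptions for illustration, not from the patent), retrieval can hash the requested file name to choose a single final index file and read only that file rather than the whole index:

```python
import hashlib
import json
from pathlib import Path

def bucket_id(name, bits=2):
    """Route a file name to one final index file via the low bits of its hash."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h & ((1 << bits) - 1)

def lookup(name, index_dir):
    """Load only the single final index file that can contain `name`."""
    index_file = Path(index_dir) / "final_index_{}.json".format(bucket_id(name))
    if not index_file.exists():
        return None
    entries = json.loads(index_file.read_text())  # partial load: one file, not all
    return entries.get(name)
```

The point of the sketch is the access path: the cost of a lookup is bounded by one index file, regardless of how many final index files exist.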
[00044] Referring to Fig. 2, method 200 for the hash-based archive file system 100 is illustrated. The method comprises the following steps:
At step 202, method 200 includes loading a set of small files 102 into the system for processing;
At step 204, method 200 includes merging each small file into merged part files 104 to consolidate data and reduce metadata entries;
At step 206, method 200 includes storing metadata for each small file in temporary master index file 118 as an initial holding location for metadata;
At step 208, method 200 includes recording each small file's name in master name file 120 for easy reference and appending functionalities;
At step 210, method 200 includes distributing metadata dynamically to slave index files 110 using splittable hash function 106 and scalable hash function 108, ensuring balanced distribution across files;
At step 212, method 200 includes sorting metadata within slave index files 110 into sorted slave index files 114 using hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 to maintain lexicographical order;
At step 214, method 200 includes transferring metadata from sorted slave index files 114 to final index files 116, organizing metadata for optimized retrieval;
At step 216, method 200 includes selectively loading metadata in final index files 116 for real-time access using file retrieval and access mechanism 126;
At step 218, method 200 includes monitoring file sizes in merged part files 104 and slave index files 110 using threshold monitoring and management 122 to prevent overloads and trigger new file creation as needed;
At step 220, method 200 includes executing concurrent processing through parallel processing module 124 to enable simultaneous creation of part and index files, enhancing the system's overall efficiency.
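The steps above can be strung together as a minimal in-memory sketch, under stated assumptions: plain Python dicts and a bytearray stand in for the HDFS files, the low two bits of an MD5 hash stand in for the SSHF bucket choice, and the function name `hbaf_build` is hypothetical:

```python
import hashlib

def hbaf_build(small_files):
    """Sketch of method 200: small_files is a list of (name, bytes) pairs."""
    merged_part = bytearray()                  # step 204: merged part file 104
    temp_master_index = {}                     # step 206: temporary master index 118
    master_names = []                          # step 208: master name file 120
    slave_index = {s: {} for s in range(4)}    # step 210: slave index files 110

    for name, data in small_files:             # step 202: load small files 102
        offset = len(merged_part)
        merged_part += data
        meta = (offset, len(data))
        temp_master_index[name] = meta
        master_names.append(name)
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        slave_index[h & 0b11][name] = meta     # distribute by 2-bit hash suffix

    # steps 212-214: sort each slave index, then emit final index files 116
    final_index = {s: dict(sorted(entries.items()))
                   for s, entries in slave_index.items()}
    return bytes(merged_part), master_names, final_index
```

Threshold monitoring (step 218) and parallelism (step 220) are omitted here to keep the data flow of steps 202-216 visible in one place.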
[00045] HBAF improves performance in two primary ways. First, by merging small files into merged part files 104, it reduces memory usage and alleviates the load on the Name Node. Second, the two-level hashing system enables faster access to small files by organizing metadata in a structured, efficient manner. The merging process occurs in parallel to enhance speed, creating multiple merged part files 104 concurrently, with a default level of parallelism set to two, though this can be adjusted.
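The parallel merging described above can be pictured with a pool of two workers building merged part files concurrently. This is an illustrative sketch, not the patented implementation: the round-robin grouping strategy, the part-file naming, and the (offset, length) metadata shape are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

PARALLELISM = 2  # default level of parallelism described in the text

def merge_group(group, part_path):
    """Concatenate one group of small files into one merged part file,
    returning (name, offset, length) metadata for each small file."""
    meta, offset = [], 0
    with Path(part_path).open("wb") as part:
        for f in group:
            data = Path(f).read_bytes()
            part.write(data)
            meta.append((Path(f).name, offset, len(data)))
            offset += len(data)
    return meta

def merge_all(small_files, out_dir, workers=PARALLELISM):
    """Create several merged part files concurrently."""
    groups = [small_files[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(merge_group, g, Path(out_dir) / "part-{}.bin".format(i))
            for i, g in enumerate(groups) if g
        ]
        return [f.result() for f in futures]
```

Raising `workers` adjusts the level of parallelism, mirroring the adjustable default of two mentioned above.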
[00046] The algorithm below provides the details for HBAF function.
Algorithm 1: Hash-Based Archive File (HBAF) Creation & Updating
Step 1: Initial Variable Declaration and Initialization
1.1 Define small files 102 - A set of small files to be processed by the system;
1.2 Define slave index file 110 - Temporary index files generated dynamically by scalable-splittable hash function (SSHF) 108;
1.3 Define merged part file 104 - Initial consolidated file created by merging small files to reduce metadata entries;
1.4 Define temporary master index file 118 - Temporary storage for initial metadata before final distribution;
1.5 Define master name file 120 - Permanent storage for tracking the names of all small files in the archive;
1.6 Define meta data - String variable for holding the metadata of each small file;
1.7 Define small file name - String variable for storing the name of each small file;
1.8 Define final index files 116 - Initially empty files intended to hold organized metadata after processing;
Step 2: Merging Small Files and Building Client-Side Slave Index Files
2.1 Start loop-1 over each small file in small files 102;
2.2 Merge each small file into merged part file 104 to consolidate and minimize the number of metadata entries;
2.3 Copy the metadata of the small file to meta data variable for further processing;
2.4 Copy the name of the small file to small file name variable;
2.5 Append metadata from meta data to temporary master index file 118 for initial metadata storage;
2.6 Append the name of the small file to master name file 120 for reference and appending support;
2.7 Use SSHF 108 to assign a unique ID to slave index file 110 and final index file 116 for orderly data management;
2.8 Append metadata from meta data to the designated slave index file 110 as directed by SSHF 108;
2.9 Check the threshold for slave index file 110 capacity. If exceeded, create a new slave index file 110 and final index file 116 using SSHF 108 and continue appending metadata;
2.10 end of loop-1 after processing all small files.
Step 3: Sorting Slave Index Files and Building Final Index Files
3.1 Start loop-2 over each slave index file 110 identified with a unique ID;
3.2 Sort the metadata within slave index file 110 to prepare for final organization;
3.3 Apply hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 to maintain order in slave index files 110;
3.4 Link HT-MMPHF 112 to the corresponding final index files 116, ensuring metadata order preservation;
3.5 Transfer sorted metadata from slave index files 110 to final index files 116, maintaining order integrity;
3.6 End loop-2 after processing all slave index files 110.
[00047] Referring to Algorithm 1 above, SSHF 108 is an extendible hash function used to allocate metadata dynamically to slave index files 110. Each hash is interpreted as a bit string and organized using a tree structure for efficient metadata lookup. Based on the last two bits of a file name's hash, SSHF 108 assigns metadata entries with matching bit patterns to the same slave index file 110. The function dynamically grows or shrinks slave index files 110 through a splitting mechanism, optimizing metadata storage and retrieval. When a slave index file 110 reaches its capacity limit, it splits, generating a new file and redistributing metadata for synchronization.
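A toy illustration of this splitting behaviour, hedged: the class below routes metadata to buckets by the low bits of an MD5 hash and splits a full bucket by consuming one more bit. The real SSHF, its tree structure, and its thresholds are only partially specified in the text, so the class name, the capacity value, and the dict-of-buckets representation are hypothetical.

```python
import hashlib

def name_hash(name):
    """Hash a file name to an integer; the patent's exact hash is unspecified."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class SlaveIndexes:
    """Toy extendible-hash directory in the spirit of SSHF 108: metadata is
    routed by the low bits of the name hash, and a bucket that exceeds its
    capacity threshold splits into two deeper buckets."""

    def __init__(self, bits=2, capacity=4):
        self.capacity = capacity
        # Each slave index is keyed by (depth, suffix); together the keys
        # partition the hash space, so every hash matches exactly one bucket.
        self.buckets = {(bits, s): {} for s in range(1 << bits)}

    def _bucket_key(self, h):
        for (depth, suffix) in self.buckets:
            if h & ((1 << depth) - 1) == suffix:
                return (depth, suffix)
        raise AssertionError("buckets must partition the hash space")

    def insert(self, name, meta):
        key = self._bucket_key(name_hash(name))
        depth, suffix = key
        bucket = self.buckets[key]
        bucket[name] = meta
        if len(bucket) > self.capacity:
            # Split: replace this bucket with two one-bit-deeper buckets
            # and redistribute its entries between them.
            del self.buckets[key]
            self.buckets[(depth + 1, suffix)] = {}
            self.buckets[(depth + 1, suffix | (1 << depth))] = {}
            for n, m in bucket.items():
                self.insert(n, m)
```

Splitting touches only the overflowing bucket's entries; the other slave index files are untouched, which is the property that lets the scheme grow without full re-indexing.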
[00048] HT-MMPHF 112 is used to establish an ordered, collision-free mapping of static keys (metadata entries) to unique integer values in final index files 116. This "minimal" hash function ensures each entry's sequential order, allowing direct and efficient access. Metadata entries are organized lexicographically by their hash values, with HT-MMPHF 112 preserving this order in final index files 116.
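The interface of HT-MMPHF 112 can be illustrated (not reproduced) with a plain sorted list: every key of a static set maps to its lexicographic rank, collision-free and order-preserving. A real hollow-trie MMPHF computes the same ranks in near-constant space per key; binary search over the stored keys is only a stand-in, and the class name here is hypothetical.

```python
from bisect import bisect_left

class MonotoneMinimalPerfectMap:
    """Interface stand-in for HT-MMPHF 112: maps each key of a static key
    set to its rank 0..n-1 (minimal), with no collisions (perfect), in
    lexicographic order (monotone)."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def rank(self, key):
        i = bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            # A perfect hash is defined only on the static key set.
            raise KeyError(key)
        return i
```

Because ranks follow lexicographic order, the rank can be used directly as an offset into a final index file, which is what makes selective partial loading possible.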
[00049] In the description of the present invention, it is also to be noted that, unless otherwise explicitly specified or limited, the terms "fixed" "attached" "disposed," "mounted," and "connected" are to be construed broadly, and may for example be fixedly connected, detachably connected, or integrally connected, either mechanically or electrically. They may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood in specific cases to those skilled in the art.
[00050] Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural where appropriate.
[00051] Although embodiments have been described with reference to a number of illustrative embodiments thereof, it should be understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure. More particularly, various variations and modifications are possible in the component parts and/or arrangements of the subject combination arrangement within the scope of the present disclosure, the drawings and the appended claims. In addition to variations and modifications in the component parts and/or arrangements, alternative uses will also be apparent to those skilled in the art.
Claims:
WE CLAIM:
1. A hash-based archive file system for optimised small file handling in Hadoop 100, comprising:
small files 102 to serve as the initial input for processing and management in the system;
merged part files 104 to consolidate small files and reduce metadata load on the Name Node;
temporary master index file 118 to temporarily hold metadata before distribution to slave index files;
master name file 120 to maintain a record of all small files for reference and appending;
splittable hash function 106 to divide and allocate metadata dynamically across slave index files;
scalable hash function 108 to ensure balanced metadata distribution across files as data grows;
slave index files 110 to act as intermediate storage for dynamically distributed metadata;
sorted slave index files 114 to maintain lexicographical order of metadata for efficient access;
hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 to preserve metadata order and support partial loading in final index files;
final index files 116 to store organized, selectively loadable metadata for optimized retrieval;
threshold monitoring and management 122 to track file sizes and manage file creation as thresholds are reached;
parallel processing module 124 to enable simultaneous file creation and enhance processing speed; and
file retrieval and access mechanism 126 to allow real-time selective loading of metadata for faster access.
2. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein a scalable-splittable hash function (SSHF 108) is configured to dynamically allocate metadata across multiple slave index files 110, balancing metadata distribution to prevent overloads in any single file, thereby enhancing the scalability and performance of metadata management in HDFS.
3. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein SSHF 108 is configured to trigger a split operation in any slave index file 110 that reaches its capacity threshold, dynamically creating new slave index files 110 and redistributing metadata entries, enabling adaptive growth and maintenance of balanced metadata distribution.
4. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein hollow trie monotone minimal perfect hash function (HT-MMPHF 112) is employed to organize metadata entries in slave index files 110 in lexicographical order and map these entries sequentially within final index files 116, providing an efficient structure for fast metadata access.
5. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein SSHF 108 and HT-MMPHF 112 are employed in combination to provide two-level metadata organization, where SSHF 108 dynamically distributes metadata across slave index files 110 and HT-MMPHF 112 sequentially arranges metadata in final index files 116, facilitating scalable and rapid file access across large datasets.
6. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein SSHF 108 uses a bit-string-based structure with an ordered tree configuration, assigning metadata to specific slave index files 110 based on the last two bits of each small file's hash, allowing efficient and dynamic storage of metadata entries without performance degradation as data volumes increase.
7. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein HT-MMPHF 112 ensures minimal space complexity by maintaining only the required number of entries in final index files 116, allowing for efficient metadata management with reduced memory usage on the Name Node, thus optimizing HDFS performance for small file handling.
8. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein SSHF 108 belongs to a class of extendible hashing techniques that enables dynamic metadata allocation and adaptive expansion of slave index files 110, reducing the need for full re-indexing while maintaining fast lookup and retrieval efficiency.
9. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein the interaction between SSHF 108 and HT-MMPHF 112 minimizes the time complexity of metadata retrieval by enabling direct and efficient metadata access paths in final index files 116, thereby reducing data retrieval time for large volumes of small files.
10. The hash-based archive file system for optimized small file handling in HDFS 100 as claimed in claim 1, wherein a method comprises:
loading a set of small files 102 into the system for processing;
merging each small file into merged part files 104 to consolidate data and reduce metadata entries;
storing metadata for each small file in temporary master index file 118 as an initial holding location for metadata;
recording each small file's name in master name file 120 for easy reference and appending functionalities;
distributing metadata dynamically to slave index files 110 using splittable hash function 106 and scalable hash function 108, ensuring balanced distribution across files;
sorting metadata within slave index files 110 into sorted slave index files 114 using hollow trie monotone minimal perfect hash function (HT-MMPHF) 112 to maintain lexicographical order;
transferring metadata from sorted slave index files 114 to final index files 116, organizing metadata for optimized retrieval;
selectively loading metadata in final index files 116 for real-time access using file retrieval and access mechanism 126;
monitoring file sizes in merged part files 104 and slave index files 110 using threshold monitoring and management 122 to prevent overloads and trigger new file creation as needed; and
executing concurrent processing through parallel processing module 124 to enable simultaneous creation of part and index files, enhancing the system's overall efficiency.
Documents
Name | Date |
---|---|
202411086583-COMPLETE SPECIFICATION [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-DECLARATION OF INVENTORSHIP (FORM 5) [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-DRAWINGS [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-EDUCATIONAL INSTITUTION(S) [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-EVIDENCE FOR REGISTRATION UNDER SSI [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-FIGURE OF ABSTRACT [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-FORM 1 [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-FORM FOR SMALL ENTITY(FORM-28) [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-FORM-9 [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-POWER OF AUTHORITY [10-11-2024(online)].pdf | 10/11/2024 |
202411086583-REQUEST FOR EARLY PUBLICATION(FORM-9) [10-11-2024(online)].pdf | 10/11/2024 |