Deduplication of data items in preparation for archiving

This is a business opportunity from an overseas buyer. Pitch for the business and explain how your company meets their requirements.

NCI Agency is seeking information from nations and their industry regarding the availability of solutions among all NATO Nations.

Details

Opportunity closing date
03 April 2023
Opportunity publication date
10 March 2023
Opportunity type
NATO
Industry
Defence
Enquiries received
0
Value of contract
to be confirmed

Description

The NCIA is exploring the potential for industry to perform complex data preparation and analysis tasks on several hundred terabytes (TB) of data in diverse formats. The main objective of this task will be to identify and mark the duplicated items in a given data set. Two options are being considered for this activity.

Under option A, all work is performed by the contractor at their premises with facility clearance, on their CIS equipment with appropriate security accreditation, by staff holding a Personal Security Clearance (PSC).

Under option B, all work is performed by the contractor at NCIA’s premises, on NCIA provided CIS equipment, by staff holding a PSC.

Both options require industry to:
- receive and ingest data
- reconstruct the data sets
- process the data sets
- extract the metadata of the items in data sets
- migrate/export to the formats suitable for digital preservation
- report the summary statistics of ingested dataset
- report on results of identified duplicates, redundant libraries and non-record items within the datasets
- enable the action to “keep”, “merge” or “delete” data items
- create an archival record set, consisting of records and metadata of permanent value, in accordance with NATO directives
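The summary statistics named in the list above could be computed along the following lines. This is a minimal sketch, not part of the notice: it assumes each ingested item has already been reduced to a hypothetical `(uri, content_hash, size_in_bytes)` tuple, and the function name `summarize` and the output keys are illustrative only.

```python
from collections import Counter


def summarize(items: list[tuple[str, str, int]]) -> dict[str, float]:
    """Summarise an ingested data set given (uri, content_hash, size_in_bytes) tuples.

    Assumption: two items with the same content hash are binary duplicates.
    """
    copies_per_hash = Counter(content_hash for _, content_hash, _ in items)
    size_per_hash = {content_hash: size for _, content_hash, size in items}

    total_items = len(items)
    # Every copy beyond the first one with a given hash counts as a duplicate.
    duplicate_items = total_items - len(copies_per_hash)
    total_bytes = sum(size for _, _, size in items)
    unique_bytes = sum(size_per_hash.values())

    return {
        "total_items": total_items,
        "duplicate_items": duplicate_items,
        "duplicate_pct": 100.0 * duplicate_items / total_items if total_items else 0.0,
        "total_bytes": total_bytes,
        "unique_bytes": unique_bytes,
    }
```

Comparing `unique_bytes` with `total_bytes` gives the "unique file size vs all file size" figure, which indicates how much storage the de-duplicated archive would need.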

High level requirements – Option A

All work to be carried out at the contractor's premises, on the contractor's CIS.

The contractor must have an operating environment with necessary accreditations and approvals to handle NS data.

The contractor must develop workflows or analytical components that perform item de-duplication, extraction of metadata and extraction of data sets from NATO Functional Area Services.
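As an illustration of what a de-duplication component for binary-identical items might look like, the sketch below groups files by a content hash. The notice does not prescribe any algorithm; SHA-256, the function names and the directory-walk approach are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large items never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_binary_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group all files under root by content hash and keep groups with 2+ members."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group lists the locations of one set of binary-identical items, which is the raw input for a duplicate count per item.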

The work should prepare the records and related metadata of permanent value in accordance with NATO directives.

The work should be coordinated with, and carried out in accordance with the direction of, JFCBS Archivists, who should be given remote access to the environment, to the processed data and, if viable, to the workflows/analytical components. JFCBS will validate the results of analytics and data sets, assess progress, and answer RFIs if needed.

The contractor should provide an IT operating environment, with the necessary accreditations and certifications regulated by NATO Security policy, that will support the following:
- data transportation from NATO to contractor facilities and vice versa
- physical facility with required security approvals and accreditation up to and including NS
- storage with capacity to store more than 200TB of data
- computing hardware that can process more than 200TB of data
- an operating environment hosting the tools that enables the following activities:

Extract the content and metadata from the following items:
- virtual disks (vmware)
- file system
- zip archives (zip, tar…)
- SharePoint database – backup files and running files
- FASs databases – to define the records (*please see a mandatory FASs in Table 1)

Extract the metadata from the files (for example properties of an office file like a document classification).

Perform deduplication based on defined rules and outputs.

Visualize the deduplication process output:
- overall summary statistics (for example % of overlaps; number of duplicates; unique file size vs all file size)
- URIs for binary same items; duplicated count; metadata
- URIs for high similarity items in content (for example 70% of similarity score and above); metadata
- folder level (Library) for duplicated items; percent of binary same or similar items; folder metadata

Enable the following actions for the end users:
- remote access to the environment
- setting up the rules to label and show the items lists (for example – select the rows with items/folders with a % overlap above 70% and file types in the list (pdf….))
- applying the rules on the sample versus full set (label the rows according to the rules)
- ability to modify rules and re-execute
- ability to export the decision sheet (pdf) – list of the items that need to be kept, deleted or merged
- executing the decision to “keep”, “merge” or “delete” items
- saving/exporting the items labelled with “keep” and “merge” to a different medium

(*merge – keep a version of the item content and merge metadata from identified duplicated items; keep – keep the items and metadata; delete – delete the items and metadata)

Prepare data sets for export to preservation that consist of:
- original file structure (library structure) or a structure defined by JFCBS Archivists
- content files in “archive” format (configuration provided by archivist)
- metadata files (alongside content files)
- potential conversion to long term preservation format like PDF/A, etc.
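The "similarity score" for items that are similar in content but not binary-identical is not defined in the notice. One common way to obtain such a score for text is Jaccard similarity over word shingles, sketched below purely as an assumption; the function names and the shingle size of 3 are illustrative choices.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Similarity in [0.0, 1.0]: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Under this assumption, a pair of items would be reported as "high similarity" when `jaccard_similarity(a, b) >= 0.70`, matching the 70% example threshold given in the requirements. At the stated scale of more than 200TB, pairwise comparison would not be practical, so a real solution would need an approximate technique on top of this idea.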

3. High level requirements – Option B

3.1 All work to be carried out at NCIA premises, on NCIA CIS.

3.2 The contractor must have an operating environment with necessary accreditations and approvals to handle NS data.

3.3 The contractor must develop workflows or analytical components that perform item de-duplication, extraction of metadata and extraction of data sets from NATO Functional Area Services.

3.4 The work should prepare the records and related metadata of permanent value in accordance with NATO directives.

3.5 The work should be coordinated with, and carried out in accordance with the direction of, JFCBS Archivists, who should be given remote access to the environment, to the processed data and, if viable, to the workflows/analytical components. JFCBS will validate the results of analytics and data sets, assess progress, and answer RFIs if needed.

3.6 Contractor will be provided with access to an IT operating environment, with necessary accreditation, certification and an initial software toolkit. Contractor is permitted to install additional software in accordance with NATO security policy.

3.7 Contractor should have:
- Data transportation from NATO to contractor facilities and vice versa
- Physical facility with required security approvals and accreditation up to and including NS
- Storage with capacity to store more than 200TB of data
- Computing hardware that can process more than 200TB of data

The contractor should execute the following activities:

1. Extract the content and metadata from the following item containers:
- Virtual disks (vmware)
- File system
- Zip archives (zip, tar…)
- SharePoint database – backup files and running files
- FASs databases – NATO support required to define the records (*please see a mandatory FASs in Table 1)

2. Extract the metadata from the files (for example properties of an office file like a document classification)

3. Perform deduplication based on defined rules and outputs

4. Visualize the deduplication process output:
- Overall summary statistics (for example % of overlaps; number of duplicates; unique file size vs all file size)
- URIs for binary same items; duplicated count; metadata
- URIs for high similarity items in content (for example 70% of similarity score and above); metadata
- Folder level (Library) for duplicated items; percent of binary same or similar items; folder metadata

5. Enable the following actions for the end users:
- Remote access to the environment
- Setting up the rules to label and show the items lists (for example – select the rows with items/folders with a % overlap above 70% and file types in the list (pdf….))
- Applying the rules on the sample versus full set (label the rows according to the rules)
- Ability to modify rules and re-execute
- Ability to export the decision sheet (pdf) – list of the items that need to be kept, deleted or merged
- Executing the decision to “keep”, “merge” or “delete” items
- Saving/exporting the items labelled with “keep” and “merge” to a different medium

(*merge – keep a version of the item content and merge metadata from identified duplicated items; keep – keep the items and metadata; delete – delete the items and metadata)

6. Prepare data sets for export to preservation that consist of:
- Original file structure (library structure) or a structure defined by JFCBS Archivists
- Content files in “archive” format (configuration provided by archivist)
- Metadata files (alongside content files)
- Potential conversion to long term preservation format like PDF/A, etc.
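The rule-setting and "merge" actions described above could be modelled roughly as follows. This is a sketch under stated assumptions, not the buyer's design: the `Item` and `Rule` classes, the first-match-wins rule order, the default label of "keep" and the way metadata values are combined are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    uri: str
    file_type: str
    overlap: float  # similarity to the closest matching item, 0.0 to 1.0
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Rule:
    label: str  # "keep", "merge" or "delete"
    min_overlap: float
    file_types: frozenset[str]

    def matches(self, item: Item) -> bool:
        return item.overlap >= self.min_overlap and item.file_type in self.file_types


def label_items(items: list[Item], rules: list[Rule], default: str = "keep") -> dict[str, str]:
    """Apply the rules in order; the first matching rule decides an item's label."""
    return {
        item.uri: next((rule.label for rule in rules if rule.matches(item)), default)
        for item in items
    }


def merge_metadata(kept: Item, duplicates: list[Item]) -> dict[str, list[str]]:
    """Collect every distinct metadata value from the kept item and its duplicates."""
    merged: dict[str, list[str]] = {}
    for source in [kept, *duplicates]:
        for key, value in source.metadata.items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged
```

For instance, `Rule("merge", 0.70, frozenset({"pdf"}))` would label every PDF with an overlap of 70% or more as "merge". Because the rules are plain data, they can be modified and re-executed on a sample before being applied to the full set.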

The buyer is happy to talk to
manufacturers, wholesalers, distributors, agents, consultants

The deadline to apply for this opportunity has passed.