FAQ – Arkkiivi

Frequently asked questions

What types of files can be imported into Arkkiivi?

Arkkiivi accepts files in the following formats: png, tif, tiff, pdf, xml and txt. Metadata identification supports typed text only.
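As a convenience, a batch of files can be checked against this list before uploading. The following is a minimal sketch, not part of Arkkiivi: the function name `split_by_support` is hypothetical, and the check looks only at the file extension, not at the actual file contents.

```python
from pathlib import Path

# File extensions listed in the answer above (compared case-insensitively).
SUPPORTED_EXTENSIONS = {".png", ".tif", ".tiff", ".pdf", ".xml", ".txt"}


def split_by_support(paths):
    """Return (supported, unsupported) lists of paths, judged by extension only."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    # The file names below are made-up examples.
    ok, rejected = split_by_support(["letter.TIF", "minutes.pdf", "photo.jpg"])
    print("importable:", [p.name for p in ok])
    print("convert first:", [p.name for p in rejected])
```

Files that end up in the second list would need to be converted to one of the supported formats first.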

In what format can the results be extracted from Arkkiivi?

The results can be downloaded in CSV format.

I cannot test Arkkiivi, as I keep getting security warnings. How should I proceed?

The current test environment still runs over an unencrypted HTTP connection. The security policies of some organisations may therefore block the data transfer. We aim to move the test environment behind a secure HTTPS connection during the project.

Are the files uploaded to Arkkiivi safe?

No. Anyone with access to the server can view the files. However, Arkkiivi does not store the uploaded files anywhere; it deletes them after the run.

Are there security risks in Arkkiivi?

Yes. The Arkkiivi development environment is a demo environment that shows what can be done with the artificial intelligence building blocks behind Arkkiivi.

Will the source code for Arkkiivi or its individual components be shared?

The trained models and their source code can be found on GitHub.

The component does not seem to work as described. Am I doing something wrong?

The models underlying the components were trained on specific training material, so the components may well perform worse on the material you are using. To analyse the text content of the material, the components use Tesseract, an application intended for recognising typed text. Handwritten text is therefore usually misidentified, and errors can also occur with typed text. As a result, the components that identify metadata from text do not work with handwritten material. By contrast, the component that detects blank pages or scanning errors also works with handwritten material and photos. For born-digital material, the limiting factor is the precision of the NER (named entity recognition) and keywording components.
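To get a rough idea of how well Tesseract reads your own material, you can run it locally on a sample page. The sketch below assumes the Tesseract command-line program is installed and on your PATH. The helper names and the default language code are illustrative assumptions, and how Arkkiivi itself calls Tesseract is not described here, so local results may differ from those of the components.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="eng"):
    """Build the command line: tesseract <image> stdout -l <language>."""
    # "stdout" as the output base makes Tesseract print the text instead of
    # writing it to a file. The language code is an assumption; use the one
    # that matches your material and your installed language data.
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_text(image_path, language="eng"):
    """Run the locally installed Tesseract CLI and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract executable was not found on PATH.")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `ocr_text("sample_page.png")`, where the file name is a placeholder for one of your own scanned pages, returns the recognised text. If the output is mostly unreadable, the metadata components are unlikely to give good results for that material.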

Are the models behind the components available, and if so, how can I train them further with my own material?

The models will be published on GitHub and can be freely installed from there. If your organisation lacks the necessary skills, help can be obtained from companies in the industry.

I get strange keywords or ones that describe the text content poorly. How should I proceed?

The results can be edited afterwards in the CSV file.
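The file can be edited in any spreadsheet program, or with a short script when many rows need changing. The following is a minimal sketch using Python's standard `csv` module. The column names `file` and `keywords` and the `;` separator between keywords are assumptions, since the layout of the results file is not described here; check the header row of your own download and adjust them.

```python
import csv
import io


def replace_keywords(csv_text, file_name, new_keywords,
                     file_column="file", keyword_column="keywords"):
    """Return csv_text with the keyword cell of one row replaced.

    The default column names and the ';' keyword separator are assumptions,
    not the documented layout of Arkkiivi's results file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the row for the named file is changed; all others are copied as-is.
        if row[file_column] == file_name:
            row[keyword_column] = ";".join(new_keywords)
        writer.writerow(row)
    return out.getvalue()
```

To apply this to a downloaded file, read its contents into a string, pass it to `replace_keywords`, and write the returned text to a new file so that the original download is kept.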