CLARIN NL FAQ

Welcome to the CLARIN-NL Helpdesk FAQ. In this section you can find previously asked questions on CLARIN topics. The information is presented in chapters that reflect the division into CLARIN topics made elsewhere. If the answer to your question is not here, please send an e-mail to the CLARIN-NL helpdesk.

Metadata

Definitions

Question Answer
What is metadata? The definition of metadata varies in different scientific traditions. Some adopt a broader definition of metadata (all data about data, including annotations). Others make a distinction between metadata and annotations. CLARIN adopts the latter viewpoint and defines metadata as data about data: information describing properties of linguistic resources. Think of the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.
What are annotations? Annotations are data about an inner part of a resource (or subresource) that is not a resource itself.
What is a metadata scheme? A metadata scheme is a fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.
What is metadata harvesting? Metadata harvesting is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at http://catalog.clarin.eu/ (click on OLAC data providers).
What is CMDI? Descriptive metadata is used to characterize data resources and tools in order to facilitate discovery and management in large (virtual) infrastructures and repositories, i.e. it makes resources visible to everyone. CMDI (Component MetaData Infrastructure) is the CLARIN Component Metadata Framework. The need for a component-based metadata framework has been established in studies and discussions on metadata in the European CLARIN preparatory project. A first version of CMDI has been defined. A project (executed by representatives of the targeted Dutch CLARIN Centres) is about to start that will carry out the first experiments with CMDI on real data, in particular data that have been found to be troublesome for the IMDI framework. If these experiments are successful, a stable and tested version of CMDI can be released for use by others and supporting tools can be developed.
What is IMDI? The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with the help of specific tools.

Questions about metadata that involve tools

Question Answer
How can I update my metadata in the Virtual Language Observatory? See this link
If anybody can create these metadata components, how can you still search through the resulting metadata descriptions? There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text archive", while someone else might be searching for a "text corpus". Or think of all the variants that people can use for one and the same country: the Netherlands, Netherlands, Holland, etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). We call them data categories. Smart software will later on be able to "see" that a user who searches for nouns might also be interested in substantives.
So where are the data categories stored then? In a data category registry - a server that can be reached via the internet, both by human users and computer programs.
Does such a data category registry already exist? Yes it does - you can have a look at it via this link
Are there already CLARIN-suggested data categories in it? Yes, go to the site mentioned above and select 'clarin' as Tenant in the selection menu on the left.
Can I edit components/profiles in the public space in order to update them in the Component Registry? No, profiles/components in the public space of the Component Registry cannot be modified, since this would invalidate all previous metadata made with the "old" version.
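The idea behind data categories can be illustrated with a small sketch. It is not the registry's actual interface: the identifiers in `DATA_CATEGORIES` and the record fields are hypothetical and only show how two different labels can be matched once both point to the same concept.

```python
# Hypothetical mapping from free-text labels to data category identifiers.
# A real data category registry assigns its own identifiers.
DATA_CATEGORIES = {
    "noun": "DC-partOfSpeech-noun",
    "substantive": "DC-partOfSpeech-noun",
    "the netherlands": "DC-country-NL",
    "netherlands": "DC-country-NL",
    "holland": "DC-country-NL",
}


def category_of(term):
    """Return the data category for a label, or None if it is unknown."""
    return DATA_CATEGORIES.get(term.strip().lower())


def matches(value, query):
    """Compare by data category when both are known, else by plain text."""
    value_cat, query_cat = category_of(value), category_of(query)
    if value_cat is not None and query_cat is not None:
        return value_cat == query_cat
    return value.strip().lower() == query.strip().lower()


def search(records, field, query):
    """Return the records whose field refers to the same concept as query."""
    return [r for r in records if matches(r.get(field, ""), query)]


records = [
    {"title": "Corpus A", "pos": "substantive"},
    {"title": "Corpus B", "pos": "verb"},
    {"title": "Corpus C", "pos": "Noun"},
]
hits = search(records, "pos", "noun")
print([r["title"] for r in hits])
```

A search for "noun" finds the record labelled "substantive" as well, because both labels resolve to the same (hypothetical) data category.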
Since components and profiles in the public space of the Component Registry cannot be modified, does this mean I have to make something new entirely? In most cases it is possible to reuse existing subcomponents that do not need to be modified. It is possible to edit existing profiles/components (right-click: "Edit Item"), resulting in the appearance of a copy of the profile or component in the workspace of the Component Registry. Here, you can make the changes that are required (and reuse what does not require change) and eventually publish it as a new profile/component.
Is it possible to change the order of components or elements when I am editing profiles or components in the Component Registry? Yes, use the up and down arrows to change the order.
In the Component Registry, when I manually add components and elements, I see them as a flat list, whereas adding them through "drag-n-drop" results in a collapsible list. What are the differences between the two?
  • "Drag-n-dropping" components to a profile or component in the Component Registry results in adding this component as a reference. Components that appear in the lower part of the editing screen in the Component Registry are reusable.
  • Adding components manually results in an "in-line" display in the profile or component that you are working on. They remain "local" and will NOT be added as a reusable component to the Component Registry.
Why is there a difference between the components listed in the Component Registry and those accessed through this link? The list behind this link was produced with the original metadata toolkit, which was used initially to create profiles and components. The original toolkit was meant to get people started with CMDI and has many shortcomings. These shortcomings are addressed in the Component Registry (e.g. long-term storage, easy browsing and editing, web services for use in other applications). Because the metadata toolkit is no longer actively maintained, some differences will occur. In short, please use the Component Registry.
I made a component in the Component Registry and saved it in my workspace. When I edit this component afterward and save it, my original component is not modified. Instead, a new version appears in my workspace. Is this intentional behavior? This has been fixed: there are now two save buttons, "Save", which overwrites the original if possible, and "Save as new", which always creates a new instance.
If components/profiles in my workspace of the Component Registry are not overwritten when I edit them, how do I prevent a huge list of copies forming? You can overwrite existing components/profiles using the "Save" button. You can also delete a copy from the workspace by right-clicking it and selecting the "Delete" option.
How do I make a new component/profile in the Component Registry after adding some other change? You can do this by pressing the "Clear Changes"-button.
Do I need to publish all components that I made prior to the profile they are referenced from? Yes, in order to publish a profile, you have to publish its subcomponents that you made. This is also true for publishing components with subcomponents.
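The publishing rule above amounts to "subcomponents first". The sketch below is only an illustration of that rule, not code from the Component Registry: the component names and the `references` mapping are hypothetical.

```python
def publish_order(item, references, published):
    """Return the order in which unpublished items must be published.

    references maps an item name to the names of the subcomponents it
    refers to; published is the set of names that are already public.
    Subcomponents always come before the item that references them.
    """
    order = []
    seen = set(published)

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for child in references.get(name, []):
            visit(child)
        order.append(name)

    visit(item)
    return order


# Hypothetical workspace: a profile referencing two components,
# one of which references an already published component.
references = {
    "MyProfile": ["Actor", "Recording"],
    "Actor": ["Language"],
}
print(publish_order("MyProfile", references, published={"Language"}))
```

The profile itself always ends up last in the list, after every component it depends on.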
I want to publish a component but I get the error: Failed to register: - referenced component cannot be found in the published components: ComponentName (clarin.eu:referencenumber) Did I forget something? Maybe you forgot to update your references to published versions of the component? This is indicated by a red border around the component. You will have to replace that component with the published one. Update: recently, a new version of the Component Registry was published in which versions of components no longer change upon publication. This error should therefore not occur anymore.
I finished editing my profile in the Component Registry and want to make metadata with it in Arbil. Do I need to publish it in order to do that? Not necessarily. When publishing a profile you want to be sure that it is 100% perfect. Of course, one needs to be able to test a profile while editing it. The following option is available for that purpose:
  • In the "Browse..." panel of the Component Registry, right-click the profile and select "Download as XSD..." (this action will include all components in that profile). Alternatively, you can right-click and select "Show info", which will show the accessible URL of the XSD.
  • In Arbil, choose "Options", "Templates & Profiles"
  • In Templates & Profiles, choose "Add File", and select the downloaded profile xsd
  • Close Templates & Profiles and right-click the local corpus, choose "Add", "yourprofile.xsd"
  • Now you are ready to make metadata with it
In Arbil, what is the meaning of the two different symbols (the "diamond" and the "tray") in the 'Add'-menu?
  • The diamond icon indicates that you will add a node to the tree. This node is usually a collection of fields that can be added multiple times (think of an "actor" that speaks multiple "languages", each having the subfields "Id", "Name", "MotherTongue", "PrimaryLanguage", and "Description").
  • The tray icon indicates that you will add a field to a node (a second "Description" field for example).

Questions about metadata harvesting

Question Answer
How can I provide CMDI metadata over OAI-PMH? See this link
Do you have more information about this harvesting process? Yes - have a look here
We have metadata for our resource. What do we need to do to make it harvestable? These procedures are still under development but for now, all you need to do is provide this data to a CLARIN centre. It is their responsibility to process this data and make it harvestable.
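For readers who want to see what a harvesting request looks like, here is a minimal sketch of the OAI-PMH side. The verb and argument names (`ListRecords`, `metadataPrefix`, `resumptionToken`) come from the OAI-PMH protocol; the endpoint URL, the `cmdi` prefix value and the sample response are assumptions made for illustration. A repository lists the prefixes it actually supports via the `ListMetadataFormats` verb.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # When continuing a list, resumptionToken is the only argument
        # allowed besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(response_xml):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(response_xml)
    identifiers = [el.text for el in root.iter(f"{{{OAI_NS}}}identifier")]
    token_el = root.find(f".//{{{OAI_NS}}}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hand-written sample response, trimmed to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical endpoint; no network request is made in this sketch.
print(list_records_url("https://repository.example.org/oai", "cmdi"))
print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would fetch the first URL, store the records, and keep requesting with the returned resumption token until none is left.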

Other questions about metadata

Question Answer
What metadata schemes are there for the description of linguistic resources? Quite a few. Examples are Dublin Core, OLAC (an enriched version of Dublin Core), IMDI, the TEI header, ...
So what metadata scheme is being used within CLARIN? Good question. In fact there is no such thing as a single CLARIN metadata scheme. Practice has shown that using one particular scheme for a large community (e.g. the humanities) often results in a mismatch between the chosen elements and the needs of the user.
If there is no single metadata scheme, how should I describe my resources to be compatible with the CLARIN infrastructure? CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you still can use common components, e.g. a component to describe a sound recording. In case that still does not address your needs, it is even possible to create components yourself.
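The component-based approach can be pictured as nesting small, reusable building blocks. The sketch below is a simplified model and not the CMDI schema itself; the component and element names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A named set of metadata elements that may contain other components."""
    name: str
    elements: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def element_paths(self, prefix=""):
        """List every element reachable from this component as a path."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        paths = [f"{path}/{element}" for element in self.elements]
        for child in self.components:
            paths.extend(child.element_paths(path))
        return paths


# Two reusable components (hypothetical names) ...
actor = Component("Actor", ["Name", "Language"])
recording = Component("SoundRecording", ["Duration", "Format"])

# ... combined into a self-defined profile.
profile = Component("InterviewProfile", ["Title"], [actor, recording])

for path in profile.element_paths():
    print(path)
```

The same `SoundRecording` component could be reused unchanged in another profile, which is the point of sharing components.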
So where do I find more information about creating components, profiles and using profiles created by others? here
Where can I find all details, references and the background for this component-based metadata concept? Check out the CLARIN specification document here
But if I deliver CLARIN metadata today, can you make my metadata available to the broad public? Not yet - bringing all metadata descriptions together ("harvesting"), making them searchable ("indexing") and citeable ("creating persistent identifiers") is an important part of the infrastructure that CLARIN is building. As with all infrastructure these things require a solid base to build on. That base is currently being constructed, but this also means that there is currently no simple-to-use method for accessing the new CLARIN metadata.
So what do I do if I would like to make my metadata available to the public right away? Until the infrastructure for the component-based metadata is fully in place you can use OLAC or IMDI. Data described in either format will be made available and searchable via the CLARIN catalog and the Virtual Language Observatory. Apart from that, we will ensure that these metadata descriptions are converted to CLARIN component-based metadata.
OK, my IMDI or OLAC metadata (describing linguistic resources) is ready, how to proceed now? Send an e-mail to Lari Lampen and we will incorporate your metadata as soon as possible.
Where can I obtain more introductory information on CMDI? In the CMDI Short Guide. On the main page of this helpdesk portal, more links to information can be found.
How do I make CLARIN-compliant Metadata? As long as CMDI has not been released officially, you should make metadata for resources in your project in accordance with IMDI. You will be assisted by an IS-specialist who is an expert in the area of metadata. It will be guaranteed that IMDI descriptions of metadata can be automatically converted into CMDI descriptions. If CMDI has been officially released before or at the very beginning of your project, you will have to make a description in accordance with CMDI. CLARIN-NL will provide ample educational and training opportunities to get oneself acquainted with CMDI and tools that support it.
We want to make metadata for a web application. Can you suggest a suitable CMDI profile for such a task? Currently, such a profile is being developed by the Adept project, based on the application form for web applications for CLARIN.
The terms "metadata element" and "metadata attribute" are both used in background literature. What are their differences? None (apart from terminology); they mean exactly the same thing in this context.
What is the best way to include time-coded metadata (e.g. metadata coming from ASR) in the OAI framework? CMDI assumes that the metadata refers to the entire resource, meaning the actual recording. Descriptions of events and periods of time are regarded as annotations. This differs from MPEG7 metadata, where metadata, annotations and recording intermingle. It is of course possible in CMDI to have metadata records referring to specific parts of a recording (e.g. when several interviews are bundled in one media-file).
My CMDI metadata does not validate anymore. What could be wrong? We made a change (on 14 March 2011) in the XSDs that are generated from the CMDI profiles. The XSDs are responsible for validating your metadata, and since they have changed, your metadata no longer conforms to the XSD schema. You can make the following changes to your metadata headers to resolve this. Should become:

ISOcat and OpenSKOS

Question Answer
What is ISOcat? ISOcat was a web-based implementation to store and make accessible concepts (a concept registry), more specifically data categories, that are relevant for the CLARIN infrastructure and for encoding linguistic phenomena. ISOcat has been succeeded by OpenSKOS.
I have a question about ISOcat. Do you have FAQs on this topic? Yes, we had an ISOcat section in our FAQs. You can find the questions (and answers) here (pdf).

PIDs

Question Answer
What is a persistent identifier and what should I do for it? A persistent identifier (PID) is a stable (persistent) and unique reference (identifier) to identify a resource, in the case of CLARIN a digital language resource. A well-known example of PIDs outside of CLARIN is formed by ISBN numbers, which are persistent identifiers for books. PIDs for resources are surely needed for tools, applications and services running on the CLARIN infrastructure to provide unique identifiers for resources but they can be useful for humans as well.
Can the title of a resource not serve as its PID? No. A title is probably persistent, but it is not necessarily unique and has other disadvantages. There are cases where two different resources happen to have the same title. More importantly, titles tend to be long and redundant for humans ("Corpus Gesproken Nederlands"), so that humans start using abbreviated forms ("CGN"), and they are language-dependent, so translations are often used as well ("Spoken Dutch Corpus").
Can the URL of a resource not serve as its PID? No, URLs avoid some of the disadvantages of titles, but they tend to be not so persistent (web sites often change and the related URLs change as well or disappear completely). Humans can cope with missing references, computers cannot.
Where do I get a PID for my resource? Later this year, and at the latest by the start of your project, CLARIN-NL will point out a URL and a programming interface where you can get a PID for your resource via a Persistent Identifier Service.
What do I have to do to obtain a PID for my resource? Make a request using the Persistent Identifier Service that CLARIN-NL will provide later this year. In this request you will be asked to provide some minimal information about your resource, such as a small subset of the metadata which you have to provide anyway in the context of your project. The exact nature of this minimal set of resource metadata will be made known at the latest by the start of your project.
How much effort must I plan in my project for obtaining a PID for my resource? It depends a little bit on the nature of your resource, but in general the effort involved will be minimal, typically 1 person day per resource. In general there should be a proper repository system with a software component that requests PIDs automatically when new resources are uploaded.
If I have a PID, what can I do with it? You can use it in programs to uniquely refer to your resource, and the organization that provides a Persistent Identifier Service will make available functionality so that you can click on it in a web browser or another context and it will lead you directly to the resource metadata. However, in most cases, you will identify the resource's metadata in other ways (by searching, querying or browsing in metadata overviews), and the CLARIN infrastructure will use the PID (behind the scenes) to get from the resource's metadata to the resource itself.
Where can I obtain more introductory and technical information on PIDs? Here
What is a persistent identifier (PID)? Persistent identifiers are increasingly seen as a core component for the many references we are creating at various levels. These range from references between metadata descriptions and their resources to references between semantic assertions made using RDF (Resource Description Framework). For more information please read the requirements specification document or the short guide.
Why do I need PIDs? In the emerging cyberinfrastructure we are creating more and more references between resources, resource fragments and services. The creation of these references is very costly and often is essential for the interpretation of a resource. Therefore we need proper mechanisms to ensure that these references survive despite all the changes that happen in repositories for example. It is known that URLs are not appropriate - they are not persistent even when we believe that they are proper URIs. Therefore special PIDs come into place which identify an object and which are maintained by reliable institutions.
How does it function? Handling PIDs is very simple. First you need to register a PID for a resource or service. You do this by providing the required information to the PID service site, in particular the path to access the resource, such as a URL. You receive a PID in return, which you can enter into the metadata description, for example, so that everyone can use it for referencing. When a user finds such a PID in a resource, he/she can click on this reference and the service will resolve the PID and give access to (one of the copies of) the resource. Normally, as a user you don't see the intermediate transactions.
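The register-and-resolve cycle can be sketched with a toy, in-memory stand-in for a PID service. This is an illustration of the principle only: the class, the prefix `12345` and the URLs are hypothetical, and a real service such as one based on the Handle System has its own interface.

```python
import uuid


class ToyPidService:
    """In-memory stand-in for a PID service (illustration only)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._locations = {}

    def register(self, url):
        """Issue a new PID that points to the given location."""
        pid = f"{self.prefix}/{uuid.uuid4()}"
        self._locations[pid] = url
        return pid

    def update(self, pid, new_url):
        """Point an existing PID to a new location; the PID stays the same."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        """Return the current location of the resource behind a PID."""
        return self._locations[pid]


service = ToyPidService(prefix="12345")  # hypothetical prefix
pid = service.register("https://old-server.example.org/corpus.zip")

# The repository moves the file; references that use the PID keep working.
service.update(pid, "https://new-server.example.org/data/corpus.zip")
print(pid, "->", service.resolve(pid))
```

The metadata only ever stores the PID, so a change of location requires one update at the service rather than a fix in every reference.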
What if the PID service is down? If the PIDs cannot be resolved at a certain moment one simply cannot access a resource. Think of a situation where hundreds of users are waiting on a resolution of a PID and nothing happens - a nightmare for any cyberinfrastructure scenario! Since this would not be acceptable, we need to make sure that the PID service is based (a) on a very robust and reliable software offering sufficient functionality, (b) on a proper service based on redundant centres with a high availability and persistency guarantee.
Is CLARIN offering such service? CLARIN has an arrangement with the EPIC consortium that CLARIN members will be able to register PIDs and of course resolve them. This consortium groups a number of reliable European service providers that want to participate in providing a redundant service for the research world, i.e. we are speaking about millions of PIDs and a service at very low costs. The service is based on the Handle System which according to our investigations is the only robust system meeting all requirements. No one is obliged to register Handles, but of course CLARIN centres will need to demonstrate that their PIDs can be resolved in a robust manner and offer the required functionality.
Why not PURLs, URNs or DOIs? PIDs are, as said, unique and persistent identifiers of objects that are made available by proper repositories. For many resources there are additional characteristics such as multiple copies for preservation reasons, a string (such as MD5) that can be used to check authenticity, simple metadata for citation purposes, a reference to the access permission record etc. A proper PID system should offer such information immediately when resolving a PID. PURLs cannot offer this functionality, and for URNs we do not know of a well-proven and robust resolver, although the big libraries agreed on using URNs for their publications. DOIs are also based on the proven Handle System and it is certainly a proper service, which is used in particular by the big publishing companies. However, DOI also comes with a business model that will not be acceptable for many research organizations.
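The MD5 string mentioned above works as a fingerprint of the file contents. A minimal sketch, assuming the checksum has been obtained from the PID record by some means that is not shown here:

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the MD5 checksum of the data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def is_unchanged(data: bytes, recorded_checksum: str) -> bool:
    """Check a retrieved copy against the checksum stored with the PID."""
    return md5_of(data) == recorded_checksum.strip().lower()


original = b"word\ttag\nthe\tDET\n"
recorded = md5_of(original)  # stored when the resource was registered

print(is_unchanged(original, recorded))
print(is_unchanged(original + b"extra line\n", recorded))
```

Note that MD5 is adequate for detecting accidental corruption of a copy, but it is not considered secure against deliberate tampering.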
When should I use a part identifier for a PID? (Answer taken from the ISO citer draft, p. 11) This International Standard supports different levels of granularity. The following recommendations are designed to encourage efficiency and promote interoperability with other naming schemes:
  1. If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity should be retained, which is to say that no new PIDs should be issued without very good reasons, such as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID of the book.
  2. If the resource is associated with the complete content of a digital file, an individual PID should probably be assigned for this resource.
  3. If the resource is autonomous and exists outside a larger context, an individual PID should probably be assigned for this resource.
  4. If a resource should be citable apart from any containing resource, an individual PID should probably be assigned for this resource. These recommendations are, however, subject to the needs of resource creators with respect to the level of granularity they deem suitable to the specific resource environment.

Webservices

Question Answer
What are web services, and how are they relevant to my project? Web services are programs that can be called from other programs that reside somewhere on the World Wide Web. They differ from other programs because (1) they must communicate with other programs, and (2) they must do so over the World Wide Web, which requires special protocols (SOAP is one important example of such a protocol). It is expected that most web applications that consist of two clearly separated parts, viz. a web-based user interface part and a core functionality part with a well-defined API, can easily be turned into web services using a generic wrapper. If your application includes software tools and services, you should contact the infrastructure team as early as possible to discuss how your software can best be made available to other users within the CLARIN infrastructure. Web services are important because a lot of the functionality that will be offered in the CLARIN infrastructure will be in the form of web services. This will make it possible to set up workflows of interacting programs, e.g. a pipeline of actions that have to be carried out in sequence (such as text cleaning, text normalization, tokenization, PoS tagging, lexicalization, Named Entity Recognition and full parsing web services applied to a text corpus).
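The workflow idea can be sketched as a chain in which the output of one step is the input of the next. In the sketch below, three simple local functions stand in for remote web services; in a real workflow each step would be a call over the network, and the step implementations here are deliberately naive.

```python
def clean(text):
    """Stand-in for a text cleaning service: collapse whitespace."""
    return " ".join(text.split())


def normalize(text):
    """Stand-in for a normalization service: lowercase the text."""
    return text.lower()


def tokenize(text):
    """Stand-in for a tokenization service: split on spaces."""
    return text.split(" ") if text else []


def run_pipeline(steps, data):
    """Feed the output of each step into the next one, in sequence."""
    for step in steps:
        data = step(data)
    return data


workflow = [clean, normalize, tokenize]
print(run_pipeline(workflow, "  The   Spoken Dutch  Corpus "))
```

Because every step has a well-defined input and output, steps can be replaced or reordered without changing the rest of the chain.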
What is a web service? Services are the on-line equivalent of tools. The difference between a tool and a service is that a tool needs to be run locally (where the data is), while a service runs remotely. When using a web service the input data and the program that does the processing can reside on different machines. The data is transferred via protocols to the remote server, and the output results are transferred back when the processing is complete.
How can I add new web services to CLARIN? If you are a member, you can add tools by filling in this form. You need to specify the type "Web service".
What is CLAM and where can I find more information? CLAM (Computational Linguistics Application Mediator) allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice, with which both human end-users as well as automated clients can interact. CLAM takes a description of your system and wraps itself around the system, allowing end-users or automated clients to upload input files to your application, start your application with specific parameters of their choice, and download and view the output of the application once it is completed. More information on CLAM can be found here.

Formats and standards

Question Answer
Why do we need standards in CLARIN? CLARIN does not create linguistic resources; its purpose is to offer rapid access to the existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual use, interoperability, and therefore the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is being able to connect different resources to the same tool. This can be done using standards, but this would imply having all the resources standardized (an ideal situation that cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format (resources are converted to the standard before they are used).
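The pivot format idea can be shown with a small sketch. Both input formats and the pivot representation (a list of token/tag pairs) are invented for the example; the point is that the tool is written once, against the pivot, and each resource format only needs one converter.

```python
def from_tsv(text):
    """Convert 'token<TAB>tag' lines to the pivot: a list of (token, tag)."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line]


def from_slashed(text):
    """Convert 'token/tag token/tag' text to the same pivot representation."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


# One converter per resource format, all targeting the same pivot.
TO_PIVOT = {"tsv": from_tsv, "slashed": from_slashed}


def count_tags(pivot):
    """The 'tool': it only understands the pivot representation."""
    counts = {}
    for _token, tag in pivot:
        counts[tag] = counts.get(tag, 0) + 1
    return counts


def run_tool(text, fmt):
    """Convert a resource to the pivot format, then apply the tool."""
    return count_tags(TO_PIVOT[fmt](text))


print(run_tool("the\tDET\ncat\tNOUN\ndog\tNOUN", "tsv"))
print(run_tool("the/DET cat/NOUN dog/NOUN", "slashed"))
```

Supporting a further format means adding one entry to `TO_PIVOT`; the tool itself stays untouched.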
What standards are recommended by CLARIN? More information on this can be found here
Is there a standardization action plan in CLARIN? CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. CLARIN as an infrastructure project has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. When necessary, CLARIN may take the lead in initiating new standardisation activities when a clear gap in coverage is identified. For more information see the CLARIN Standardization Action Plan.
I am a linguist. Do I need to have a working knowledge about all these standards? No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.
My research involves aspects which have not yet been standardized. Can I still make use of CLARIN technology? Very good question. A researcher will always face this issue. Research moves a field forward, and in no-man's-land there are no standards (yet): standards lag behind research. Industry always stays on firm ground, that is, on well-established conventions. Although it may appear that research cannot make use of standards, it should base itself on well-established foundations, which should be expressed in standardized form whenever possible. Only for the tip of the arrow, the really fresh things that have just been invented, should the researcher look for his or her own ad hoc conventions. Applied to linguistic data, this means that in an annotated corpus, for example, one will find a mixture of standard and invented markings. CLARIN can be used for that part of the processing that involves existing tools and resources that have been converted to a standard format.

AAI

Question Answer
Where can I find information about the prototype CLARIN "Service Provider Federation"? More information can be found here.

CLARIN-compatible

Question Answer
What does "CLARIN-compatible" mean? See this document (pdf)