Corpus Studio Web
Summary
CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists.
Background
CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists. It does so by supporting researchers in writing queries that operate on syntactically parsed text corpora in a number of major XML formats. Queries that belong together are kept in XML documents called ‘Corpus Research Projects’ (CRPs). These documents contain the queries, the order in which they are to be executed, meta-information about the queries and the project as a whole, and a specification of the input used for the project. The use of CRPs helps improve the replicability of corpus research.
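The idea of a CRP (an ordered set of named queries, project metadata and an input specification kept together) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ResearchProject` and `Query` classes, the toy corpus and its element names are hypothetical and do not reflect the actual CRP, FoLiA or Psdx schemas, and Python's limited ElementTree XPath subset stands in for the XQuery that CorpusStudio itself uses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Toy parsed corpus. Element and attribute names are invented for this
# sketch; they are NOT the FoLiA or Psdx schema.
TOY_CORPUS = """
<corpus>
  <sentence id="s1">
    <node cat="IP">
      <node cat="NP" func="SBJ"><word>The</word><word>king</word></node>
      <node cat="VP"><word>rode</word></node>
    </node>
  </sentence>
  <sentence id="s2">
    <node cat="IP">
      <node cat="VP"><word>Come</word></node>
    </node>
  </sentence>
</corpus>
"""


@dataclass
class Query:
    """A named query; the XPath subset here stands in for XQuery."""
    name: str
    xpath: str


@dataclass
class ResearchProject:
    """Hypothetical in-memory stand-in for a CRP."""
    title: str                      # project-level meta-information
    input_xml: str                  # specification of the input
    queries: list = field(default_factory=list)  # kept in execution order

    def run(self):
        """Run every query in order; return sentence ids that match."""
        root = ET.fromstring(self.input_xml)
        results = {}
        for query in self.queries:
            hits = []
            for sentence in root.iter("sentence"):
                if sentence.find(query.xpath) is not None:
                    hits.append(sentence.get("id"))
            results[query.name] = hits
        return results


project = ResearchProject(
    title="Subject presence (toy example)",
    input_xml=TOY_CORPUS,
    queries=[
        Query("clauses", ".//node[@cat='IP']"),
        Query("clauses_with_subject", ".//node[@cat='IP']/node[@func='SBJ']"),
    ],
)
results = project.run()
print(results)
```

Because the queries, their order and the input description travel together in one object, rerunning `project.run()` reproduces the same result, which is the replicability argument made for CRPs.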
Any CLARIN-NL user can access the CorpusStudio web application and make use of the 'standard' corpora. New users must provide a login name and password, after which they can make use of the application.
The CorpusStudio code is open source. Users can take the code, adapt it and use it for their own purposes. Users can also take the code from GitHub as is and set up their own server to run the application on their own text corpora. User documentation and an API are available (see below). The current version of CorpusStudio supports XML text corpora in the FoLiA and Psdx formats. Extensions to other XML formats are possible.
- Keep all important aspects of a research project in one file
- Define one or more search queries in a hierarchy
- Use the W3C-developed XQuery and XPath languages
- Integrated CorpusStudio-specific XQuery functions
- User-definable functions and variables
- Create corpus result databases with user-definable features accompanying each hit
- Divide the output into calculable categories
- Divide the results into meta-data-dependent groups
- Parallel processing yields a speed-up by a factor of 20-100 compared to the Windows version
- Compatibility with the Windows programs "Cesax" and "CorpusStudio"
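Two of the features above, dividing the output into categories and dividing the results into metadata-dependent groups, amount to a cross-tabulation of hits. The following is a minimal sketch of that idea only; the hit records, the `period` metadata field, the category labels and the `tabulate` helper are all hypothetical and are not taken from CorpusStudio.

```python
from collections import Counter, defaultdict

# Hypothetical query hits: each carries a category assigned by a query
# and a metadata value describing the text it came from.
hits = [
    {"text_id": "t1", "period": "early", "category": "subject-first"},
    {"text_id": "t1", "period": "early", "category": "verb-first"},
    {"text_id": "t2", "period": "early", "category": "subject-first"},
    {"text_id": "t3", "period": "late", "category": "subject-first"},
    {"text_id": "t3", "period": "late", "category": "subject-first"},
]


def tabulate(hits, group_key, category_key):
    """Count hits per category within each metadata-dependent group."""
    table = defaultdict(Counter)
    for hit in hits:
        table[hit[group_key]][hit[category_key]] += 1
    # Convert to plain dicts so the result is easy to print or serialise.
    return {group: dict(counts) for group, counts in table.items()}


table = tabulate(hits, group_key="period", category_key="category")
for group, counts in table.items():
    print(group, counts)
```

Keeping the grouping key as a parameter means the same hits can be regrouped by any other metadata field without rerunning the queries.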
Limitations and future developments
Current limitations of the program include: working with result databases, a restricted login system, no document view, grouping restricted to system-defined groups, and no query or project wizard. Although the CLARIN-NL project ended in December 2015, every effort will be made to add a number of essential features.
Contacts
- Project leader: Erwin Komen
- CLARIN center:
- Help contact: E.Komen@Let.ru.nl
- User scenarios (screencasts, screenshots): n.a.
- Komen, Erwin R. 2015. "Corpus Studio: the web application". User documentation. version 1.7. Meertens Instituut, Amsterdam. http://hdl.handle.net/21.11114/COLL-0000-000B-C289-F
- Komen, Erwin R. 2016. "An API for the CorpusStudio web application". version 1.3. Meertens Instituut, Amsterdam. http://hdl.handle.net/21.11114/COLL-0000-000B-C288-0
- Tool/Service link: http://hdl.handle.net/21.11114/COLL-0000-000B-C287-1
- Komen, Erwin R. 2011. Coreferenced corpora for information structure research. In Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. (Studies in Variation, Contacts and Change in English 10) Jukka Tyrkkö, Terttu Nevalainen, Matti Rissanen & Matti Kilpiö (eds). Helsinki, Finland: Research Unit for Variation, Contacts, and Change in English.
- Komen, Erwin R. 2013. Finding focus: a study of the historical development of focus in English. Utrecht: LOT.
- Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian academy of sciences.