Advanced tasks

More advanced tasks with Corpus Presenter

Advanced search level

This level is reached by selecting the option Advanced search level in the Search menu (shortcut: Ctrl-D). The screen changes and you are presented with a new set of options at the top and a text area in the centre of the screen.

The set of functions offered here allows you to locate virtually any string or strings in any texts of a corpus. For this to work properly certain items of information and certain parameter settings are required. The most important of course is the search string itself, or strings if you choose to carry out a double string search. Either or both of the search strings are entered in the boxes provided for this purpose in the retrieval parameters window. Below you will find a discussion of all the relevant options of the advanced search level and a discussion of the effects of specific settings.

Search parameters

The window in the following screen shot contains a variety of options which can be set in various ways. The settings determine the behaviour of Corpus Presenter during a retrieval operation. A parameter is set to ‘on’ by clicking in the small box in front of the text describing it. If set, a tick appears in the box. When this box is empty the parameter is not set.

1) Heed case during search If this parameter is not set then uppercase and lowercase letters are treated in the same manner, that is, no distinction is made between capital and small letters. This also applies to any special symbols chosen from the list on the right.

2) List negative finds This option should be used with prudence as it will return all contexts which do not match the search parameters. For a corpus of any size, the results would be enormous and your computer would run out of memory at some stage. The option has only been included for those cases where users really know what they are doing and definitely require negative finds.

3) Double string search This type of search requires two strings, a first one which represents the left-hand section of a syntactic frame and a second one which is the right-hand part. A typical example of a frame would be a phrase or part of a sentence. For instance, if you wish to search for occurrences of do plus have in historical texts of English, say in the Helsinki Corpus, then you might enter the following.

Syntactic frame search

Left-hand string: do
Right-hand string: have

This would return finds like do have, do certainly have, etc. You can furthermore specify whether either or both strings are entire words or only a part of a word, see below.

Special symbols Many corpora contain historical and/or foreign language texts and so you may well wish to search for a string/strings with special symbols in it. The list on the right of the screen allows you to access such symbols. Double click on any symbol and it is entered into the current string. Note the two option buttons below the symbols list. These allow you to specify which of the two input strings a chosen symbol is deposited in.

History When a search is carried out, the string/strings you entered is/are deposited in the history list which appears on the right of the screen. This list can be saved to disk and retrieved at a later point. You can also have more than one history list. When Corpus Presenter is loaded the list used last is automatically retrieved and its contents fill the history array.

4) Use input list 1; Use input list 2 Instead of simply entering a single item for string one or two you may wish to enter a number of forms. It may well happen that the string/word you are looking for, above all in historical corpora, occurs in more than one form. For instance, if you were looking at modal verbs in Middle English then you might want to treat have, haue, hath, has, had, hadde, etc. as instances of the lexeme ‘HAVE’. This is done by creating an input list with all possible spelling variants of ‘HAVE’ and then using this for string one. The same applies to string two; and of course you could have a combination of input forms for string one and for string two.

Using input lists will slow down the performance of Corpus Presenter slightly as it must check on not just two forms for a double string search but on the multiplication of the number of input forms in list one by the number in list two. With modern, faster computers this should not be an issue, however.

5) Find across sentences A syntactic context which you specify for a frame search will probably occur within a sentence. For this reason you are given an option here which is ‘off’ by default. If you wish to deliberately search for a frame which straddles two sentences then set the current parameter to ‘on’.
The set of delimiters for sentences can be edited by the user (see the following screen shot). For instance, if you were dealing with Spanish texts you would want to include the inverted exclamation or question mark symbols as possible sentence delimiters.

6) Allow spaces between strings 1 and 2 A frame search normally aims at returns consisting of several words, i.e. a phrase. However, it is equally possible to search for a word using a frame. For instance, if you wished to find all instances of negated adjectives in a text then you could enter a frame consisting of un and able and specify that intervening spaces are not allowed by removing the tick from the box for the current option. Such a search would return such tokens as unacceptable, unbearable, unthinkable, etc. You could also use an input file list for the beginning of such words. If you had a list with un, in, im, il and a second list with able, ible then you would also find indescribable, impossible, illegible, etc.
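The combination of two input lists with a frame that allows no intervening spaces can be pictured with a short Python sketch. This is only an illustration, not how Corpus Presenter is implemented: the function names are invented here, and the sketch assumes that string 1 is anchored at the beginning of a word and string 2 at the end.

```python
from itertools import product

def affix_pairs(prefixes, suffixes):
    """Every (string 1, string 2) combination that has to be checked."""
    return list(product(prefixes, suffixes))

def matches_frame(word, prefixes, suffixes):
    """True if the word begins with a form from list 1 and ends with a
    form from list 2, with no space between them (assumed anchoring)."""
    w = word.lower()
    return any(
        w.startswith(p) and w.endswith(s) and len(w) >= len(p) + len(s)
        for p, s in product(prefixes, suffixes)
    )

prefixes = ["un", "in", "im", "il"]
suffixes = ["able", "ible"]
words = ["unthinkable", "impossible", "illegible", "table", "income"]

print(len(affix_pairs(prefixes, suffixes)))   # 4 * 2 = 8 combinations
print([w for w in words if matches_frame(w, prefixes, suffixes)])
```

The number of combinations is the product of the two list lengths, which is why long input lists make a double string search slower.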

7) Number of intervening items If you are preparing a two string search then this parameter is of relevance as it determines how much material can occur between the first and second string for the context to be registered as a successful find. You can furthermore specify whether the intervening items are words or characters. The latter would be significant if the search strings are intended to be part of a single word for a successful find to be returned.
If this parameter is set to 0 then the left and right sections of the frame must be immediately adjacent. The maximum number of intervening items (characters or words) is 64.
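The effect of this parameter on a frame such as do plus have can be sketched in Python as follows. The function frame_finds and its simple whitespace tokenisation are assumptions made for the illustration; the program itself may well proceed differently.

```python
def frame_finds(text, left, right, max_between=0):
    """Return (left_index, right_index) token pairs where `left` is followed
    by `right` with at most `max_between` words in between."""
    tokens = text.lower().split()
    finds = []
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        # the right-hand string may sit 1 .. max_between + 1 tokens further on
        for j in range(i + 1, min(i + max_between + 2, len(tokens))):
            if tokens[j] == right:
                finds.append((i, j))
                break
    return finds

sample = "they do have it and they do certainly have more"
print(frame_finds(sample, "do", "have", max_between=0))  # only "do have"
print(frame_finds(sample, "do", "have", max_between=1))  # also "do certainly have"
```

With a setting of 0 only the adjacent pair is registered; with 1 the frame containing a single adverb is found as well.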

8) Amount of context returned When searching for strings, Corpus Presenter can return the context in which they occurred. You can determine how much of this is shown by specifying how many words to the right and left of the string are to be returned.

9) Return whole sentence containing find This can be useful to see what sentences embody a structure which you might be searching for. Bear in mind that a sentence is defined as a syntactic structure which is bound by a sentence delimiter. You can determine the set of such delimiters by editing the appropriate input line on this level of Corpus Presenter.
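As a rough illustration of this delimiter-based definition of a sentence, the following Python sketch cuts out the stretch of text around a find that is bounded by delimiters. The function name and the default delimiter set are assumptions for the example only.

```python
def sentence_around(text, pos, delimiters=".!?"):
    """Return the stretch of text around index `pos` that is bounded
    by sentence delimiters (closing delimiter included)."""
    start = pos
    while start > 0 and text[start - 1] not in delimiters:
        start -= 1
    end = pos
    while end < len(text) and text[end] not in delimiters:
        end += 1
    return text[start:end + 1].strip()

text = "He came late. They do certainly have time! She left."
pos = text.index("have")
print(sentence_around(text, pos))
```

Changing the delimiters argument changes where a sentence is taken to begin and end, which mirrors the effect of editing the delimiter input line.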

10) Delete returns from previous search Unless you wish to accumulate returns in a large composite list, you should select the current option. This will ensure that any previous list is deleted before starting a new search. However, if you wish to retain the returns from the previous search then leave the box here unticked. This option only applies to RTF Text File and Line List returns. When accumulating returns, make sure you do not alter the manner in which returns are shown and do not exit the advanced search level. You may, however, open the parameters window and change something, such as a string for the search, or choose a new input list.
If you wish to, you can save returns in a single- or multi-line grid and reload at some later point.

11) Find delimiters By default these consist of a left and right angular bracket. You can, however, enter any symbols you like which you feel might visually set off a search string in a return context.

12) String position in word This is a simple parameter which determines whether the units used for a search operation are entire words or only sections or indeed whether this consideration is relevant at all for a search. Basically you can specify that a string is to be treated as an entire word, the beginning or end of a word or specify that it may occur anywhere in a word, i.e. that its status as part of a word is immaterial for the pending search.

Bear in mind that Corpus Presenter uses mechanical means for determining if a string is a word, i.e. it looks to see if the string is preceded by a tab stop or a blank or is the first item on a line and then checks to see if it is followed by a blank, tab, comma, full-stop, colon, semicolon or is the last item on a line. The set of word delimiters can be determined by the user editing the list provided in the relevant text box in the search parameters window. This list is stored to disk and re-read in later work sessions.
The possibilities here can render a search, and hence the returns, more accurate. For instance, if you wished to search for the perfective construction of Irish English as in She’s after selling the car you could enter after as String1 and ing as String2 and specify that the position of the latter is at the end of a word. This would ensure that in a sentence like She’s after bringing the dog only the final ing is returned as a valid find for String 2.
On the other hand you could choose the setting Beginning of word in a case like that discussed above under frame search. If you specified that do was only to be returned if found at the beginning of a word then cases like don’t would be registered, which would allow for negated forms of do among your retrieval results.
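The mechanical word test described above can be approximated in a few lines of Python. This is a sketch under stated assumptions: the names, the mode labels and the two delimiter sets are chosen here to mirror the description, not taken from the program.

```python
LEFT_DELIMITERS = set(" \t\n")        # blank, tab stop, start of a line
RIGHT_DELIMITERS = set(" \t\n,.:;")   # blank, tab, comma, full stop, colon, semicolon

def position_ok(text, start, length, mode):
    """Check a find at text[start:start+length] against the position
    mode: 'whole', 'begin', 'end' or 'any'."""
    end = start + length
    at_begin = start == 0 or text[start - 1] in LEFT_DELIMITERS
    at_end = end == len(text) or text[end] in RIGHT_DELIMITERS
    if mode == "whole":
        return at_begin and at_end
    if mode == "begin":
        return at_begin
    if mode == "end":
        return at_end
    return True  # 'any': position in the word is immaterial

def find_all(text, string, mode="any"):
    """Start positions of every occurrence of `string` that passes the mode."""
    hits = []
    i = text.find(string)
    while i != -1:
        if position_ok(text, i, len(string), mode):
            hits.append(i)
        i = text.find(string, i + 1)
    return hits

s = "She's after bringing the dog"
print(find_all(s, "ing", "any"))   # both occurrences inside "bringing"
print(find_all(s, "ing", "end"))   # only the word-final one
```

In the same way, searching "they don't do it" for do with the mode 'begin' picks up both don't and do, while 'whole' keeps only the second.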

13) Intervening items The left and right of the frame can be separated by a specifiable number of intervening items (characters or words). If this is set to 0 then the left and right sections of the frame must be immediately adjacent. To allow simple adverbs in the above example you would set the type of intervening item to words and the number to 1. The maximum number of intervening items (characters or words) is 64.

14) Grid options There are a number of ways for Corpus Presenter to deposit returns from a retrieval run. The simplest is as plain text which can be copied directly via the normal Cut and Paste keys of Windows. The next is as a line list which is slightly more structured in that each return occupies a separate line in the list. The most flexible type of output for retrieval returns is undoubtedly a grid. This is a lattice of rows and columns. There is one row per find. The number of columns depends on the settings for the parameters activated by clicking on the button Grid options (see next two options for details).

a) Multi-line grid This type of repository allows for the context to contain more than one line. For instance, you might find it useful to have several lines before and after the find for a string or strings. This is possible with the current option. There is one important restriction, however: a multi-line grid cannot be saved as a database as the fields of a database can only accept single lines of text.

Possible columns Apart from the column Text section, which is obligatory and which cannot be unchecked, there are a number of other columns which you can add to a multi-line grid by checking them. The columns Location, File name, Node label will include the numeric position of a find in a text, the name of the text file from which it derives and the label for the node in the tree which it occupies. In addition you can add up to 4 user columns. Here you can enter information which you might want to add to that automatically returned by Corpus Presenter.

Marking rows in grids In both the multi-line and the single-line grids you can mark rows discontinuously by holding down the Control key and clicking on a row with the left mouse button. Bear in mind that selected rows can be copied to text at any time by choosing the option Copy to text window which is visible after a set of returns is displayed on this level.

Editing options When you edit a cell in a multi-line grid you will notice that a button appears in the top right-hand corner. If you click this the contents of the first column of the current row appear in a small text window. You can edit a stretch of text, store macros to disk (click button in bottom right-hand corner), load a new file, etc.

b) Single-line grid Here only a limited amount of the context to the left and right of each positive find is returned. These returns can be used directly as the input to a database; for instance, you could process the results with the database editor in the Corpus Presenter suite.

Possible columns Apart from the column Keyword which is obligatory and which cannot be unchecked, there are a number of others which you can insert for a single-line grid by checking them. The left and right flank for a find (trimmed, i.e. with trailing or preceding blanks removed, or not) can be included and you can specify whether the delimiters, set in the main search parameters window, should also be included. The columns Location, File name, Node label will include the numeric position of a find in a text, the name of the text file from which it derives and the label for the node in the tree which it occupies. In addition you can add one or two user columns. Here you can enter information which you might want to add to that automatically returned by Corpus Presenter.

Sorting a single-line grid One inherent advantage of the single-line grid is that it allows one to sort any field in either ascending or descending order. All you do is click on a column heading and the entire grid is sorted according to this column. The sort option is a toggle: clicking a heading once will lead to an ascending sort on that column, clicking again will cause the data to be sorted in descending order.
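The toggle behaviour of the sort can be pictured with a small Python sketch. The class GridSorter, the representation of rows as dictionaries and the reset to ascending order when a different column is clicked are all assumptions made for the illustration.

```python
class GridSorter:
    """Remembers the last sorted column so that a second click on the
    same heading reverses the order."""

    def __init__(self, rows):
        self.rows = rows            # list of dicts, one per find
        self.last_column = None
        self.descending = False

    def click(self, column):
        if column == self.last_column:
            self.descending = not self.descending   # toggle on repeat click
        else:
            self.last_column = column               # new column: ascending (assumed)
            self.descending = False
        self.rows.sort(key=lambda r: r[column], reverse=self.descending)
        return [r[column] for r in self.rows]

grid = GridSorter([
    {"Keyword": "have", "Location": 310},
    {"Keyword": "do", "Location": 12},
    {"Keyword": "hath", "Location": 95},
])
print(grid.click("Keyword"))   # ascending
print(grid.click("Keyword"))   # descending
```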

c) Separators between output fields There are different ways of separating the fields for each return (assuming that the repository type is a text). Four common options are offered here, along with the choice of having no separator. These various types are a matter of personal taste; it is best just to try them out and see how you find them.

d) Output file in Corpus Presenter Table Editor format There is a supplied program for editing tables generated on the advanced search level, namely Corpus Presenter Table Editor. If you choose to save returns in this format then you can edit these separately with the table editor and export them from there to a text editor. One of the advantages of this is that the table editor can handle several tables at once. If you have several return sets you can edit these as a group later independently from Corpus Presenter.
For this option one must use the Multi-line grid for returns. A set of returns with this storage mode would look like the following.

Exporting rows from a single- or multi-line grid can be sensitive to three settings which can be seen in the following screen shot. Essentially, you can specify that only selected rows are exported, that only the text in returns is exported, i.e. no information about location and filename, and that only those columns of returns which are currently visible are exported.

When you are viewing returns (see previous two screen shots) the large buttons at the top of the screen change. On the right-hand side is one labelled Goto text. If you click on this then the returns window is hidden and the text window beneath shows the text where the currently highlighted find was made. The find itself is selected (shown as white lettering on a black background). You can bring the returns window to the foreground again by clicking on the button Last results.

Cocoa parameters

One means of specifying various items of information about a corpus text is to mention these in a header at the beginning of each file. A system which is quite widespread among corpora is the Cocoa parameter set. This consists of up to 32 parameters with typical settings for certain file types. For instance, the texts of the Helsinki Corpus are all encoded with a Cocoa header in which information is given about a following text. The settings can be used in Corpus Presenter to determine what files are examined during a retrieval operation. The way this is done is outlined in the following.

To determine a setting you copy a Cocoa parameter from one of the text files of your corpus, say or , into the Windows clipboard by marking the line and pressing Ctrl-C when you are in a text program such as the Corpus Presenter Text Tool. Now move to the current window, click the text line at the bottom of the screen and retrieve the contents of the clipboard with Ctrl-V. You then double click the position in the settings list on the right-hand side of the screen where this parameter belongs. Repeat this procedure for as many Cocoa settings as you require for the impending search. Settings which are empty will be ignored.

1 <B = ‘name of text file’>

2 <Q = ‘text identifier’>

3 <N = ‘name of text’>

4 <A = ‘author’>

5 <C = ‘part of corpus’>

6 <O = ‘date of original’>

7 <M = ‘date of manuscript’>

8 <K = ‘contemporaneity’>

9 <D = ‘dialect’>

10 <V = ‘verse’ or ‘prose’>

11 <T = ‘text type’>

12 <G = ‘relationship to foreign original’>

13 <F = ‘foreign original’>

14 <W = ‘relationship to spoken language’>

15 <X = ‘sex of author’>

16 <Y = ‘age of author’>

17 <H = ‘social rank of author’>

18 <U = ‘audience description’>

19 <E = ‘participant relationship’>

20 <J = ‘interaction’>

21 <I = ‘setting’>

22 <Z = ‘prototypical text category’>

23 <S = ‘sample’>

24 <P = ‘page’>

25 <L = ‘line’>

26 <R = ‘record’>

Note that many nodes in a tree may contain a reference to a file DUMMY.RTF. This is an empty file which is used as a placeholder only and is ignored in all retrieval operations.
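How Cocoa settings can restrict the files examined may be sketched in Python as follows. The sketch assumes header lines of the form <B SAMPLE1>, i.e. a one-letter code followed by a value inside angle brackets, and an exact match on each value; the function names and the sample values are invented for the example and the matching used by Corpus Presenter may differ.

```python
import re

# one-letter Cocoa code, whitespace, then the value, inside angle brackets
COCOA_LINE = re.compile(r"^<([A-Z])\s+(.*?)>\s*$")

def read_cocoa_header(lines):
    """Collect code/value pairs until the first line that is not a parameter."""
    header = {}
    for line in lines:
        m = COCOA_LINE.match(line)
        if not m:
            break
        header[m.group(1)] = m.group(2).strip()
    return header

def file_selected(header, settings):
    """Empty settings are ignored; every other setting must match the header."""
    return all(header.get(code) == value
               for code, value in settings.items() if value)

sample_file = [          # invented header values, for illustration only
    "<B SAMPLE1>",
    "<A DOE JANE>",
    "<V PROSE>",
    "<T LETTER>",
    "The text itself begins here.",
]
header = read_cocoa_header(sample_file)
print(header)
print(file_selected(header, {"V": "PROSE", "T": ""}))   # empty T is ignored
print(file_selected(header, {"V": "VERSE"}))
```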

Saving returns to disk

Once you have a set of returns (from a retrieval run) you can choose to save them to a text file. Note that you may determine here whether to save all rows in a grid or just those which you have selected (highlighted) in the grid. There are four basic types of file which can be used as output on disk.

	File type	File extension
1)	Plain text file (simplest form)	.OUT
2)	Single-line grid (one row per line)	.GRD
3)	Reloadable Table Editor format	.TBX
4)	Database (finds are stored as records)	.DBF

The choice of file type depends ultimately on what you want to do with the results. If your aim is to have the results transferred to a table in a text file then you should choose type (3). If you wish to process the data in a database management environment, then choose type (4). Should you wish to have the returns in a grid in which there is one row for each find and where columns can be determined flexibly by the user, then choose type (2). Type (1) is the simplest of all and can be used where you are just experimenting; this storage type for returns is the fastest of all as the returns do not have to be formatted.

When you are saving returns to disk, you can choose to only save those returns (from any kind of grid) which you have selected. Furthermore, if the returns have been deposited in a multi-line grid, you can also decide whether the contents of each column or only those columns which are visible are to be saved. To hide some columns of a multi-line grid from view, choose the option Columns to view in the Display menu on the Advanced search level.

Markup codes

Markup codes, e.g. text formatting or comments, are stretches of text which are ignored during searches by Corpus Presenter. You can specify what delimiters are to be recognised as markup codes in your files. You can also specify a particular code which marks an entire line as a comment. Tick the relevant box to have markup codes ignored during searches. Note, however, that checking for markup codes can slow down searches somewhat, especially if they range across an entire corpus of considerable size.
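The idea of ignoring markup can be illustrated with a minimal Python sketch. The function strip_markup, the square brackets as markup delimiters and the percent sign as comment code are hypothetical choices for the example; in the program you set these yourself.

```python
import re

def strip_markup(text, open_delim="[", close_delim="]", comment_code="%"):
    """Drop comment lines and remove delimited markup before searching."""
    pattern = re.escape(open_delim) + r".*?" + re.escape(close_delim)
    kept = []
    for line in text.splitlines():
        if comment_code and line.startswith(comment_code):
            continue                      # whole line is a comment
        kept.append(re.sub(pattern, "", line))
    return "\n".join(kept)

raw = "% editorial note\nthey [italic]do[/italic] have it"
print(strip_markup(raw))
```

The extra pass over every line is also the reason why this check costs some time on a large corpus.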

Collocations

Assuming that you saved your returns to a line grid in a similar manner to those in the following screen shot then you have the additional option of letting Corpus Presenter analyse these returns for the collocations in which they occur. To do this, choose the option Determine collocation in the Profile menu or press F12.

You then specify between 1 and 8 words on one or both sides of the keyword for the restructuring of the returns. The program will reorganise the finds and display them in a grid in which the words to the left and right of the keyword are deposited in separate columns. You can sort the grid according to any column by just clicking on the column header. The percentages for the words flanking the keyword are shown in the lower half of the collocations window. This text and any selected rows of the collocations grid can be transferred to the Windows clipboard for retrieval into your own software.
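The kind of count that lies behind such percentages can be sketched in Python. This is an assumption about the calculation, not a description of the program: the function names are invented, and a percentage is taken here as a word's share of all words counted on that side of the keyword.

```python
from collections import Counter

def collocates(returns, keyword, span=1):
    """Count the words within `span` positions to the left and right of the keyword."""
    left, right = Counter(), Counter()
    for line in returns:
        tokens = line.lower().split()
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left.update(tokens[max(0, i - span):i])
                right.update(tokens[i + 1:i + 1 + span])
    return left, right

def percentages(counter):
    """Share of each word among all counted words, in per cent."""
    total = sum(counter.values())
    if not total:
        return {}
    return {word: round(100 * n / total, 1) for word, n in counter.items()}

returns = ["they do have it", "we do have time", "you will have it", "I do have one"]
left, right = collocates(returns, "have", span=1)
print(percentages(left))
print(percentages(right))
```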
