5 October 2016
types2 is a freely available corpus tool for comparing the frequencies of words, types, and hapax legomena across subcorpora. The tool uses accumulation curves and the statistical technique of permutation testing to compare the subcorpora with a “typical” corpus of a similar size, in order to visualize the frequencies and to identify statistically significant findings. The new version of the tool makes it easy to explore the results with the help of an interactive web user interface that provides direct access to relevant findings, visualizations, metadata, and corpus texts (Figure 1).
The basic principles of the tool, as well as how to use it to explore and analyse corpus data, are explained in the paper “types2: Exploring word-frequency differences in corpora” (Säily & Suomela 2016). In this manual we will look at the more technical part: given your data, how to enter it in types2, run the data analysis, and start to explore the results.
In this manual, we will use the same running examples and data sets as in the above paper:
The results that you see in the web user interface (see Figure 1 for an example) are precomputed. Permutation testing is computationally intensive, and we do not want to redo it in the web browser every time we start to explore the results.
Our tool consists of two separate parts, which we will refer to as the backend and the frontend. The frontend is the web interface that we have already seen; the backend is the set of tools that performs the heavy computations. The backend and the frontend are of a fairly different nature, as summarized in the following table.
| | Backend | Frontend |
|---|---|---|
| User interface | command-line tools that you run in a terminal | web pages that you access with a web browser |
| Supported platforms | OS X and Linux | modern web browsers |
| Programming languages | C and Python | JavaScript |
| APIs, standards, libraries, and external resources | OpenMP, SFMT, SFMTJump (Saito & Matsumoto 2008) | HTML5, SVG, D3.js, Source Sans Pro, Google Fonts |
| Data storage | SQLite database, read-write | JSON-formatted text file, read-only |
| Running time | may take several minutes | instantaneous |
For any given data set and research question, you only need to run the backend once. After that, you will have all your results in one folder, together with the web user interface, and you can easily distribute it among your colleagues and even make it available online – just like we have done with our sample data set.
From the user’s perspective, the key component of the backend is the SQLite database that is used to store information. An SQLite database is just a single file. In our case the file is usually named `db/types.sqlite`. This is the database in which you will enter all of the input data for the backend to process, and it is also the database that will eventually contain the end results of the analysis.
SQLite databases are very widely supported. Numerous programming languages can access SQLite databases; for example, in Python scripts you can use the standard module `sqlite3` to read and write data in SQLite databases. On OS X and Linux you can use the command-line tool `sqlite3` to access the database. There are also graphical user interfaces for exploring SQLite databases and for importing and exporting data; one option for OS X users is Base.app.
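For example, the following minimal Python snippet uses only the standard `sqlite3` module to open the database and list its tables:

```python
import sqlite3

# Open the types2 database (a single file) and list its tables.
conn = sqlite3.connect("db/types.sqlite")
query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
for (name,) in conn.execute(query):
    print(name)
conn.close()
```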
Our tool is freely available for download at the following sites:
The software distribution comes with up-to-date installation instructions. In brief, you will need a Unix-like operating system (e.g., OS X or Linux), a modern C compiler, Python 2, and the SQLite development libraries. After downloading the software, you can configure it with the command `./config` and compile it by running `make`. Now you should have the backend of types2 ready for use in the `bin` subdirectory.
A typical workflow consists of the following steps:
1. Enter your input data in the database `db/types.sqlite`.
2. Run `bin/types-run` to perform the data analysis.
3. Run `bin/types-web` to create the web user interface.
After these steps, you should have a new subdirectory `web`, which will contain the frontend of types2, together with the end results of the data analysis, ready for use. You can simply point your web browser to `web/index.html` to explore the results. You do not need the database any more at this point, but naturally you should keep it in a safe place in case you e.g. need to fix mistakes in the input and re-run the backend.
Typically, the most challenging part of using the tool is gathering your input data and getting it into the right format so that you can enter it in the database. To that end, we will need to understand the basic principles of how types2 works.
The key concept that one needs to understand is a sample. This is the smallest independent chunk of your corpus. The only modelling assumption that our software makes is that the samples are independent of each other. All data analysis is done at the level of samples. A typical choice in a sociolinguistic study would be:
Your corpus consists of all samples. You can then form any number of collections that consist of the samples that are of particular interest. In a sociolinguistic study, collections would typically reflect the social metadata associated with your samples, for example:
In essence, types2 will form a very large number of random collections and compare them with the collections that you have specified in the database. In the CEEC example, we can then see if, say, the number of -ness types in the collection of letters written by women is low or high in comparison with random collections of the same size. The null hypothesis here would be that type richness in the collection of letters written by women is no different from the type richness of a random collection.
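To make the idea concrete, here is a simplified Python sketch of such a Monte Carlo permutation test. This is only a conceptual illustration, not the actual types2 implementation – the real backend is optimized C code that computes full accumulation curves – and the sample representation (dictionaries with hypothetical `wordcount` and `tokencodes` fields) is ours:

```python
import random

def count_types(samples):
    """Number of distinct tokencodes occurring in a set of samples."""
    types = set()
    for s in samples:
        types.update(s["tokencodes"])
    return len(types)

def permutation_test(samples, collection, iterations=10000, seed=1):
    """Estimate how unusual the type count of `collection` is by
    comparing it with random collections of roughly the same size
    (in running words) drawn from the whole corpus.

    Returns an estimate of the one-sided p-value for the hypothesis
    that the collection has suspiciously few types for its size."""
    rng = random.Random(seed)
    target = sum(s["wordcount"] for s in collection)
    observed = count_types(collection)
    at_most = 0
    for _ in range(iterations):
        rng.shuffle(samples)  # note: shuffles the list in place
        # Take a random prefix with roughly `target` running words.
        prefix, words = [], 0
        for s in samples:
            if words >= target:
                break
            prefix.append(s)
            words += s["wordcount"]
        if count_types(prefix) <= observed:
            at_most += 1
    return (at_most + 1.0) / (iterations + 1.0)
```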
For such an analysis to make sense, it is essential that the samples are indeed (relatively) independent of each other. In particular, the samples must not be too small. For example, one-sentence samples would typically be far too small, as individual sentences within the same text are not independent of each other. In the CEEC example, if we split our entire corpus into sentence-sized samples, we would get meaningless results. In brief, the issue is that the random collections would be formed by freely mixing together individual sentences from different letters and by different authors, while the collections prepared by us would always keep each letter intact. Hence it would only be expected that the random collections would be more diverse.
We suggest the following rule of thumb for splitting your corpus into samples: if you randomly shuffled the order of the samples in your corpus, the end result should still make sense to a human reader. Naturally, this is not always easy to achieve, and e.g. in our BNC study we had to use a somewhat more creative approach.
The final concept that we will need is a token. A token is a single occurrence of a word of interest in the corpus. For example, in the CEEC study, each word formed with the suffix -ness is a token. For each token, we need to define a tokencode, which is a normalized form of the word. All occurrences with the same tokencode count as one type. A typical choice would be:
For example, in the CEEC study, we have normalized the spelling and lemmatized the words so that all spellings of e.g. the noun “witness”, in singular or plural, count as the same type. In that case, the tokencode is simply the lemma.
Now we are ready to describe how to enter your data in the SQLite database used by types2. As a starting point, you can use the empty database that you will find in `template/types.sqlite`: you can simply copy the template to `db/types.sqlite` and start to fill it in.
The database structure is flexible enough to analyse multiple linguistic features in one go (for example, the diversity of both -ness and -ity), as well as to perform the same analysis for multiple unrelated corpora. However, to keep things simple, we will explain the case of a single corpus and a single dataset. In this case:
Now we are ready to specify the samples and collections:
Finally, we list all occurrences of the tokens:
This is all that the backend needs to perform its analysis. By default, it will analyse the number of types, but you can also request additional data analysis by adding any of the following codes to the table `defaultstat`:
- `type-word`: analyse types / running words
- `type-token`: analyse types / tokens
- `hapax-word`: analyse hapaxes / running words
- `hapax-token`: analyse hapaxes / tokens
- `token-word`: analyse tokens / running words
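As a sketch of what this looks like in practice, the following Python fragment inserts a minimal set of rows, assuming the table and column names shown in the example tables later in this manual. The corpus, sample, and token values are hypothetical, and the template database may contain further optional columns:

```python
import sqlite3

# Assumes you have copied template/types.sqlite to db/types.sqlite.
conn = sqlite3.connect("db/types.sqlite")
c = conn.cursor()

# One corpus and one dataset.
c.execute("INSERT INTO corpus (corpuscode) VALUES (?)",
          ("my-corpus",))
c.execute("INSERT INTO dataset (corpuscode, datasetcode) VALUES (?, ?)",
          ("my-corpus", "ness"))

# A sample with its word count, a collection, and their association.
c.execute("INSERT INTO sample (corpuscode, samplecode, wordcount)"
          " VALUES (?, ?, ?)",
          ("my-corpus", "sample-1", 6987))
c.execute("INSERT INTO collection (corpuscode, collectioncode, groupcode)"
          " VALUES (?, ?, ?)",
          ("my-corpus", "Female", "gender"))
c.execute("INSERT INTO sample_collection"
          " (corpuscode, samplecode, collectioncode) VALUES (?, ?, ?)",
          ("my-corpus", "sample-1", "Female"))

# Token counts per sample, and the statistics to analyse by default.
c.execute("INSERT INTO token"
          " (corpuscode, samplecode, datasetcode, tokencode, tokencount)"
          " VALUES (?, ?, ?, ?, ?)",
          ("my-corpus", "sample-1", "ness", "business", 2))
c.execute("INSERT INTO defaultstat (statcode) VALUES (?)",
          ("type-word",))

conn.commit()
conn.close()
```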
To make it easier to explore the results in the frontend, you can also provide more information on the occurrences of the tokens. To that end, you can optionally fill in the table `context`, in which you can list the occurrences in a format familiar from concordance tools.
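In the same vein, a row of the table `context` could be inserted along these lines. The values are again hypothetical; the column names follow the example tables shown later, and the optional `link` column is omitted:

```python
import sqlite3

conn = sqlite3.connect("db/types.sqlite")
# One row per occurrence, in KWIC style: the "word" column holds the
# occurrence as it appears in the corpus, with its left and right context.
conn.execute(
    "INSERT INTO context"
    " (corpuscode, samplecode, datasetcode, tokencode, before, word, after)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("my-corpus", "sample-1", "ness", "business",
     "His", "bisnes", "was to take Epsom watters"))
conn.commit()
conn.close()
```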
Let us now have a closer look at how we have populated the database in our BNC case study. While reading this section, you can also explore the data in the web interface, or you can download the entire database from the following Git repositories:
In the Git repository, the directory `empty` contains the database template, `bnc-input` contains the database with all input data ready for analysis, and `bnc-output` shows what the database looks like after running the backend.
We combine data from both BNCweb and MorphoQuantics. The process is semi-automatic: we first wrote a script that tries to match BNCweb search results with MorphoQuantics data as well as possible. After this step, we were left with ambiguities, which we resolved manually by pruning and normalizing the data in Excel. Finally, our scripts combined the input from all three data sources – BNCweb, MorphoQuantics, and Excel – and used it to populate the types2 database. You can find the Python scripts that we used at the following repositories:
Most of the code is related to the task of making the best possible use of the MorphoQuantics data automatically. In brief, for each POS-tagged word, we compare the counts in the MorphoQuantics data with the total number of hits in the BNCweb search results, and if these are (almost) identical, we use the MorphoQuantics classification. Once the data is in a clean, tabular form, the remaining task of inserting it into the correct database tables is straightforward.
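The following hypothetical sketch captures the gist of the matching step; the function name, data structures, and tolerance threshold are ours, and the real scripts in the repositories above are more involved:

```python
def morphoquantics_matches(word, bncweb_hits, mq_counts, tolerance=0.02):
    """Decide whether we can reuse the MorphoQuantics classification
    for a POS-tagged word: accept it if the MorphoQuantics frequency
    is (almost) identical to the number of BNCweb hits; otherwise
    leave the word for manual resolution in Excel."""
    mq_hits = mq_counts.get(word)
    if mq_hits is None:
        return False
    return abs(mq_hits - bncweb_hits) <= tolerance * max(bncweb_hits, 1)
```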
In what follows, we will show examples of precisely what data we have in each database table.
corpus
In this study, we use just one corpus, the demographically sampled spoken component of the BNC. We will use the code `bnc-spoken-demo` for this corpus.
corpuscode |
---|
bnc-spoken-demo |
dataset
We want to explore both -er and -or separately. Furthermore, we want to see statistics for each suffix in general, as well as for their animate (`person`) and inanimate (`thing`) senses separately. Hence in the end we will have six different datasets. Note that all data from e.g. `er-person` and `er-thing` will also be duplicated in the dataset `er`.
corpuscode | datasetcode |
---|---|
bnc-spoken-demo | er |
bnc-spoken-demo | er-person |
bnc-spoken-demo | er-thing |
bnc-spoken-demo | or |
bnc-spoken-demo | or-person |
bnc-spoken-demo | or-thing |
sample
The key challenge with the spoken BNC data is to define meaningful samples. On the one hand, each sample should contain spoken language from only one person, and the samples should be meaningful entities that are preferably as independent of each other as possible. On the other hand, natural dialogue by its very nature interleaves language spoken by different people.
In the end, we have chosen to define our samples so that one sample corresponds to one speaker. For example, our sample `PS007` consists of everything that was recorded from the informant whose speaker ID in the BNC is `PS007`. As we are interested in sociolinguistic aspects, we have discarded speakers whose gender or social class was unknown.
For the purposes of analysing types vs. running words, it is necessary to have word counts for each of the speakers. The descriptions and hyperlinks are optional, but they make it much easier to explore and interpret the results in the web frontend.
corpuscode | samplecode | wordcount | description | link |
---|---|---|---|---|
bnc-spoken-demo | PS007 | 6987 | AB Male 60- | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB0&spid=PS007&urlTest=yes |
bnc-spoken-demo | PS01A | 13306 | DE Male 45-59 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB1&spid=PS01A&urlTest=yes |
bnc-spoken-demo | PS01B | 12587 | DE Female 45-59 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB1&spid=PS01B&urlTest=yes |
bnc-spoken-demo | PS01D | 3993 | DE Female 15-24 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB1&spid=PS01D&urlTest=yes |
bnc-spoken-demo | PS01T | 7025 | DE Male 60- | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB2&spid=PS01T&urlTest=yes |
bnc-spoken-demo | PS01V | 16711 | DE Female 60- | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB2&spid=PS01V&urlTest=yes |
bnc-spoken-demo | PS028 | 1442 | C2 Male 15-24 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB5&spid=PS028&urlTest=yes |
bnc-spoken-demo | PS029 | 8397 | DE Female 25-34 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB6&spid=PS029&urlTest=yes |
bnc-spoken-demo | PS02A | 221 | DE Male 25-34 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB6&spid=PS02A&urlTest=yes |
bnc-spoken-demo | PS02F | 406 | DE Male -14 | http://bncweb.lancs.ac.uk/cgi-binbncXML/speakerInfo_new.pl?text=KB6&spid=PS02F&urlTest=yes |
… |
collection
Our goal is to explore two social factors: gender and social class. To this end, we define four collections: women (`Female`), men (`Male`), higher social classes (`AB+C1`), and lower social classes (`C2+DE`). For example, in the collection `C2+DE` we have put together all speakers whose social class in the BNC metadata was either `C2` or `DE` – this roughly corresponds to the working class.
We use the `groupcode` column to group together the collections related to gender and those related to social class. This is mainly for convenience, so that we can more easily explore either of these dimensions separately in the web frontend.
corpuscode | collectioncode | groupcode |
---|---|---|
bnc-spoken-demo | Female | gender |
bnc-spoken-demo | Male | gender |
bnc-spoken-demo | AB+C1 | social class |
bnc-spoken-demo | C2+DE | social class |
sample_collection
For example, the informant `PS007` was a man with social class `AB` in the BNC metadata. Hence in our database the sample `PS007` will be in two collections, `Male` and `AB+C1`, and we insert two rows in the table `sample_collection`.
corpuscode | samplecode | collectioncode |
---|---|---|
bnc-spoken-demo | PS007 | AB+C1 |
bnc-spoken-demo | PS007 | Male |
bnc-spoken-demo | PS01A | C2+DE |
bnc-spoken-demo | PS01A | Male |
bnc-spoken-demo | PS01B | C2+DE |
bnc-spoken-demo | PS01B | Female |
bnc-spoken-demo | PS01D | C2+DE |
bnc-spoken-demo | PS01D | Female |
bnc-spoken-demo | PS01T | C2+DE |
bnc-spoken-demo | PS01T | Male |
… |
token
For each sample, we extract all relevant occurrences of the words formed with -er or -or. The table `token` gives an overview of the findings. For example, the speaker `PS007` used the word brainstormer once and the word computer four times in total.
corpuscode | samplecode | datasetcode | tokencode | tokencount |
---|---|---|---|---|
bnc-spoken-demo | PS007 | er | brainstormer | 1 |
bnc-spoken-demo | PS007 | er | carrier | 1 |
bnc-spoken-demo | PS007 | er | checker | 1 |
bnc-spoken-demo | PS007 | er | computer | 4 |
bnc-spoken-demo | PS007 | er | copier | 1 |
bnc-spoken-demo | PS007 | er | farmer | 1 |
bnc-spoken-demo | PS007 | er | freezer | 1 |
bnc-spoken-demo | PS007 | er | lever | 1 |
bnc-spoken-demo | PS007 | er | passenger | 2 |
bnc-spoken-demo | PS007 | er | preacher | 2 |
… |
context
Our backend only needs the summary information that is given in the table `token`, but for the benefit of a user exploring the results in the web frontend, we also add more detailed information on each occurrence of each token. For example, while we had just one row in the table `token` for the word computer in sample `PS007`, there will be four distinct rows in the table `context`, one for each occurrence.
corpuscode | samplecode | datasetcode | tokencode | before | word | after | link |
---|---|---|---|---|---|---|---|
bnc-spoken-demo | PS007 | er | brainstormer | So , in the next elders meeting we used a management technique , | brainstormers | think the say what you think . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=1078&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | carrier | No , I 'll put them in this , why use up your | carrier | bags , anyhow this is an easier way to carry | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=199&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | checker | And a computer you 've got a | spellchecker | . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=3303&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | computer | We just want a bit of information about setting up a | computer | . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=2877&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | computer | A | computer | and a printer , mm . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=3321&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | computer | You see the beauty of the beauty of the | computer | or wordprocessor compared to an electronic typewriter is you can finish your document and then do your spellcheck . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=3325&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | computer | And a | computer | you 've got a spellchecker . | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=3303&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | copier | Oh I see , it 's not an ordinary | photocopier | ? | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=3237&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | farmer | Because there are worries that it 's going to er , effect the outcome of this mad ref , referendum , you know they the | farmers | are sort of desperate and they think they need government help and they think they would n't get it because of the black government so | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=620&spids=1&interval=11&first=yes&urlTest=yes |
bnc-spoken-demo | PS007 | er | freezer | Er , half a pound will do , no give me a pound , I 'll put them in the | freezer | and | http://bncweb.lancs.ac.uk/cgi-binbncXML/context.pl?text=KB0&qname=nosol&refnum=0&theData=0&len=0&showTheTag=0&color=0&begin=186&spids=1&interval=11&first=yes&urlTest=yes |
… |
defaultstat
In this study, we are interested in exploring all the statistics that our software supports. Hence we add the following codes to the table `defaultstat`.
statcode |
---|
hapax-token |
hapax-word |
token-word |
type-token |
type-word |
Let us now briefly discuss the key technical aspects of the CEEC example. The database for this study is also available in the following Git repositories:
The directory `ceec-input` contains the database with all input data ready for analysis, and `ceec-output` shows the results after running the backend. As noted before, you can explore the results online.
We will not discuss all database tables here, as the basic idea is similar to that of the BNC study. However, we will highlight some key differences.
corpus
In this database, we define two different corpora. The source data is the same in both cases, but the way in which the text is divided into samples varies.
In the corpus `ceec-1680-1800` we use the fairly natural choice that a sample consists of all letters written by one person during a 20-year period; this enables us to study how the diversity of the language varies between different writers and between different time periods.
In the corpus `ceec-1680-1800-relcode` we have further divided the samples based on the relationship code in the CEEC metadata. This code describes the relationship between the sender and the recipient of the letter. This corpus hence makes it possible to study how the relationship influences the way in which language is used in letters.
corpuscode | description |
---|---|
ceec-1680-1800 | CEEC & CEECE, 1680-1800, sample = letters with a certain sender & period |
ceec-1680-1800-relcode | CEEC & CEECE, 1680-1800, sample = letters with a certain sender & period & relcode |
sample
In the corpus `ceec-1680-1800`, each sample is named after the person code in the CEEC metadata, followed by a number that refers to a 20-year period. We link each sample to information on the person in the CEECer web interface.
corpuscode | samplecode | wordcount | link |
---|---|---|---|
ceec-1680-1800 | A1HATTON-1680 | 889 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=A1HATTON |
ceec-1680-1800 | A1MONTAGUE-1680 | 1517 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=A1MONTAGUE |
ceec-1680-1800 | A2HATTON-1680 | 478 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=A2HATTON |
ceec-1680-1800 | A2HATTON-1700 | 287 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=A2HATTON |
ceec-1680-1800 | A2MONTAGUE-1680 | 473 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=A2MONTAGUE |
ceec-1680-1800 | AAHANOVER-1780 | 3797 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=AAHANOVER |
ceec-1680-1800 | ABANKS-1700 | 266 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=ABANKS |
ceec-1680-1800 | ABATHURST-1700 | 7505 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=ABATHURST |
ceec-1680-1800 | ABATHURST-1720 | 8891 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=ABATHURST |
ceec-1680-1800 | ACLARK-1760 | 215 | http://h89.it.helsinki.fi/ceec/func/personFunc.jsp?personID=ACLARK |
… |
context
The following glimpse of the table `context` illustrates the way in which tokens are normalized in the CEEC study. Here the `word` column contains the token as it was written in the corpus, while `tokencode` gives the normalized version that will be used in the analysis. The normalization process was mostly manual work: we extracted raw concordances from the corpus into Excel and edited and pruned them there. The final results were then combined with the CEEC metadata and imported into the database.
corpuscode | samplecode | datasetcode | tokencode | before | word | after | link |
---|---|---|---|---|---|---|---|
ceec-1680-1800 | A1MONTAGUE-1680 | ity | opportunity | Epsom watters, that he might not grow to fatt, and to buy some little mourninge, made mee take this | oppertunitie | . The towne is very | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_075 |
ceec-1680-1800 | A1MONTAGUE-1680 | ity | parity | was unsattisfied, except those that out of favour to some of the | parity | might wish it orther wayes. The scandall laid upon y=e= wittnesses was much taken offe, and, untill I heare w=t= y=e= L=ds= have done, I will say noe more as to my oppinion. I | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_072 |
ceec-1680-1800 | A1MONTAGUE-1680 | ity | quality | Here is strange neues from France. A great many people of the best | quality | accused for pois[{on{]ing. Madame de Sowison (Madame Mazerine sister) hath made her escape. She is said to have bine the death of her husband. Severall more w=ch= was | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_067 |
ceec-1680-1800 | A1MONTAGUE-1680 | ness | business | mee. I came w=th= my L=d= last weeke to this place, and designe w=th=ing two dayes to returne to Leez. His | bisnes | was to take Epsom watters, that he might not grow to fatt, and to buy some little mourninge, made mee take this oppertunitie. The towne | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_075 |
ceec-1680-1800 | A1MONTAGUE-1680 | ness | illness | relations out of this family, I cannot but be frighted w=th= the least rumore of | illness | in one soe very neare unto mee, and for whom I have soe tender an affectione, w=ch= she justly merritts from mee. | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_075 |
ceec-1680-1800 | A1MONTAGUE-1680 | ness | kindness | will gaine a perfitt health before she proove w=th= childe. Your L=dsp= care and | kindness | she hath soe often founde, that all must be fully sattisfied nothing could bee wantinge was to be had in order to helpe. Wee have lost lately soe many relations | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_075 |
ceec-1680-1800 | A1MONTAGUE-1680 | ness | witness | parity might wish it orther wayes. The scandall laid upon y=e= | wittnesses | was much taken offe, and, untill I heare w=t= y=e= L=ds= have done, I will say noe more as to my oppinion. I cannot but looke upon it as a wonderful providence of God Allmighty | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON_072 |
ceec-1680-1800 | A2HATTON-1680 | ity | quality | Ibbings who danced at Lincolns Inn Feild and is lately dead. But as y=e= | quallity | of y=e= Ladys that dance at Court is not to be compared w=th= so mean a pe | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON2_072 |
ceec-1680-1800 | A2HATTON-1680 | ness | bigness | hear in them y=t= a good one cant cost les then 60 pound, and y=t= monstros | bignes | w=th= his lettle face did not look so well. I hear Lady Banbery is dead, a | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON2_073 |
ceec-1680-1800 | A2HATTON-1700 | ness | business | married, which I think should make this young lady afraid. I have so much | buisness | here y=t= I hope my Lady will excuse me till next post. I beg my duty to h | http://h89.it.helsinki.fi/ceec/func/letterFunc.jsp?letterID=HATTON2_075 |
… |
The previous sections have given two examples of how to populate the database. Now let us assume that we have a database ready for analysis in the file `db/types.sqlite`, which is the default location used by the types2 backend. From the user’s perspective, the remaining steps are easy – just run the following two commands:
```
bin/types-run
bin/types-web
```
The first command will read the input from the database, do the data analysis, and write the results back to the database. You will find lots of new data in the tables `log`, `result_curve`, `result_curve_point`, `result_p`, and `result_q`. The second command will construct the web frontend. Now you can open the file `web/index.html` in your web browser.
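If you want to inspect the raw results directly, you can again use any SQLite client. For instance, this small Python snippet prints the column layout and a few rows of each result table, without assuming anything about their exact schema:

```python
import sqlite3

conn = sqlite3.connect("db/types.sqlite")
for table in ("result_p", "result_q", "result_curve"):
    # PRAGMA table_info gives one row per column; field 1 is the name.
    cols = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    print("%s: %s" % (table, ", ".join(cols)))
    for row in conn.execute("SELECT * FROM %s LIMIT 3" % table):
        print("  %r" % (row,))
conn.close()
```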
While running the backend is easy for the user once we have the data in the database, it is not that easy for the computer. The tool relies on Monte Carlo permutation testing (Mitzenmacher & Upfal 2005: 252), and with the default parameters, it constructs ten million random permutations of each corpus, for each data set. While the implementation is highly optimized C code, this inevitably takes a fair amount of time.
With our sample data sets, on a modern desktop computer, both the BNC data and the CEEC data took some 5–6 minutes to process. To speed things up, you can use a high-end computer with a multi-core processor – the tool automatically parallelizes computations for any number of processors.
However, when you are setting up your database, it is convenient to perform a quick, less thorough computation to make sure that you have configured everything correctly. To this end, you can use smaller values for the `citer` and `piter` parameters:
```
bin/types-run --citer=100000 --piter=100000
bin/types-web
```
With these parameters, we only construct some 100,000 random permutations. You should not rely on the results that you obtain this way – in particular, the estimates of the p-values may be somewhat off – but you will nevertheless get a rough approximation of the results that you will eventually get. Once you are happy with your database, you can recalculate everything with the default settings:
```
bin/types-run --recalc
bin/types-web
```
We conclude with a complete example of how to download the software and sample data, how to compile and run it, and how to start to explore the results. This example was tested on Mac OS X, but the usage on Linux is similar.
```
git clone https://github.com/suomela/types.git
git clone https://github.com/suomela/types-examples.git
cd types
./config
make
mkdir db
cp ../types-examples/bnc-input/db/types.sqlite db/types.sqlite
bin/types-run --citer=100000 --piter=100000
bin/types-web
open web/index.html
```
You can get more information on the usage of the backend with the command-line switch `--help`:
```
bin/types-run --help
bin/types-web --help
```
The authors wish to thank CSC – IT Center for Science, Finland, for computational resources. This work was supported in part by the Academy of Finland grant 276349 to the project ‘Reassessing language change: the challenge of real time’.
Data cited herein have been extracted from the British National Corpus, distributed by the University of Oxford on behalf of the BNC Consortium. All rights in the texts cited are reserved.
BNC = The British National Corpus, version 3 (BNC XML edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk
CEEC = Corpora of Early English Correspondence. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg et al. at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/
Mitzenmacher, Michael & Eli Upfal. 2005. Probability and computing: randomized algorithms and probabilistic analysis. Cambridge: Cambridge University Press.
Säily, Tanja & Jukka Suomela. 2016. types2: Exploring word-frequency differences in corpora. https://jukkasuomela.fi/types2-d2e/
Saito, Mutsuo & Makoto Matsumoto. 2008. SIMD-oriented fast Mersenne twister: a 128-bit pseudorandom number generator. In Alexander Keller, Stefan Heinrich & Harald Niederreiter (eds.), Proc. 7th international conference on Monte Carlo and quasi-Monte Carlo methods in scientific computing (MCQMC 2006), 607–622. Berlin: Springer. doi:10.1007/978-3-540-74496-2_36
Suomela, Jukka. 2016. types2: type and hapax accumulation curves. Computer program. https://jukkasuomela.fi/types2/