Regular Expressions
for Bible References
A regular expression, often shortened to "regex", is a powerful tool for pattern matching in text. In simple terms, it is a sequence of characters that defines a search pattern. The pattern is written in a special syntax, which can make the work quite complex.
A good exercise with regular expressions is to find a minimal pattern that describes the sophisticated reference system of the Bible. Such a pattern has many applications, for example automatically tagging the references found in a given text. The goal of this article is to support references written in Brazilian Portuguese.
First, a general form was defined. Then, a minimal set of test cases was enumerated using two testing techniques, Boundary Value Analysis and Equivalence Partitioning. The problem was divided into two parts: (1) the book pattern and (2) the numbering pattern. The solution for each part was developed following the Test-Driven Development methodology. Finally, the complete regular expression was built, tested, and analysed for improvements.
Development used JavaScript on the Node.js runtime environment, with the Jest framework for testing. These technologies were chosen for simplicity.
Definition of general form
The Bible is an ancient collection of religious documents that were aggregated into books. Biblical chapters were introduced by Stephen Langton in 1227. Later, verses were introduced by Robert Estienne (Stephanus) in 1551. Biblical references rely on these three levels to locate parts of the holy text.
These parts can be understood as intervals at the smallest level, i.e., intervals of verses. A single verse is then a degenerate interval. The concept of intervals helps to define a general form for references.
Biblical references are just a set of intervals:
INTERVAL [, INTERVAL...]
An interval is composed of a left and a right endpoint, which coincide in the case of a degenerate interval. Each endpoint references a unit at the smallest level, i.e., a verse. When the verse or chapter is missing, the first one is assumed.
INTERVAL := [LEFT_UNIT, RIGHT_UNIT]
To locate a unit, three pieces of information are required: book, chapter, and verse. In books that have only one chapter, the chapter is sometimes omitted.
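The general form can be sketched as a small data model. This is an illustration only: the names `unit`, `interval`, and `isDegenerate` are hypothetical and are not taken from the implementation behind this article. The sketch assumes that a missing chapter or verse defaults to the first one, as described above.

```javascript
// Hypothetical helper names; a sketch of the general form only.

// A unit is located by book, chapter, and verse.
// A missing chapter or verse falls back to the first one.
function unit(book, chapter = 1, verse = 1) {
  return { book, chapter, verse };
}

// An interval has a left and a right endpoint.
// Without a right endpoint the interval is degenerate (a single verse).
function interval(left, right = left) {
  return { left, right };
}

// True when both endpoints point at the same verse.
function isDegenerate({ left, right }) {
  return (
    left.book === right.book &&
    left.chapter === right.chapter &&
    left.verse === right.verse
  );
}

// A reference is a list of intervals: INTERVAL [, INTERVAL...]
const reference = [
  interval(unit("Jo", 3, 16)),                    // single verse
  interval(unit("Sl", 23, 1), unit("Sl", 23, 6)), // verse interval
];

console.log(reference.map(isDegenerate)); // [ true, false ]
```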
Matching Bible books in Portuguese
The 66 books of the Protestant Bible were considered for this work, because the most widely read Bible version in Brazil is the Protestant translation by João Ferreira de Almeida (1628-1691). Other books, such as the Deuterocanonical books, could also be added.
Each book has an identifier. The Portuguese identifiers were collected from the Bible Society of Brazil, owner of Almeida's translation. Books are frequently identified as shown in Table 1.
Table 1. Portuguese identifiers for Biblical books
USFM | EN title | PT abbr. | PT titles |
---|---|---|---|
GEN | Genesis | Gn | Gênesis |
EXO | Exodus | Êx | Êxodo |
LEV | Leviticus | Lv | Levítico |
NUM | Numbers | Nm | Números |
DEU | Deuteronomy | Dt | Deuteronômio |
JOS | Joshua | Js | Josué |
JDG | Judges | Jz | Juízes |
RUT | Ruth | Rt | Rute |
1SA | 1 Samuel | 1 Sm | 1 Samuel |
2SA | 2 Samuel | 2 Sm | 2 Samuel |
1KI | 1 Kings | 1 Rs | 1 Reis |
2KI | 2 Kings | 2 Rs | 2 Reis |
1CH | 1 Chronicles | 1 Cr | 1 Crônicas |
2CH | 2 Chronicles | 2 Cr | 2 Crônicas |
EZR | Ezra | Ed | Esdras |
NEH | Nehemiah | Ne | Neemias |
EST | Esther | Et | Ester |
JOB | Job | Jó | Jó |
PSA | Psalms | Sl | Salmos |
PRO | Proverbs | Pv | Provérbios |
ECC | Ecclesiastes | Ec | Eclesiastes |
SNG | Song of Songs | Ct | Cânticos, Cântico dos Cânticos |
ISA | Isaiah | Is | Isaías |
JER | Jeremiah | Jr | Jeremias |
LAM | Lamentations | Lm | Lamentações |
EZK | Ezekiel | Ez | Ezequiel |
DAN | Daniel | Dn | Daniel |
HOS | Hosea | Os | Oséias |
JOL | Joel | Jl | Joel |
AMO | Amos | Am | Amós |
OBA | Obadiah | Ob | Obadias |
JON | Jonah | Jn | Jonas |
MIC | Micah | Mq | Miquéias |
NAM | Nahum | Na | Naum |
HAB | Habakkuk | Hb | Habacuque |
ZEP | Zephaniah | Sf | Sofonias |
HAG | Haggai | Ag | Ageu |
ZEC | Zechariah | Zc | Zacarias |
MAL | Malachi | Ml | Malaquias |
MAT | Matthew | Mt | Mateus |
MRK | Mark | Mc | Marcos |
LUK | Luke | Lc | Lucas |
JHN | John | Jo | João |
ACT | Acts | At | Atos dos Apóstolos, Atos |
ROM | Romans | Rm | Romanos |
1CO | 1 Corinthians | 1 Co | 1 Coríntios |
2CO | 2 Corinthians | 2 Co | 2 Coríntios |
GAL | Galatians | Gl | Gálatas |
EPH | Ephesians | Ef | Efésios |
PHP | Philippians | Fp | Filipenses |
COL | Colossians | Cl | Colossenses |
1TH | 1 Thessalonians | 1 Ts | 1 Tessalonicenses |
2TH | 2 Thessalonians | 2 Ts | 2 Tessalonicenses |
1TI | 1 Timothy | 1 Tm | 1 Timóteo |
2TI | 2 Timothy | 2 Tm | 2 Timóteo |
TIT | Titus | Tt | Tito |
PHM | Philemon | Fm | Filemom |
HEB | Hebrews | Hb | Hebreus |
JAS | James | Tg | Tiago |
1PE | 1 Peter | 1 Pe | 1 Pedro |
2PE | 2 Peter | 2 Pe | 2 Pedro |
1JN | 1 John | 1 Jo | 1 João |
2JN | 2 John | 2 Jo | 2 João |
3JN | 3 John | 3 Jo | 3 João |
JUD | Jude | Jd | Judas |
REV | Revelation | Ap | Apocalipse |
For example, the following book identifiers start with the letter A: Ag or Ageu (Haggai); Am or Amós (Amos); Ap or Apocalipse (Revelation); At, Atos, or Atos dos Apóstolos (Acts of the Apostles). They can be arranged as a prefix tree:
.
└───A
    ├───g
    │   └───eu
    ├───m
    │   └───ós
    ├───p
    │   └───ocalipse
    └───t
        └───os
            └─── dos Apóstolos
It is important to note that, while these full book identifiers should match, partial names should not. In the previous example, the letter A alone does not reference any book. Both cases need to be tested.
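One possible way to obtain such a pattern is to build the prefix tree and serialize it into nested non-capturing groups. The following is a minimal sketch under that assumption. The function names `buildTrie`, `escapeChar`, and `trieToPattern` are hypothetical, and the pattern it emits is not necessarily identical to the one used in this article. The result is anchored with `^` and `$` so that both full identifiers and partial names can be tested.

```javascript
// Build a prefix tree (trie) from a list of identifiers.
// Each node maps a character to a child; `end` marks a complete identifier.
function buildTrie(words) {
  const root = { children: new Map(), end: false };
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), end: false });
      }
      node = node.children.get(ch);
    }
    node.end = true;
  }
  return root;
}

// Escape characters that are special in a regular expression.
function escapeChar(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Serialize a trie node into a regex fragment.
function trieToPattern(node) {
  const branches = [...node.children].map(
    ([ch, child]) => escapeChar(ch) + trieToPattern(child)
  );
  if (branches.length === 0) return "";
  const group = "(?:" + branches.join("|") + ")";
  // If an identifier ends at this node, whatever follows is optional.
  return node.end ? group + "?" : group;
}

const ids = [
  "Ag", "Ageu", "Am", "Amós", "Ap", "Apocalipse",
  "At", "Atos", "Atos dos Apóstolos",
];
const bookA = new RegExp("^" + trieToPattern(buildTrie(ids)) + "$");

console.log(bookA.test("Ageu")); // true
console.log(bookA.test("A"));    // false (partial name)
```

Generating the pattern from the list keeps the identifiers as the single source of truth, at the cost of a less readable expression.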
Numbered books are a special case: their identifiers begin with a number, as in 1 Samuel and 3 João. Both Hindu-Arabic and Roman numerals are commonly used.
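The numeral prefix can be handled by a small alternation placed before the book name. The sketch below illustrates this for 1 and 2 Samuel only. The names `numberedBook`, `toArabic`, and `parseNumberedBook` are hypothetical, and the optional whitespace between numeral and name is an assumption of this sketch.

```javascript
// Numeral prefix (Hindu-Arabic or Roman), optional whitespace, then the book.
// Capture groups: 1 = numeral, 2 = book identifier.
const numberedBook = /^(1|2|I|II)\s?(S(?:amuel|m))$/;

// Roman numerals mapped to their Hindu-Arabic value (only I to III are needed).
const toArabic = { I: 1, II: 2, III: 3 };

// Return the book number and identifier, or null when the text does not match.
function parseNumberedBook(text) {
  const match = numberedBook.exec(text);
  if (!match) return null;
  const [, numeral, book] = match;
  return { number: toArabic[numeral] ?? Number(numeral), book };
}

console.log(parseNumberedBook("1 Sm"));      // { number: 1, book: 'Sm' }
console.log(parseNumberedBook("II Samuel")); // { number: 2, book: 'Samuel' }
console.log(parseNumberedBook("3 Samuel"));  // null
```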
The following regular expression is the minimal solution for this problem:
/(?:(?:(?:1|2|I|II)\s?(?:C(?:o(?:ríntios)?|r(?:ônicas)?)|Jo(?:ão)?|Pe(?:dro)?|R(?:eis|s)|S(?:amuel|m)|T(?:essalonicenses|imóteo|m|s)))|(?:(?:3|III)\s?Jo(?:ão)?)|A(?:g(?:eu)?|m(?:ós)?|p(?:ocalipse)?|t(?:os(?:\sdos\sApóstolos)?)?)|(?:C(?:antares|l|olossenses|t|ântico(?:s|\sdos\sCânticos)?))|(?:D(?:aniel|euteronômio|n|t))|(?:E(?:c(?:lesiastes)?|d|f(?:ésios)?|sdras|ster|t|z(?:equiel)?))|(?:F(?:i(?:lipenses|lemom)|m|p))|(?:G(?:l|n|álatas|ênesis))|(?:H(?:a(?:bacuque)?|b|ebreus))|(?:Is(?:aías)?)|(?:J(?:d|eremias|l|n|o(?:el|nas|sué|ão)?|r|s|u(?:das|ízes)|z|ó))|(?:L(?:amentações|c|evítico|m|ucas|v))|(?:M(?:a(?:laquias|rcos|teus)|c|iquéias|l|q|t))|(?:N(?:a(?:um)?|e(?:emias)?|m|úmeros))|(?:O(?:b(?:adias)?|s(?:éias)?))|(?:P(?:rovérbios|v))|(?:R(?:m|omanos|t|ute))|(?:S(?:almos|f|l|ofonias))|(?:T(?:g|i(?:to|ago)|t))|(?:Z(?:acarias|c))|(?:Êx(?:odo)?))/
Matching numberings
The second part, the numbering pattern, is more complex. It has to cover the following forms:
Form | Description | Use | Example |
---|---|---|---|
A | Positive integer | Chapter or verse | 1 |
B | Two positive integers separated by a hyphen | Verse interval | 1-2 |
C | Two positive integers separated by a colon or point | Verse | 1:2 or 1.2 |
D | Form C extended by a hyphen and a positive integer | Verse interval within a chapter | 1:2-4 or 1.2-4 |
E | Form A extended by a long dash and a positive integer | Chapter interval | 1—2 |
F | Form A extended by a long dash and form C | Verse interval crossing chapters | 1—2:3 |
G | Form C extended by a long dash and a positive integer | Verse interval crossing chapters | 1:2—2 |
H | Form C extended by a long dash and form C | Verse interval crossing chapters | 1:2—2:3 |
The following regular expression is the minimal solution for this problem:
(\d+)(?::(\d+))?(?:(?:-(\d+))|(?:(--|–|—)(?:(\d+):)?(\d+)))?
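To see how the capture groups line up with the forms, the sketch below applies the numbering pattern to one example per form. The pattern is written with a well-formed long-dash group and anchored with `^` and `$`. The function `parseNumbering` and the property names are hypothetical. Note that, as written, the pattern accepts only the colon as chapter-verse separator, so the point variants (1.2, 1.2-4) are not covered here.

```javascript
// Numbering pattern, anchored for whole-string tests.
// Groups: 1 = first number, 2 = verse after the colon, 3 = number after "-",
//         4 = long dash ("--", "–" or "—"), 5 = right chapter, 6 = last number.
const numbering =
  /^(\d+)(?::(\d+))?(?:(?:-(\d+))|(?:(--|–|—)(?:(\d+):)?(\d+)))?$/;

// Return the captured numbers as integers (undefined when a part is absent),
// or null when the text is not a valid numbering.
function parseNumbering(text) {
  const m = numbering.exec(text);
  if (!m) return null;
  const num = (s) => (s === undefined ? undefined : Number(s));
  return {
    first: num(m[1]),
    verse: num(m[2]),
    hyphenEnd: num(m[3]),
    longDash: m[4],
    rightChapter: num(m[5]),
    rightLast: num(m[6]),
  };
}

// One example per form, A to H; each should be accepted.
const forms = ["1", "1-2", "1:2", "1:2-4", "1—2", "1—2:3", "1:2—2", "1:2—2:3"];
for (const form of forms) {
  console.log(form, parseNumbering(form) !== null);
}
```

Keeping the short and long dash in separate branches means a caller can tell a verse interval (group 3) from an interval that names a second chapter (groups 5 and 6).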
Conclusion
Regular expressions like these have numerous applications. They could be used to build plugins that automatically display biblical content based on a reference. Some products, such as RefTagger, already do this, but so far no such tool supports Portuguese.
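As an illustration of the tagging use case, the sketch below wraps references in a `<ref>` tag using `String.prototype.replace`. Everything here is an assumption made for the example: the tag name, the `tagReferences` function, and a deliberately reduced book pattern (only João and Salmos) combined with a simplified numbering pattern. A real tool would plug in the full expressions from the previous sections.

```javascript
// Reduced book pattern, for illustration only (João and Salmos).
const book = "(?:Jo(?:ão)?|S(?:almos|l))";
// Chapter, optional verse, optional hyphenated end verse.
const numbers = "\\d+(?::\\d+)?(?:-\\d+)?";
// The lookbehind rejects matches that start inside a longer word or number.
const referencePattern = new RegExp(
  `(?<![\\p{L}\\d])${book}\\s${numbers}`,
  "gu"
);

// Wrap every reference found in the text in a hypothetical <ref> tag.
function tagReferences(text) {
  return text.replace(referencePattern, (match) => `<ref>${match}</ref>`);
}

console.log(tagReferences("Leia Jo 3:16 e Salmos 23:1-6."));
// Leia <ref>Jo 3:16</ref> e <ref>Salmos 23:1-6</ref>.
```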