The Works of Wellyson Freitas

Regular Expressions for Bible References

A regular expression, often shortened to "regex", is a powerful tool for pattern matching in text. In simple terms, it is a sequence of characters that defines a search pattern. The pattern is written in a special syntax, which can make the work quite complex.

A good exercise with regular expressions is to find a minimal solution that defines the pattern of the sophisticated reference system used for the Bible. This has many applications, such as automatically tagging the references in a given text. The goal of this article is to support Brazilian Portuguese references.

First, a general form was defined. Then, a minimal set of test cases was enumerated using the testing techniques of Boundary Value Analysis and Equivalence Partitioning. The problem was divided in two: (1) the book pattern and (2) the numbering pattern. The solution for each was developed following the Test-Driven Development methodology. Finally, the complete regular expression was built, tested, and analysed for improvements.

Development used JavaScript on the Node.js runtime environment, with the Jest framework for testing. These technologies were chosen for simplicity.

Definition of general form

The Bible is an ancient collection of religious documents that were aggregated into books. Biblical chapters were introduced by Stephen Langton in 1227. Later, verses were introduced by Robert Stephanus in 1551. Biblical references rely on these levels to locate parts of the holy text.

These parts can be understood as intervals at the smallest level, i.e., intervals of verses. Even a single verse is, in this view, a degenerate interval. The concept of intervals helps to define a general form for references.

Biblical references are just a set of intervals:

INTERVAL [, INTERVAL...]

An interval is composed of left and right endpoints, which coincide in the case of a degenerate interval. Each endpoint references a unit at the smallest level, i.e., a verse. When the verse or chapter is missing, the first one is assumed.

INTERVAL := [LEFT_UNIT, RIGHT_UNIT]

To locate a unit, three pieces of information are required: book, chapter, and verse. The chapter is sometimes omitted when the book has only one.
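The general form could be modeled with plain objects. The sketch below is only an illustration under stated assumptions: the helper names `makeUnit` and `makeInterval` are hypothetical, and defaulting a missing chapter or verse to 1 follows the rule described above.

```javascript
// Hypothetical helpers illustrating the general form; not part of the original solution.

// A unit locates a verse: book, chapter, and verse.
// A missing chapter or verse defaults to the first one.
function makeUnit(book, chapter = 1, verse = 1) {
  return { book, chapter, verse };
}

// An interval has left and right endpoints. A single verse is a
// degenerate interval whose endpoints coincide.
function makeInterval(left, right = left) {
  return { left, right };
}

// A reference is a list of intervals: INTERVAL [, INTERVAL...]
const reference = [
  makeInterval(makeUnit('João', 3, 16)),
  makeInterval(makeUnit('João', 3, 18), makeUnit('João', 3, 21)),
];

console.log(JSON.stringify(reference, null, 2));
```

Keeping both endpoints even for a single verse means every consumer of the structure can treat all references uniformly.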

Matching Bible books in Portuguese

The 66 books of the Protestant Bible were considered for this work, because the most widely read Bible version in Brazil is the Protestant translation by João Ferreira de Almeida (1628-1691). Other books, such as the Deuterocanonical books, could also be added.

Each book has an identifier. Portuguese identifiers were collected from the Bible Society of Brazil, owner of Almeida's translation. Books are frequently identified as shown in Table 1.

Table 1. Portuguese identifiers for Biblical books

USFM | EN title | PT abbr. | PT titles
GEN | Genesis | Gn | Gênesis
EXO | Exodus | Êx | Êxodo
LEV | Leviticus | Lv | Levítico
NUM | Numbers | Nm | Números
DEU | Deuteronomy | Dt | Deuteronômio
JOS | Joshua | Js | Josué
JDG | Judges | Jz | Juízes
RUT | Ruth | Rt | Rute
1SA | 1 Samuel | 1 Sm | 1 Samuel
2SA | 2 Samuel | 2 Sm | 2 Samuel
1KI | 1 Kings | 1 Rs | 1 Reis
2KI | 2 Kings | 2 Rs | 2 Reis
1CH | 1 Chronicles | 1 Cr | 1 Crônicas
2CH | 2 Chronicles | 2 Cr | 2 Crônicas
EZR | Ezra | Ed | Esdras
NEH | Nehemiah | Ne | Neemias
EST | Esther | Et | Ester
JOB | Job | Jó | Jó
PSA | Psalms | Sl | Salmos
PRO | Proverbs | Pv | Provérbios
ECC | Ecclesiastes | Ec | Eclesiastes
SNG | Song of Songs | Ct | Cânticos, Cântico dos Cânticos
ISA | Isaiah | Is | Isaías
JER | Jeremiah | Jr | Jeremias
LAM | Lamentations | Lm | Lamentações
EZK | Ezekiel | Ez | Ezequiel
DAN | Daniel | Dn | Daniel
HOS | Hosea | Os | Oséias
JOL | Joel | Jl | Joel
AMO | Amos | Am | Amós
OBA | Obadiah | Ob | Obadias
JON | Jonah | Jn | Jonas
MIC | Micah | Mq | Miquéias
NAM | Nahum | Na | Naum
HAB | Habakkuk | Hb | Habacuque
ZEP | Zephaniah | Sf | Sofonias
HAG | Haggai | Ag | Ageu
ZEC | Zechariah | Zc | Zacarias
MAL | Malachi | Ml | Malaquias
MAT | Matthew | Mt | Mateus
MRK | Mark | Mc | Marcos
LUK | Luke | Lc | Lucas
JHN | John | Jo | João
ACT | Acts | At | Atos dos Apóstolos, Atos
ROM | Romans | Rm | Romanos
1CO | 1 Corinthians | 1 Co | 1 Coríntios
2CO | 2 Corinthians | 2 Co | 2 Coríntios
GAL | Galatians | Gl | Gálatas
EPH | Ephesians | Ef | Efésios
PHP | Philippians | Fp | Filipenses
COL | Colossians | Cl | Colossenses
1TH | 1 Thessalonians | 1 Ts | 1 Tessalonicenses
2TH | 2 Thessalonians | 2 Ts | 2 Tessalonicenses
1TI | 1 Timothy | 1 Tm | 1 Timóteo
2TI | 2 Timothy | 2 Tm | 2 Timóteo
TIT | Titus | Tt | Tito
PHM | Philemon | Fm | Filemom
HEB | Hebrews | Hb | Hebreus
JAS | James | Tg | Tiago
1PE | 1 Peter | 1 Pe | 1 Pedro
2PE | 2 Peter | 2 Pe | 2 Pedro
1JN | 1 John | 1 Jo | 1 João
2JN | 2 John | 2 Jo | 2 João
3JN | 3 John | 3 Jo | 3 João
JUD | Jude | Jd | Judas
REV | Revelation | Ap | Apocalipse

For example, the following book identifiers start with the letter A: Ag or Ageu (Haggai); Am or Amós (Amos); Ap or Apocalipse (Revelation); At, Atos, or Atos dos Apóstolos (Acts of the Apostles).

.
└───A
    ├───g
    │   └───eu
    ├───m
    │   └───ós
    ├───p
    │   └───ocalipse
    └───t
        └───os
            └─── dos apóstolos
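The tree above suggests one way such an expression could be derived: insert every identifier into a prefix tree and serialize the tree as nested alternations. The following is a minimal sketch of that idea, not necessarily how the expression in this article was produced. The names `buildTrie` and `trieToPattern` are hypothetical, and the output is not minimized (single-letter chains stay wrapped in their own groups).

```javascript
// Sketch: build a prefix tree from identifiers and serialize it as a regex.
// Assumes identifiers contain only letters and spaces (no regex metacharacters).

function buildTrie(words) {
  const root = { children: new Map(), terminal: false };
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), terminal: false });
      }
      node = node.children.get(ch);
    }
    node.terminal = true; // a complete identifier ends at this node
  }
  return root;
}

function trieToPattern(node) {
  const branches = [];
  for (const [ch, child] of node.children) {
    // Spaces are emitted as \s, as in the expression used in this article.
    branches.push((ch === ' ' ? '\\s' : ch) + trieToPattern(child));
  }
  if (branches.length === 0) return '';
  const body = '(?:' + branches.join('|') + ')';
  // If an identifier may end at this node, the continuation is optional.
  return node.terminal ? body + '?' : body;
}

const ids = ['Ag', 'Ageu', 'Am', 'Amós', 'Ap', 'Apocalipse', 'At', 'Atos', 'Atos dos Apóstolos'];
const pattern = trieToPattern(buildTrie(ids));
const regex = new RegExp('^' + pattern + '$');

console.log(pattern);
```

Marking a node as terminal is what separates a valid abbreviation such as Ag from a bare prefix such as A.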

Note that while these book identifiers should match, partial names should not. In the previous example, the letter A alone does not reference any book. Both cases need to be tested.
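Both kinds of cases can be checked with two small lists, one of identifiers that must match and one of partial names that must not. The sketch below uses only the A branch of the book expression given in this article. The `^` and `$` anchors, the helper `isBook`, and the specific negative cases are assumptions made for illustration.

```javascript
// A branch of the book expression, anchored so that the whole input must match.
const A_BOOKS = /^A(?:g(?:eu)?|m(?:ós)?|p(?:ocalipse)?|t(?:os(?:\sdos\sApóstolos)?)?)$/;

// Hypothetical helper: true when the text is exactly one book identifier.
function isBook(text) {
  return A_BOOKS.test(text);
}

// Equivalence class 1: complete identifiers (abbreviations and titles).
const shouldMatch = [
  'Ag', 'Ageu', 'Am', 'Amós', 'Ap', 'Apocalipse',
  'At', 'Atos', 'Atos dos Apóstolos',
];

// Equivalence class 2: partial names just short of a valid identifier.
const shouldNotMatch = ['A', 'Age', 'Amó', 'Apoc', 'Ato', 'Atos dos'];

// Collect every case that behaves unexpectedly.
const failures = [
  ...shouldMatch.filter((id) => !isBook(id)),
  ...shouldNotMatch.filter((id) => isBook(id)),
];

console.log(failures.length === 0 ? 'all cases pass' : failures);
```

In a Jest suite, the same two lists could feed `test.each` so that every identifier is reported as its own test case.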

Numbered books are a special case: their identifiers begin with a number, as in 1 Samuel and 3 João. Both Hindu-Arabic and Roman numerals are in common use.
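As an illustration of the numbered case, the sketch below splits an identifier into its ordinal and its name, accepting either numeral style. The helper `parseNumberedBook` and the `ROMAN` lookup table are hypothetical. The list of names is restricted to the numbered books so that a title such as Isaías is not mistaken for "I" followed by a name.

```javascript
// Numeral (Hindu-Arabic or Roman), optional whitespace, then a numbered book name.
const NUMBERED =
  /^(1|2|3|III|II|I)\s?(C(?:o(?:ríntios)?|r(?:ônicas)?)|Jo(?:ão)?|Pe(?:dro)?|R(?:eis|s)|S(?:amuel|m)|T(?:essalonicenses|imóteo|m|s))$/;

// Hypothetical lookup table for the Roman numerals that can occur.
const ROMAN = { I: 1, II: 2, III: 3 };

function parseNumberedBook(text) {
  const match = NUMBERED.exec(text);
  if (match === null) return null;
  const [, numeral, name] = match;
  const ordinal = ROMAN[numeral] ?? Number(numeral);
  // Only João (John) has a third book.
  if (ordinal === 3 && !/^Jo(?:ão)?$/.test(name)) return null;
  return { ordinal, name };
}

console.log(parseNumberedBook('II Samuel'));
console.log(parseNumberedBook('3 João'));
console.log(parseNumberedBook('Isaías'));
```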

The following regular expression is the minimal solution for this problem:

/(?:(?:(?:1|2|I|II)\s?(?:C(?:o(?:ríntios)?|r(?:ônicas)?)|Jo(?:ão)?|Pe(?:dro)?|R(?:eis|s)|S(?:amuel|m)|T(?:essalonicenses|imóteo|m|s)))|(?:(?:3|III)\s?Jo(?:ão)?)|A(?:g(?:eu)?|m(?:ós)?|p(?:ocalipse)?|t(?:os(?:\sdos\sApóstolos)?)?)|(?:C(?:antares|l|olossenses|t|ântico(?:s|\sdos\sCânticos)?))|(?:D(?:aniel|euteronômio|n|t))|(?:E(?:c(?:lesiastes)?|d|f(?:ésios)?|sdras|ster|t|z(?:equiel)?))|(?:F(?:i(?:lipenses|lemom)|m|p))|(?:G(?:l|n|álatas|ênesis))|(?:H(?:a(?:bacuque)?|b|ebreus))|(?:Is(?:aías)?)|(?:J(?:d|eremias|l|n|o(?:el|nas|sué|ão)?|r|s|u(?:das|ízes)|z|ó))|(?:L(?:amentações|c|evítico|m|ucas|v))|(?:M(?:a(?:laquias|rcos|teus)|c|iquéias|l|q|t))|(?:N(?:a(?:um)?|e(?:emias)?|m|úmeros))|(?:O(?:b(?:adias)?|s(?:éias)?))|(?:P(?:rovérbios|v))|(?:R(?:m|omanos|t|ute))|(?:S(?:almos|f|l|ofonias))|(?:T(?:g|i(?:to|ago)|t))|(?:Z(?:acarias|c))|(?:Êx(?:odo)?))/

Matching numberings

The second part, the numbering, is more complex. The following forms were considered:

Form | Description | Use | Example
A | Positive integer | Chapter or verse | 1
B | Two positive integers separated by a hyphen | Verse interval | 1-2
C | Two positive integers separated by a colon or point | Verse within a chapter | 1:2 or 1.2
D | Form C extended by a hyphen and a positive integer | Verse interval within a chapter | 1:2-4 or 1.2-4
E | Form A extended by a long dash and a positive integer | Chapter interval | 1—2
F | Form A extended by a long dash and Form C | Verse interval crossing chapters | 1—2:3
G | Form C extended by a long dash and a positive integer | Verse interval crossing chapters | 1:2—2
H | Form C extended by a long dash and Form C | Verse interval crossing chapters | 1:2—2:3

The following regular expression is the minimal solution for this problem:

/(\d+)(?:[:.](\d+))?(?:(?:-(\d+))|(?:(?:--|–|—)(?:(\d+)[:.])?(\d+)))?/
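To see how the forms A to H map onto the capture groups, the sketch below uses a variant of the numbering expression. The variant is an assumption made for illustration: it adds named groups and anchors, accepts both colon and point as in Form C, and writes the long dashes as Unicode escapes. The function name `classifyNumbering` and the group names are hypothetical.

```javascript
// c1/v1: left chapter and verse; v2: right verse after a hyphen;
// c2/n2: optional right chapter and final number after a long dash.
// \u2013 is the en dash and \u2014 is the em dash.
const NUMBERING =
  /^(?<c1>\d+)(?:[:.](?<v1>\d+))?(?:-(?<v2>\d+)|(?:--|\u2013|\u2014)(?:(?<c2>\d+)[:.])?(?<n2>\d+))?$/;

// Returns the form letter (A to H) for a numbering, or null if it does not match.
function classifyNumbering(text) {
  const match = NUMBERING.exec(text);
  if (match === null) return null;
  const { v1, v2, c2, n2 } = match.groups;
  const hasVerse = v1 !== undefined;
  if (v2 !== undefined) return hasVerse ? 'D' : 'B'; // hyphen range
  if (n2 === undefined) return hasVerse ? 'C' : 'A'; // no range at all
  if (c2 !== undefined) return hasVerse ? 'H' : 'F'; // long dash, right side has a chapter
  return hasVerse ? 'G' : 'E';                       // long dash, single number on the right
}

for (const sample of ['1', '1-2', '1:2', '1.2-4', '1\u20142', '1\u20142:3', '1:2\u20142', '1:2\u20142:3']) {
  console.log(sample, classifyNumbering(sample));
}
```

Returning only the form letter keeps the sketch neutral about interpretation, for example whether a lone number denotes a chapter or a verse.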

Conclusion

Regular expressions like these have numerous applications. They could be used to build plugins that automatically display biblical content based on a reference. Some products, such as RefTagger, already do this, but so far no such tool supports Portuguese.
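As a rough sketch of such a plugin's core, the code below wraps every reference found in a text with an HTML tag. Everything here is an assumption made for illustration: the book list is reduced to three books, a whitespace character is required between book and numbering, the `bible-ref` class name is invented, and the lookbehind that prevents matches inside longer words is one possible choice among several.

```javascript
// Reduced book list for illustration only; a full book expression would go here.
const BOOK = 'Gênesis|Gn|João|Jo|Salmos|Sl';

// Numbering without capture groups: chapter, optional verse, optional range.
const NUMBERING =
  '\\d+(?:[:.]\\d+)?(?:-\\d+|(?:--|\\u2013|\\u2014)(?:\\d+[:.])?\\d+)?';

// The lookbehind rejects matches that start in the middle of a word or number.
const REFERENCE = new RegExp(
  `(?<![\\p{L}\\d])(?:${BOOK})\\s${NUMBERING}`,
  'gu'
);

// Hypothetical tagger: wraps each reference in an anchor element.
function tagReferences(text) {
  return text.replace(REFERENCE, (ref) => `<a class="bible-ref">${ref}</a>`);
}

console.log(tagReferences('Leia João 3:16 e Sl 23.'));
```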