Requirements Analysis: Parsing and XHTML compilation of SPC section 4.8 Undesirable Effects

This is an analysis of the structure of SPC section 4.8 (UNDESIRABLE EFFECTS) based on the actual contents of real SPC:s. The purpose of this analysis is to understand and implement algorithms for extracting information from section 4.8 and indexing that information in an optimal way.

The current main reason for implementing this is to be able to compile nice XHTML versions of the tables on undesirable effects that can be found in section 4.8.

About SPC Section 4.8

The SPC section 4.8 has the title UNDESIRABLE EFFECTS and lists known undesirable side effects for the drug in question. Undesirable (side) effects are often called Adverse Drug Reactions (ADR). We will use that acronym henceforth.

Some of these ADRs have been observed in the pre-marketing, pre-approval clinical testing performed by the drug manufacturer. These ADRs are those originally listed in the first approved and registered SPC. However, after a drug has been more widely used, new ADRs may be observed. These new ADRs must be reported and the SPC for the drug in question must be updated. It is the responsibility of EMEA and national drug approval agencies to analyze ADR reports and determine whether they describe already known ADRs or new ones that require the SPC:s in question to be updated.

The text structure in section 4.8 is not formalized, but it does have a sort of semi-formal structure and uses a set of established terms. The typical layout is like this:

4.8 UNDESIRABLE EFFECTS

<A few paragraphs of text describing the general types of ADRs this drug may cause.>

<Occasionally a paragraph describing how the frequency of ADRs is classified, e.g. '''very rare''', '''rare''' and '''common'''.>

<A table listing a) System organ class of the ADR, b) Frequency class and c) MDR description>

<A few paragraphs commenting on the table or on other aspects of undesirable effects.>

The above layout is representative. In some cases the listing of undesirable effects may be embedded in free text paragraphs, which makes it nearly impossible to extract a tabular representation of the undesirable effects. The System organ class is a term from the MedDRA Terminology. In some cases non-standard MedDRA terms or alternative system organ class terms are used.

Some 4.8 sections have two tables. Between 15 and 20 text layout patterns have been identified.

General implementation notes

We begin by trying to analyze the SPC section 4.8 text stored in the index file. We have assumed that there exists a limited number (<20) of text layout patterns for the 4.8 sections. Basic algorithm:

  1. Script iterates over all index files (.py) in a directory and for each index file (.py) checks if it is a spc type index file.
  1. Run one pass over the section 4.8 text:

2.1. Parse text into paragraphs, removing unnecessary whitespace characters (new line, tab).

2.2. Identify the start of table layouts.

2.3. For each identified table layout, invoke a table parser (function or module) that returns a fully parsed table.

  1. Generate a Python data dictionary that represents the 4.8 section in the form of a list of paragraphs (Python strings) and tables (Python dictionaries).
  1. Store the data dictionary in the index file at [ 'entries' ] [ '4.8' ] [ 'data' ].
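Step 2.1 above can be sketched roughly as follows. This is only an illustration, assuming Python 3 and assuming that paragraphs are separated by one or more blank lines; the function name split_paragraphs is hypothetical and not taken from the actual script.

```python
import re


def split_paragraphs(text):
    """Split raw section text into a list of whitespace-normalized paragraphs."""
    paragraphs = []
    # A paragraph boundary is assumed to be one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        # Collapse newlines, tabs and repeated spaces into single spaces.
        cleaned = " ".join(block.split())
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


sample = "4.8 UNDESIRABLE EFFECTS\n\nThe most common\tADRs were\nheadache and nausea.\n\n\n"
for paragraph in split_paragraphs(sample):
    print(repr(paragraph))
```

Normalizing whitespace before table detection would destroy the line structure that the table parsers rely on, so in practice a step like this would presumably only be applied to text that has not been claimed by a table parser.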

Since there is no formal specification for the text layout in section 4.8, and since there was no a priori information on the layouts in use, we needed a heuristic approach for identifying and parsing the different layouts. By browsing a few section 4.8 texts, four basic patterns were identified. Code for identifying these was implemented in a small script, which also reported all text layouts that it couldn't identify. These were then analyzed for new patterns, and code was iteratively added to the script in order to identify the new patterns. Eventually the script could handle ~99% of the files.

Implementation notes: Parsing

We implement a landmark/keyword-based parser for the following semi-formal grammar:

section_48 => [paragraph|table]+
paragraph => [line'\w']+empty_line+
table => (table_layout_1|table_layout_2|...|table_layout_n)empty_line+
empty_line => '\n'
table_layout_1 => "Table 1: " ...
table_layout_2 => "Table 1"'\n' ...
...
table_layout_n => "MedDRA System Organ Class" ...

Where [] denotes a list, '\w' denotes a whitespace character (' ', '\n', '\t') and '\n' denotes a newline character. Anything within double quotes (") is interpreted literally. Note that we don't define the grammar for the individual table layouts here!
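As an illustration of the landmark idea, the three table layouts spelled out in the grammar could be recognized with one pattern each, matched at the start of a line. This is a minimal sketch assuming Python 3; the names LANDMARKS and identify_layout are hypothetical, and the layouts elided by "..." in the grammar are not filled in.

```python
import re

# One landmark per table layout named in the grammar. Each pattern is
# matched at the start of a line that still carries its trailing '\n'.
LANDMARKS = [
    ("table_layout_1", re.compile(r"Table 1: ")),
    ("table_layout_2", re.compile(r"Table 1\n")),
    ("table_layout_n", re.compile(r"MedDRA System Organ Class")),
]


def identify_layout(line):
    """Return the name of the first layout whose landmark starts 'line', else None."""
    for layout, pattern in LANDMARKS:
        if pattern.match(line):
            return layout
    return None


text = "Some introductory paragraph.\n\nTable 1\nSystem organ class ...\n"
for line in text.splitlines(True):  # True keeps the line endings
    print(repr(line), "->", identify_layout(line))
```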

It is important that it is easy to add procedures for parsing new table layouts. This is achieved by implementing the parse procedures for table layouts as separate Python modules with the following interface:

def name():
        """Name of the semi-structured text specimens that this parser can parse."""

def specimen_identified(lines):
        """Is a table header identified at the first line of 'lines'?"""

def parsed_specimen(lines):
        """Parse 'lines' and return a tuple with the parsed data and the number
           of lines consumed."""

The table layout parser modules (TLPM) must all implement the above interface and are located in a dedicated directory, which is scanned by the main parser procedure upon invocation.
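The plugin mechanism could look roughly like the sketch below. It is not the actual implementation: it assumes Python 3, the helper names load_parsers and parse_section are hypothetical, and the toy module written to a temporary directory (with its '|' separated cells) is invented purely to make the example runnable.

```python
import importlib.util
import os
import tempfile

REQUIRED = ("name", "specimen_identified", "parsed_specimen")

# Source of a toy TLPM, used only for this demonstration.
TOY_PARSER = '''
def name():
    return "toy-table-1-colon"

def specimen_identified(lines):
    return bool(lines) and lines[0].startswith("Table 1:")

def parsed_specimen(lines):
    rows, consumed = [], 1  # the header line counts as consumed
    for line in lines[1:]:
        if not line.strip():
            break
        rows.append([cell.strip() for cell in line.split("|")])
        consumed += 1
    return rows, consumed
'''


def load_parsers(directory):
    """Import every .py file in 'directory' that implements the TLPM interface."""
    parsers = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".py"):
            continue
        path = os.path.join(directory, filename)
        spec = importlib.util.spec_from_file_location(filename[:-3], path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if all(callable(getattr(module, attr, None)) for attr in REQUIRED):
            parsers.append(module)
    return parsers


def parse_section(lines, parsers):
    """Walk 'lines' once, handing table starts to the first parser that claims them."""
    items, paragraph, i = [], [], 0

    def flush():
        if paragraph:
            items.append({"type": "paragraph", "data": " ".join(paragraph)})
            del paragraph[:]

    while i < len(lines):
        parser = next((p for p in parsers if p.specimen_identified(lines[i:])), None)
        if parser is not None:
            flush()
            data, consumed = parser.parsed_specimen(lines[i:])
            items.append({"type": "table", "data": data, "parser": parser.name()})
            i += max(consumed, 1)  # always advance, even if a parser reports 0
        else:
            if lines[i].strip():
                paragraph.append(lines[i].strip())
            else:
                flush()  # an empty line ends the current paragraph
            i += 1
    flush()
    return items


with tempfile.TemporaryDirectory() as plugin_dir:
    with open(os.path.join(plugin_dir, "toy_table.py"), "w") as handle:
        handle.write(TOY_PARSER)
    parsers = load_parsers(plugin_dir)

section_lines = [
    "Intro text", "continues here.", "",
    "Table 1: Adverse reactions",
    "Nervous system disorders | Common | Headache",
    "Gastrointestinal disorders | Very common | Nausea",
    "", "Closing remark.",
]
items = parse_section(section_lines, parsers)
for item in items:
    print(item["type"], "->", item["data"])
```

With this design, supporting a new table layout only requires dropping another module into the plugin directory; the main loop does not change.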

Specification: Extracted data structure

The data extracted from the section 4.8 text shall have the following structure:

[{'type': 'paragraph', 'data': 'Paragraph 1 ...', 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 {'type': 'paragraph', 'data': 'Paragraph 2 ...', 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 ...,
 {'type': 'table', 'data': [['Table 1 row 1 & column 1', 'Table 1 row 1 & column 2', ..., 'Table 1 row 1 & column Cn'],
  ['Table 1 row 2 & column 1', 'Table 1 row 2 & column 2', ..., 'Table 1 row 2 & column Cn']], 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 {'type': 'paragraph', 'data': 'Paragraph N ...', 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 ...,
 {'type': 'table', 'data': [['Table 2 row 1 & column 1', 'Table 2 row 1 & column 2', ..., 'Table 2 row 1 & column Cn'],
  ['Table 2 row 2 & column 1', 'Table 2 row 2 & column 2', ..., 'Table 2 row 2 & column Cn']], 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 {'type': 'paragraph', 'data': 'Paragraph M ...', 'parser': <PARSER_NAME>, 'script': <PARSER_SCRIPT>},
 ...,
]

Where <PARSER_NAME> is a (descriptive) name of the parser that identified the 'table' or 'paragraph' and <PARSER_SCRIPT> is the file name of the parser module.
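A small checker for this structure might look as follows. This is a sketch only, assuming Python 3 (where text is of type str); the names make_item and validate_section, as well as the parser names and script file names in the example data, are hypothetical.

```python
REQUIRED_KEYS = ("type", "data", "parser", "script")


def make_item(item_type, data, parser_name, parser_script):
    """Build one entry of the section 4.8 data list."""
    return {"type": item_type, "data": data,
            "parser": parser_name, "script": parser_script}


def is_table(rows):
    """A table is a list of rows, each row being a list of strings."""
    return isinstance(rows, list) and all(
        isinstance(row, list) and all(isinstance(cell, str) for cell in row)
        for row in rows)


def validate_section(items):
    """Return a list of problem descriptions; an empty list means 'items' is valid."""
    problems = []
    for index, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append("item %d: missing keys %s" % (index, missing))
        elif item["type"] == "paragraph":
            if not isinstance(item["data"], str):
                problems.append("item %d: paragraph data is not a string" % index)
        elif item["type"] == "table":
            if not is_table(item["data"]):
                problems.append("item %d: table data is not a list of string lists" % index)
        else:
            problems.append("item %d: unknown type %r" % (index, item["type"]))
    return problems


section = [
    make_item("paragraph", "Paragraph 1 ...", "example-paragraph", "example_paragraph.py"),
    make_item("table", [["Nervous system disorders", "Common", "Headache"]],
              "example-table", "example_table.py"),
]
print(validate_section(section))
print(validate_section([{"type": "table", "data": "not a table"}]))
```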

Implementation notes: Compiling to XHTML

Once we have a nice Python dictionary representation of section 4.8, we can use it to generate an XHTML representation that looks nice on the web. The basic algorithm is:

For a given index file (.py):

  1. Read the data dictionary in the index file at [ 'entries' ] [ '4.8' ] [ 'data' ].
  1. Generate an XHTML element <div class="spc_section">.
  1. Iterate over the data dictionary and fill the above element with <div class="paragraph"> and <div class="table"> elements based on the contents of each data dictionary item.
  1. Store the XHTML representation in the index file at [ 'entries' ] [ '4.8' ] [ 'div_expanded' ].
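The steps above could be sketched as follows, assuming Python 3. The function name compile_xhtml is hypothetical, and the <table>/<tr>/<td> markup placed inside each <div class="table"> is an assumption, since the inner markup is not specified here. Reading from and writing to the index file is left out.

```python
from xml.sax.saxutils import escape


def compile_xhtml(items):
    """Render a section 4.8 data list as one XHTML <div class="spc_section">."""
    parts = ['<div class="spc_section">']
    for item in items:
        if item["type"] == "paragraph":
            # escape() replaces &, < and > so the text cannot break the markup.
            parts.append('<div class="paragraph">%s</div>' % escape(item["data"]))
        elif item["type"] == "table":
            parts.append('<div class="table"><table>')
            for row in item["data"]:
                cells = "".join("<td>%s</td>" % escape(cell) for cell in row)
                parts.append("<tr>%s</tr>" % cells)
            parts.append("</table></div>")
    parts.append("</div>")
    return "\n".join(parts)


items = [
    {"type": "paragraph", "data": "Frequency < 1/10 & resolved on withdrawal."},
    {"type": "table", "data": [["Nervous system disorders", "Common", "Headache"],
                               ["Gastrointestinal disorders", "Very common", "Nausea"]]},
]
xhtml = compile_xhtml(items)
print(xhtml)
```

Escaping matters in this domain because frequency definitions such as "< 1/10" appear regularly in section 4.8 and would otherwise produce malformed XHTML.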

We also have a CSS stylesheet specifically designed for XHTML based presentation of SPC contents called spc.css, which is used to style the contents of the generated XHTML.

The script 'spc_section_48_compile.py'

The script spc_section_48_compile.py can a) identify the start of a number of different table layouts in SPC section 4.8, b) generate Python dictionary representations of SPC section 4.8 for (currently) one table layout, c) generate XHTML code based on the Python dictionary representation. It also updates the index files both with the Python dictionary representation and with the XHTML representation. The script takes a number of command line parameters (use -h/--help for more information).

The design of the script is based on a main script, 'spc_section_48_compile.py', and a set of plugin parser scripts, each dedicated to recognizing and parsing a single type of table layout. The main script parses the index files and for each index file submits the text contents of section 4.8 to a plugin parser script which a) tries to identify the start of a table and b) can parse the entire table if told to do so and return a Python dictionary representation of it.

Run the script with the -h or --help flag to see possible options:

./spc_section_48_compile.py --help

Note that the script can be used to either just analyze and report statistics on the parsed tables or to analyze and actually update (compile) the index files with data and XHTML presentation code for undesirable effects.

Here is the result of running the script '/var/lib/drugle/scripts/spc_section_48_compile.py' with the summary flag (-s) on all English (-len) SPC:s from EMEA (-demea):

/var/lib/drugle/scripts$ ./spc_section_48_compile.py -len -demea -s
Warning: Parsed only 1 lines of table. This probably means that the parser had problems parsing the table.
  Parser: 'SPC section 4.8 table beginning with the header [u'Very common', u'Common', u'Uncommon', u'Not known']'
  Product: 'Olanzapine Mylan 2.5 mg film-coated tablets'
  Index file: 'H-961-PI-en.0.py'
Warning: Parsed only 1 lines of table. This probably means that the parser had problems parsing the table.
  Parser: 'SPC section 4.8 table beginning with the header [u'Very common', u'Common', u'Uncommon', u'Not known']'
  Product: 'Olanzapine Mylan 2.5 mg film-coated tablets'
  Index file: 'emea-combined-h961en.0.py'
Index file glob patterns: ['/var/lib/drugle/sources/emea/spc/index/en/*.py']
Total number of index files: 1691
Total number of index files with missing section 4.8: 5
Total number of index files with unrecognized table layouts in section 4.8: 5
Total number of index files with identified layouts (tables and unstructured): 1686 [99.70%]
  Number of known unstructured table layouts: 162
  Total number of recognized table layouts: 2803 (1279 layouts were identified by more than 1 parser.)
    spc-en-section-48-table (prio: 2):                       2 (0 parsed) (SPC section 4.8 table beginning with 'TABLE')
    spc-en-section-48-table-1 (prio: 2):                   196 (0 parsed) (SPC section 4.8 table beginning with 'Table 1' not followed by a colon)
    spc-en-section-48-table-colon-N (prio: 2):             158 (0 parsed) (SPC section 4.8 table beginning with 'Table N:' where N is a number [1..4])
    spc-en-section-48-table-header-1 (prio: 2):              7 (7 parsed) (SPC section 4.8 table beginning with the header [u'System Organ', u'Very', u'Common', u'Uncommon', u'Rare'])
    spc-en-section-48-table-header-10 (prio: 2):             2 (0 parsed) (SPC section 4.8 table beginning with the line: '^[ ]*Very Common \(> 1/10\), Common \(> 1/100, < 1/10\), Uncommon \(> 1/1,000, < 1/100\)')
    spc-en-section-48-table-header-11 (prio: 2):            24 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'Frequency', u'Adverse reaction'])
    spc-en-section-48-table-header-12 (prio: 2):             5 (0 parsed) (SPC section 4.8 table beginning with the header [u'System Organ', u'Very Common', u'Common', u'Uncommon'])
    spc-en-section-48-table-header-13 (prio: 2):             1 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'Adverse reaction/event'])
    spc-en-section-48-table-header-14 (prio: 2):             3 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'Common', u'Uncommon'])
    spc-en-section-48-table-header-15 (prio: 2):             5 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'Adverse drug reactions'])
    spc-en-section-48-table-header-16 (prio: 2):             5 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'Adverse reaction', u'Frequency'])
    spc-en-section-48-table-header-17 (prio: 2):             2 (0 parsed) (SPC section 4.8 table beginning with the header [u'System organ class', u'\u2265 1/10', u'\u2265 1/100, < 1/10', u'\u2265 1/1,000, < 1/100'])
    spc-en-section-48-table-header-18 (prio: 2):            74 (72 parsed) (SPC section 4.8 table beginning with the header [u'Very common', u'Common', u'Uncommon', u'Not known'])
    spc-en-section-48-table-header-2 (prio: 2):             56 (0 parsed) (SPC section 4.8 table beginning with the header [u'MedDRA system organ class', u'Subject Incidence', u'Adverse Drug Reaction'])
    spc-en-section-48-table-header-3 (prio: 2):              5 (0 parsed) (SPC section 4.8 table beginning with the header [u'Very common', u'Common', u'Uncommon', u'Rare', u'Not known'])
    spc-en-section-48-table-header-4 (prio: 2):              4 (0 parsed) (SPC section 4.8 table beginning with the header [u'Undesirable Effects in Clinical Studies and Post-marketing in Adult Patients'])
    spc-en-section-48-table-header-5 (prio: 2):             15 (0 parsed) (SPC section 4.8 table beginning with the header [u'System Organ Class', u'*Common', u'*Uncommon', u'*Rare'])
    spc-en-section-48-table-header-6 (prio: 2):              4 (0 parsed) (SPC section 4.8 table beginning with the header [u'Body System', u'Very Common', u'Common', u'Uncommon'])
    spc-en-section-48-table-header-7 (prio: 2):              3 (0 parsed) (SPC section 4.8 table beginning with the header [u'System Organ Class', u'Very Common', u'Common'])
    spc-en-section-48-table-header-8 (prio: 2):             12 (0 parsed) (SPC section 4.8 table beginning with the header [u'System Organ', u'Very', u'Common', u'Uncommon', u'Rare', u'Not Known'])
    spc-en-section-48-table-header-9 (prio: 2):             14 (0 parsed) (SPC section 4.8 table beginning with the header [u'System Organ', u'Common', u'Uncommon', u'Rare', u'Very rare'])
    spc-en-section-48-table-soc (prio: 1):                  46 (0 parsed) (SPC section 4.8 table beginning with 'System organ class')
    spc-en-section-48-table-soc-term (prio: 1):           1027 (0 parsed) (SPC section 4.8 table beginning with a line that begins with a MedDRA system organ class followed by other terms)
    spc-en-section-48-table-soc-term-colon (prio: 1):      250 (0 parsed) (SPC section 4.8 table beginning with a line that begins with a MedDRA system organ class followed by a colon)
    spc-en-section-48-table-soc-term-only (prio: 1):       883 (0 parsed) (SPC section 4.8 table beginning with a line that only contains a MedDRA system organ class)
    spc-sv-section-48-table-1 (prio: 2):                     0 (0 parsed) (Swedish SPC section 4.8 table beginning with 'Tabell 1.')
    spc-sv-section-48-table-1-behandlingsregim (prio: 3):    0 (0 parsed) (Swedish SPC section 4.8 table beginning with 'Tabell 1.' and with columns per 'behandlingsregim')
    spc-sv-section-48-table-1-dosering (prio: 3):            0 (0 parsed) (Swedish SPC section 4.8 table beginning with 'Tabell 1.' and with columns per 'doseringsstrategi')

Some comments:

  1. It is possible to achieve a success rate > 99% in identifying table layouts.
  1. There are 25 table layout patterns that can currently be identified in the existing English SPC:s from EMEA. Only two of these layouts can actually be parsed so far, which accounts for 79 (72 + 7) parsed tables.
  1. Some 4.8 sections contain two or more tables of undesirable effects. The statistics reporting may be flawed!
  1. A large number of table layouts can only be identified if we have a term database of MedDRA System Organ classes (SECTION_LAYOUT_ONLY_MEDDRA_SOC_START, SECTION_LAYOUT_MEDDRA_SOC_COLON_START, SECTION_LAYOUT_MEDDRA_SOC_START).
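To illustrate the last point, a line can be tested against a list of MedDRA System Organ Class names to decide which of the three SOC-based layouts it starts. This is a sketch assuming Python 3; SOC_TERMS is only a small sample of SOC names rather than the full term database, the function name classify_soc_line is hypothetical, and the returned strings are the parser names from the summary output above, not the SECTION_LAYOUT_* constants (whose values are not shown here).

```python
# A small sample of MedDRA System Organ Class names; the real term
# database would contain the complete list.
SOC_TERMS = (
    "Blood and lymphatic system disorders",
    "Cardiac disorders",
    "Gastrointestinal disorders",
    "Nervous system disorders",
    "Psychiatric disorders",
    "Skin and subcutaneous tissue disorders",
)


def classify_soc_line(line):
    """Classify a line by how it starts with a known SOC term, or return None."""
    text = line.strip().lower()
    for term in SOC_TERMS:
        term = term.lower()
        if text == term:
            return "spc-en-section-48-table-soc-term-only"
        if text.startswith(term):
            rest = text[len(term):]
            if rest.lstrip().startswith(":"):
                return "spc-en-section-48-table-soc-term-colon"
            if rest[0].isspace():
                return "spc-en-section-48-table-soc-term"
    return None


for line in ("Nervous system disorders",
             "Nervous system disorders: Headache",
             "Nervous system disorders   Common   Headache",
             "Headache was commonly reported."):
    print(repr(line), "->", classify_soc_line(line))
```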

Instructions for updating a Drugle installation

1. [Optional] Run:

/var/lib/drugle/scripts$ ./spc_section_48_compile.py -len -demea -s

To check that the script runs properly and reasonable statistics are reported.

2. [Optional] Run:

/var/lib/drugle/scripts$ ./spc_section_48_compile.py -len -demea -s -rspc-en-section-48-table-header-1 -i*

To report info on all products (-i*) that have tables identified by the 'spc-en-section-48-table-header-1' parser (-rspc-en-section-48-table-header-1).

3. [Optional] Run:

/var/lib/drugle/scripts$ ./spc_section_48_compile.py -len -demea -s -rspc-en-section-48-table-header-18 -i*

To report info on all products (-i*) that have tables identified by the 'spc-en-section-48-table-header-18' parser (-rspc-en-section-48-table-header-18).

4. [Required] Run:

/var/lib/drugle/scripts$ ./spc_section_48_compile.py -len -demea -c

To compile all index files that have tables that have been identified and parsed - and thus can be compiled.