The BoaC Programming Guide - Domain-Specific Functions

BoaC provides several built-in functions that simplify common analyses of COVID-19 research papers. This section describes them for reference.

Noise Removal Functions

is_ascii (token: string) : bool

Return true if the token can be encoded in US-ASCII.

no_nascii (tokens: array of string) : array of string

Filter an array of tokens, keeping only those that can be encoded in US-ASCII.
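
To make the intended behavior concrete, here is a minimal Python sketch of the two noise removal functions. It is an analogue written for illustration, not BoaC's implementation, and it assumes that no_nascii drops every token for which is_ascii is false.

```python
def is_ascii(token: str) -> bool:
    """Return True if the token can be encoded in US-ASCII."""
    try:
        token.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def no_nascii(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are pure US-ASCII (assumed semantics)."""
    return [t for t in tokens if is_ascii(t)]


if __name__ == "__main__":
    # The second token contains a Greek letter and is removed.
    print(no_nascii(["sars", "β-coronavirus", "cov"]))
```

A filter like this is typically applied right after tokenization so that symbols and mis-encoded characters do not end up in word counts.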

Searching Functions

find_paras (paper: Paper) : array of Paragraph

Return an array of the paragraphs in a paper that contain the keywords.

find_stens (paper: Paper) : array of string

Return an array of the sentences in a paper that contain the keywords.

search_keywords (text: string, keywords: string...) : bool

Return true if the text contains the keywords.
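
The description of search_keywords does not say whether every keyword must appear or only one, nor whether matching ignores case. The Python sketch below is therefore only an illustration under stated assumptions: matching is a case-sensitive substring test, a single hit is enough by default, and the require_all parameter is a hypothetical switch that is not part of BoaC.

```python
def search_keywords(text: str, *keywords: str, require_all: bool = False) -> bool:
    """Report whether the text contains the keywords.

    Assumptions (not confirmed by the guide): case-sensitive substring
    matching, and "any keyword" semantics unless require_all is set.
    require_all is a hypothetical parameter added for illustration.
    """
    match = all if require_all else any
    return match(keyword in text for keyword in keywords)


if __name__ == "__main__":
    abstract = "remdesivir trial results"
    print(search_keywords(abstract, "vaccine", "remdesivir"))
    print(search_keywords(abstract, "vaccine", "remdesivir", require_all=True))
```

Before relying on either interpretation, it is worth checking the actual behavior on a small input with one matching and one non-matching keyword.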

Stemming Functions

stem (word: string) : string

Stem a word.

stem (words: array of string) : array of string

Stem an array of words.

Stop Word Functions

stop_words () : set of string

Return the set of English stop words.

update_stop_words (stopword: string) : set of string

Update the set of English stop words with the given word and return the resulting set.

Tokenizer Functions

get_tokens (text: string) : array of string

Lowercase the text and split it on spaces.

get_tokens (text: string, filter: set of string) : array of string

Get the tokens from a text, excluding every token that appears in the filter set.
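
The two get_tokens forms can be sketched in Python as a single function with an optional exclusion set. This is an illustrative analogue, not BoaC's code. It assumes that splitting happens on runs of whitespace (BoaC may treat consecutive spaces differently), and the small stop word set is a hypothetical sample standing in for the result of stop_words().

```python
from typing import Optional


def get_tokens(text: str, exclude: Optional[set[str]] = None) -> list[str]:
    """Lowercase the text, split it on whitespace, and drop excluded tokens."""
    tokens = text.lower().split()
    if exclude is None:
        return tokens
    return [t for t in tokens if t not in exclude]


if __name__ == "__main__":
    # Hypothetical sample; the real stop word list is much longer.
    sample_stop_words = {"the", "of", "in", "a"}
    title = "The spread of SARS-CoV-2 in Wuhan"
    print(get_tokens(title))
    print(get_tokens(title, sample_stop_words))
```

Because the text is lowercased before filtering, the exclusion set should hold lowercase entries for the filter to take effect.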

tokenize (text: string) : array of string

Tokenize a text with OpenNLP's SimpleTokenizer.
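
OpenNLP documents SimpleTokenizer as a character-class tokenizer: a run of characters of the same class forms one token. The Python sketch below approximates that idea with four assumed classes (whitespace, letters, digits, everything else). It is not the OpenNLP implementation, and its handling of consecutive punctuation characters in particular may differ from the real tokenizer.

```python
from itertools import groupby


def char_class(c: str) -> str:
    """Assign a character to one of four coarse classes."""
    if c.isspace():
        return "space"
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "other"


def tokenize(text: str) -> list[str]:
    """Split text wherever the character class changes; drop whitespace runs."""
    return [
        "".join(group)
        for cls, group in groupby(text, key=char_class)
        if cls != "space"
    ]


if __name__ == "__main__":
    print(tokenize("SARS-CoV-2 emerged in 2019."))
```

Unlike splitting on spaces, this style of tokenization separates punctuation and digits from letters, so a hyphenated term such as SARS-CoV-2 becomes several tokens.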