Biomedical Text Mining

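With the rapid growth of information technology, computational methods have been adopted in research across many fields. Because biomedicine spans many disciplines, advances quickly, and produces highly heterogeneous content, the U.S. National Library of Medicine (NLM) began digitizing biomedical literature as early as the 1960s, building the Medline database of biomedical literature and the PubMed search engine for researchers. Even so, researchers looking for information to design experiments still have to dig their research goals and directions out of large volumes of unstructured literature. Our biomedical text mining group therefore continues to develop more convenient systems for the biomedical field, shortening the tedious literature search process in the hope of accelerating biomedical research as a whole.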

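Biomedical literature has at least the following main characteristics:

  • Many newly coined named entities: in biomedical literature, gene names, protein names, cell names, and drug names are all named entities, and they play a fundamental role in biomedical research articles.

  • Abbreviations of named entities follow no naming rules and vary widely: because full names are often long, researchers commonly use abbreviations instead. For example, interleukin 2 may be abbreviated as IL-2, and p53 is commonly used as shorthand for protein 53, p53 protein, or protein-53.

  • A single name may refer to more than one entity.

  • Sentences may contain complex nested structures.

  • Verbs carry strongly biology-specific meanings, for example activate and induce.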

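Think of it this way: a person may go by many names, such as a given name or a nickname, but everyone has exactly one unique national ID number. Gene names in the literature work the same way, so finding these named entities and linking each one to the database identifier it stands for is a key technique. With the components below, we aim to further advance biomedical text mining techniques and build a complete literature preprocessor that reduces the time biomedical researchers spend searching and helps them quickly find the articles they care about.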
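The "many names, one ID number" idea can be sketched as a small lookup. This is only an illustration under stated assumptions, not our production normalizer: the lexicon contents and the helper names `canonical_key` and `normalize` are made up for this example, and the NCBI Gene IDs shown should be verified against the database before any real use.

```python
import re
from typing import Optional

# Illustrative lexicon (an assumption for this sketch): canonical surface
# forms mapped to NCBI Gene IDs. A real system would load a full database.
GENE_LEXICON = {
    "il2": "3558",            # IL2, interleukin 2
    "interleukin2": "3558",
    "p53": "7157",            # TP53
    "tp53": "7157",
    "p53protein": "7157",
    "protein53": "7157",
}


def canonical_key(mention: str) -> str:
    """Lowercase a mention and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def normalize(mention: str) -> Optional[str]:
    """Return the database ID for a mention, or None if it is not in the lexicon."""
    return GENE_LEXICON.get(canonical_key(mention))


if __name__ == "__main__":
    for m in ["IL-2", "interleukin 2", "p53", "protein-53", "p53 protein", "BRCA1"]:
        print(f"{m!r} -> {normalize(m)}")
```

Collapsing case, spaces, and hyphens handles simple spelling variants only. It does nothing for the harder case listed above, where one name refers to several entities, which needs context to resolve.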

Our team develops biomedical text mining techniques and explores new tasks in BioNLP, including:

  • Named Entity Recognition (NER):

             Locate and classify named entity mentions in biomedical literature. Named entities include genes, proteins, diseases, etc.

  • Named Entity Normalization:

             Map named entities in biomedical literature to the corresponding biomedical database identifiers.

  • Relation Extraction:

             Extract and classify the relations between named entities in biomedical literature, such as disease-disease association, protein-protein interaction, etc.

  • Question Answering (QA):

             Following the success of the Stanford Question Answering Dataset (SQuAD) as a general-domain QA benchmark, building biomedical QA datasets and systems has become popular. Biomedical Semantic Indexing and Question Answering (BioASQ) and Google’s PubMed QA each provide a public leaderboard. Both draw on PubMed abstracts rather than Wikipedia.
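To make the NER task concrete, here is a minimal dictionary-matching sketch that locates mentions and assigns each a type. It is a toy baseline for illustration, not the method our systems use: the lexicon entries, the `Mention` type, and `find_mentions` are all assumptions introduced here.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str    # surface string as it appears in the input
    label: str   # entity type


# Toy lexicon (an assumption for this sketch): surface form -> entity type.
LEXICON = {
    "interleukin 2": "Gene",
    "IL-2": "Gene",
    "p53": "Gene",
    "hypertension": "Disease",
    "diabetes": "Disease",
}


def find_mentions(text: str) -> List[Mention]:
    """Locate lexicon terms in text, case-insensitively, on token boundaries."""
    # Longest terms first so a longer term wins over a shorter overlapping one.
    terms = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(t) for t in terms)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    labels = {t.lower(): label for t, label in LEXICON.items()}
    return [
        Mention(m.start(), m.end(), m.group(0), labels[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for mention in find_mentions("IL-2 and p53 are studied in hypertension."):
        print(mention)
```

Each `Mention` keeps character offsets, so a later normalization or relation extraction step can refer back to the exact position in the text. A fixed dictionary cannot recognize newly coined names, which is one reason statistical NER is needed in practice.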

Main results and competitions:

Disease-Disease Association Extraction (DDAE)

We formulate DDAE as a supervised machine learning classification problem. Given a sentence containing a disease pair, our system classifies the pair into one of several predefined disease association categories. We annotated three DDA types: Positive, Negative, and Null.
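As a rough sketch of how a sentence can be turned into classification instances, the code below builds one instance per disease pair and masks the two diseases in focus. The masking scheme, the placeholder tokens, and the names `DDALabel` and `make_instances` are assumptions for illustration; the page does not describe our actual features or model, and only the three label names come from the description above.

```python
from enum import Enum
from itertools import combinations
from typing import List, Tuple


class DDALabel(Enum):
    """The three annotated disease-disease association types."""
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    NULL = "Null"


Span = Tuple[int, int]  # (start, end) character offsets of a disease mention


def make_instances(sentence: str, diseases: List[Span]) -> List[str]:
    """Build one masked copy of the sentence per unordered disease pair.

    Assumes the spans do not overlap. Diseases outside the pair are left as-is.
    """
    instances = []
    for first, second in combinations(sorted(diseases), 2):
        # Replace the right-hand mention first so the left offsets stay valid.
        masked = sentence[:second[0]] + "@DISEASE2$" + sentence[second[1]:]
        masked = masked[:first[0]] + "@DISEASE1$" + masked[first[1]:]
        instances.append(masked)
    return instances


if __name__ == "__main__":
    s = "Obesity increases the risk of diabetes and hypertension."
    spans = [(s.index(w), s.index(w) + len(w))
             for w in ("Obesity", "diabetes", "hypertension")]
    for instance in make_instances(s, spans):
        print(instance)
```

Each masked string would then be passed to a trained classifier that outputs one `DDALabel`. The classifier itself is omitted here.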

Figure of DDAE examples

AI CUP 2019 Biomedical Literature Automatic Analysis Competition - Protein-Protein Interaction

This task concerns the analysis of clinical medical records using artificial intelligence technology. It aims to give participants a better understanding of the processes and technologies for analyzing biomedical data, and of how natural language processing connects basic medicine, bioinformatics research, and clinical treatment.

Illustration of fields of the competition

AI CUP 2018

Biomedical literature automatic analysis warm-up competition

The competition focuses on natural language processing technology. The open corpus lets students build skills in artificial intelligence, machine learning, natural language processing, and ethics.

The competition has three stages:

  • Elementary stage: Participants are asked to identify biomedical entities using named entity recognition (chemicals, diseases, and genes).

  • Intermediate stage: Participants are asked to identify biomedical entities using named entity recognition (chemicals, diseases, and genes) and to give the corresponding database identifier for each. Some genes are non-human, but most gene identifiers refer to human genes.

  • Advanced stage: Participants are asked to identify biomedical entities using named entity recognition (chemicals, diseases, and genes) and to give the corresponding database identifier for each. They are also asked which organ or tissue each disease originates from (there are 57 organs and tissues, such as lungs, skin, blood, or bone marrow), and whether a chemical entity in the article causes any disease entity in the article.

NERChem 

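For chemical compound and drug name recognition in patent documents, our team proposed recognizing atoms and compounds separately, and improved compound recognition accuracy by first identifying proper nouns that are easily confused with compounds. The system placed fourth in a chemical compound recognition competition on patent literature.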

T-HOD Database

This database uses text mining to collect candidate genes associated with hypertension, obesity, and diabetes, ranks them by weight, and presents literature search results visually.

BelSmile  

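This system integrates the gene and chemical named entity recognition and normalization techniques the lab has developed in recent years with the lab's biomedical semantic role labeling technique. It automatically extracts the various named entities in scholarly literature together with the biological relations among them, and it won second place in the Biological Expression Language (BEL) competition.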

Biomedical Semantic Role Labeling Website

This system automatically parses out semantic frames centered on nouns or verbs. A semantic frame consists mainly of a predicate, an agent, a patient, and other phrases that describe the event, such as time and location.
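A semantic frame can be pictured as a small record with one slot per role. The sketch below is a hypothetical data structure for illustration only: the class name, field names, and the example sentence ("p53 activates p21 transcription after DNA damage.") are assumptions, and the frame shown is annotated by hand rather than produced by the website.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SemanticFrame:
    """One predicate-centered frame; field names are assumptions for this sketch."""
    predicate: str
    agent: Optional[str] = None
    patient: Optional[str] = None
    # Other phrases describing the event, keyed by role, e.g. time or location.
    modifiers: Dict[str, str] = field(default_factory=dict)

    def roles(self) -> Dict[str, str]:
        """Return only the filled roles: predicate, agent, patient, then modifiers."""
        filled = {"predicate": self.predicate}
        if self.agent is not None:
            filled["agent"] = self.agent
        if self.patient is not None:
            filled["patient"] = self.patient
        filled.update(self.modifiers)
        return filled


# Hand-annotated frame for the example sentence, not output of the actual system.
frame = SemanticFrame(
    predicate="activates",
    agent="p53",
    patient="p21 transcription",
    modifiers={"time": "after DNA damage"},
)

if __name__ == "__main__":
    for role, text in frame.roles().items():
        print(f"{role}: {text}")
```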

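PubMed-EX Tool

A handy browser add-on that helps users read articles indexed in the PubMed database. Once PubMed-EX is installed, biomedical terms appearing in the titles and abstracts of PubMed search results are shown in different colors and hyperlinked to database pages with detailed information. In addition, the important semantic frames in each abstract are listed, and abstracts are automatically segmented into sections. Scholars from dozens of countries continue to use it.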

Gene Mention/Normalization Tool

A tool for gene name recognition and gene identifier lookup.

BIOSMILE Web Search

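A biomedical literature search engine. Biomedical terms appearing in the titles and abstracts of retrieved papers are shown in different colors and hyperlinked to database pages with detailed information. In addition, the important semantic frames in each abstract are listed.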

©2018 by IISR.