Rigor and Transparency Index, a new metric of quality for assessing biological and medical science methods | bioRxiv

Abstract:  The reproducibility crisis in science is a multifaceted problem involving practices and incentives, both in the laboratory and in publication. Fortunately, some of the root causes are known and can be addressed by scientists and authors alike. After careful consideration of the available literature, the National Institutes of Health identified several key problems with the way that scientists conduct and report their research and introduced guidelines to improve the rigor and reproducibility of pre-clinical studies. Many journals have implemented policies addressing these same criteria. We currently have, however, no comprehensive data on how these guidelines are impacting the reporting of research. Using SciScore, an automated tool developed to review the methods sections of manuscripts for the presence of criteria associated with the NIH and other reporting guidelines, e.g., ARRIVE, RRIDs, we have analyzed ~1.6 million PubMed Central papers to determine the degree to which articles were addressing these criteria. The tool scores each paper on a ten-point scale, identifying sentences associated with compliance with rigor criteria (5 pts) and those associated with key resource identification and authentication (5 pts). From these data, we have built the Rigor and Transparency Index, which is the average score for analyzed papers in a particular journal. Our analyses show that the average score over all journals has increased since 1997, but remains below five, indicating that fewer than half of the rigor and reproducibility criteria are routinely addressed by authors. To analyze the data further, we examined the prevalence of individual criteria across the literature, e.g., the reporting of a subject’s sex (21-37% of studies between 1997 and 2019), the inclusion of sample size calculations (2-10%), whether the study addressed blinding (3-9%), or the identifiability of key biological resources such as antibodies (11-43%), transgenic organisms (14-22%), and cell lines (33-39%). The greatest increase in prevalence for rigor criteria was seen in the use of randomization of subjects (10-30%), while software tool identifiability improved the most among key resource types (42-87%). We further analyzed, over time, individual journals that had implemented specific author guidelines covering rigor criteria, and found that the guidelines had a substantial impact in some journals but little effect in others. We speculate that unless they are enforced, author guidelines alone do little to improve the number of criteria addressed by authors. Our Rigor and Transparency Index did not correlate with the impact factors of journals.
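The index itself is a straightforward aggregation: each paper earns up to 5 points for rigor criteria and up to 5 for key resource identification, and a journal's index is the mean score over its analyzed papers. A minimal Python sketch of that aggregation, assuming hypothetical per-paper criterion flags; the criterion names, function names, and example data below are illustrative, not SciScore's internal representation:

```python
from collections import defaultdict

# Illustrative criterion names, loosely following the abstract; one point each.
RIGOR = {"sex", "sample_size", "blinding", "randomization", "ethics_approval"}
RESOURCES = {"antibodies", "cell_lines", "organisms", "software", "plasmids"}

def paper_score(criteria_found: set[str]) -> int:
    """Score a paper 0-10: one point per detected rigor or key-resource criterion."""
    return len(criteria_found & RIGOR) + len(criteria_found & RESOURCES)

def rigor_transparency_index(papers: list[tuple[str, set[str]]]) -> dict[str, float]:
    """Average paper score per journal, as the RTI is defined in the abstract."""
    by_journal = defaultdict(list)
    for journal, criteria in papers:
        by_journal[journal].append(paper_score(criteria))
    return {journal: sum(s) / len(s) for journal, s in by_journal.items()}

# Hypothetical example: two papers in one journal, one in another.
papers = [
    ("J. Example Res.", {"sex", "randomization", "antibodies", "software"}),
    ("J. Example Res.", {"blinding", "cell_lines"}),
    ("Another J.", {"sex", "sample_size", "software"}),
]
print(rigor_transparency_index(papers))
# {'J. Example Res.': 3.0, 'Another J.': 3.0}
```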


Deep Learning in Mining Biological Data | SpringerLink

Abstract:  Recent technological advancements in data acquisition tools have allowed life scientists to acquire multimodal data from different biological application domains. Categorized into three broad types (i.e. images, signals, and sequences), these data are vast in volume and complex in nature. Mining such enormous amounts of data for pattern recognition is a major challenge and requires sophisticated data-intensive machine learning techniques. Artificial neural network-based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures—known as deep learning (DL)—have been successfully applied to solve many complex pattern recognition problems. To investigate how DL—especially its different architectures—has contributed and been utilized in the mining of biological data pertaining to those three types, a meta-analysis has been performed and the resulting resources have been critically analysed. Focusing on the use of DL to analyse patterns in data from diverse biological domains, this work investigates different DL architectures’ applications to these data. This is followed by an exploration of available open access data sources pertaining to the three data types along with popular open-source DL tools applicable to these data. Also, comparative investigations of these tools from qualitative, quantitative, and benchmarking perspectives are provided. Finally, some open research challenges in using DL to mine biological data are outlined and a number of possible future perspectives are put forward.
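As a concrete illustration of the sequence-type architectures such surveys cover, here is a minimal 1D convolutional network for classifying one-hot-encoded DNA sequences, sketched in PyTorch; the task, layer sizes, and data are hypothetical, chosen only to show the pattern, and are not taken from the paper:

```python
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    """Toy 1D CNN: one-hot DNA of shape (batch, 4, length) -> binary class logits."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, 16, kernel_size=8),   # motif-like filters over the sequence
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=8),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),           # global pooling -> length-invariant
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).squeeze(-1))

# One random batch: 8 one-hot encoded "sequences" of length 100.
x = torch.zeros(8, 4, 100).scatter_(1, torch.randint(0, 4, (8, 1, 100)), 1.0)
model = SeqCNN()
print(model(x).shape)  # torch.Size([8, 2])
```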


Constellate

“Learn how to text mine or improve your skills using our self-guided lessons for all experience levels. Each lesson includes video instruction and your own Jupyter notebook — think of it like an executable textbook — ready to run in our Analytics Lab….

Teach text analytics to all skill levels using our library of open education resources, including lesson plans and our suite of Jupyter notebooks. Eliminate setup time by hosting your class in our Analytics Lab….

Create a ready-to-analyze dataset with point-and-click ease from over 30 million documents, including primary and secondary texts relevant to every discipline and perfect for learning text analytics or conducting original research….

Find patterns in your dataset with ready-made visualizations, or conduct more sophisticated text mining in our Analytics Lab using Jupyter notebooks configured for a range of text analytics methods….”
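The self-guided lessons described above typically begin with exactly this kind of pattern-finding. A generic sketch of the simplest text-analytics method (term frequency with stopword filtering), using only the Python standard library rather than Constellate's own client, whose API is not reproduced here; the corpus and names are placeholders:

```python
import re
from collections import Counter

# A tiny stand-in corpus; in practice this would be a downloaded dataset.
corpus = [
    "Text mining finds patterns in large document collections.",
    "Pattern recognition in text requires cleaning and counting terms.",
    "Counting term frequencies is the simplest text mining method.",
]

STOPWORDS = {"the", "in", "is", "and", "a", "of"}

def tokenize(doc: str) -> list[str]:
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]

counts = Counter(t for doc in corpus for t in tokenize(doc))
print(counts.most_common(5))
# [('text', 3), ('mining', 2), ('counting', 2), ...]
```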

Computational Access and Use of Texts and Data behind Paywalls: Challenges and Resources – MIT Events

“The rise of applied data science, digital humanities, machine learning, and artificial intelligence has resulted in an increased need for computational access and reuse of research data and publications, many of which are only available behind paywalls and governed by restrictive terms of use. 

What can you do with proprietary sources? How do you gain access, and how can you make your own research output from such sources shareable? These are questions that many are asking. 

Join experts Katie Zimmerman, Laura Hanscom, and Ye Li from the MIT Libraries in this session to learn about the copyright and contractual implications of paywalled data sources and how you can use them and share your results….”

HTRC Awards 4 SCWAReD ACS Projects | www.hathitrust.org | HathiTrust Digital Library

“HathiTrust Research Center (HTRC) has selected four projects to participate in its special round of Advanced Collaborative Support (ACS), funded by the Andrew W. Mellon Foundation through the Scholar-Curated Worksets for Analysis, Reuse & Dissemination (SCWAReD) project.

The projects will seek to build HTRC worksets drawn from materials related to historically under-resourced and marginalized textual communities, and in doing so, to identify gaps in the HathiTrust collection where such communities are not represented in the digital library. The worksets will be analyzed using text and data mining techniques. The worksets, derived data outputs, and associated documentation will be shared at the end of the projects as illustrative research models of the text and data mining process. The four research models will join a flagship model that is being developed concurrently in collaboration with co-PI Maryemma Graham and her History of Black Writing project at the University of Kansas.

The four awarded projects are: …”

“It’s hard to explain why this is taking so long” – scilog

When it comes into force at the beginning of 2021, the Open Access initiative “Plan S” is poised to help open up and improve academic publishing. Ulrich Pöschl, a chemist and longtime Open Access advocate, explains why free access to research results is important and how an up-to-date academic publishing system can work.

Authors Alliance Files Comment in Support of New Exemption to Section 1201 of the DMCA to Enable Text and Data Mining Research | Authors Alliance

“Yesterday, Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, filed a comment with the Copyright Office for a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. Our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining research on both literary works that are published electronically and motion pictures….”

ASReview – Active learning for Systematic Reviews

“Anyone who goes through the process of screening large amounts of texts such as newspapers, scientific abstracts for a systematic review, or ancient texts, knows how labor intensive this can be. With the rapidly evolving field of Artificial Intelligence (AI), the large amount of manual work can be reduced or even completely replaced by software using active learning.

By using our AI-aided tool, you can not only save time but also increase the quality of your screening process. ASReview enables you to screen more texts in the same amount of time than traditional screening allows, which means you can achieve higher quality than with the traditional approach.

Consider the example of systematic reviews, which are “top of the bill” in research. However, the number of scientific papers on any topic is skyrocketing. Since it is of crucial importance for the advancement of science to produce high-quality systematic review articles, sometimes as quickly as possible in times of crisis, we need to find a way to effectively automate this screening process. Before Elas* was there to help you, systematic reviewing was an exhausting task, often very boring….”
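The active-learning loop behind screening tools like ASReview can be sketched in a few lines: fit a classifier on the records labeled so far, rank the unlabeled records by predicted relevance, and hand the top-ranked record to the human screener next. A generic scikit-learn illustration of that loop follows; it is not ASReview's API, and the model choice, features, and data are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder abstracts and hidden "true" labels (1 = relevant to the review).
abstracts = [
    "randomized trial of drug X for hypertension",
    "survey of bird migration patterns",
    "meta-analysis of drug X cardiovascular outcomes",
    "soil chemistry of volcanic islands",
    "drug X dosing study in hypertensive patients",
]
true_labels = np.array([1, 0, 1, 0, 1])

X = TfidfVectorizer().fit_transform(abstracts)
labeled = {0: 1, 1: 0}          # seed set: one relevant, one irrelevant record
unlabeled = set(range(len(abstracts))) - set(labeled)

while unlabeled:
    clf = LogisticRegression().fit(X[list(labeled)], [labeled[i] for i in labeled])
    # Certainty-based sampling: screen the record most likely to be relevant next.
    idx = max(unlabeled, key=lambda i: clf.predict_proba(X[i])[0, 1])
    labeled[idx] = true_labels[idx]     # simulate the human screener's decision
    unlabeled.discard(idx)
    print(f"screened record {idx}: label={labeled[idx]}")
```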

ODDPub – a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications

Abstract:  Open research data are increasingly recognized as a quality indicator and an important resource to increase transparency, robustness and collaboration in science. However, no standardized way of reporting Open Data in publications exists, making it difficult to find shared datasets and assess the prevalence of Open Data in an automated fashion.

We developed ODDPub (Open Data Detection in Publications), a text-mining algorithm that screens biomedical publications and detects cases of Open Data. Using English-language original research publications from a single biomedical research institution (n = 8689) and publications randomly selected from PubMed (n = 1500), we iteratively developed a set of derived keyword categories. ODDPub can detect data sharing through field-specific repositories, general-purpose repositories or the supplement. Additionally, it can detect shared analysis code (Open Code).

To validate ODDPub, we manually screened 792 publications randomly selected from PubMed. On this validation dataset, our algorithm detected Open Data publications with a sensitivity of 0.73 and specificity of 0.97. Open Data was detected for 11.5% (n = 91) of publications. Open Code was detected for 1.4% (n = 11) of publications with a sensitivity of 0.73 and specificity of 1.00. We compared our results to the linked datasets found in the databases PubMed and Web of Science.

Our algorithm can automatically screen large numbers of publications for Open Data. It can thus be used to assess Open Data sharing rates at the level of subject areas, journals, or institutions. It can also identify individual Open Data publications in a larger publication corpus. ODDPub is published as an R package on GitHub.
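ODDPub itself is implemented in R around derived keyword categories; a much-simplified Python sketch of the same idea shows how co-occurring availability and repository patterns can flag Open Data and Open Code. The categories, phrases, and `detect_sharing` function below are illustrative stand-ins, not ODDPub's actual keyword lists:

```python
import re

# Illustrative keyword categories, far smaller than ODDPub's derived lists.
REPOSITORIES = r"\b(?:GEO|ArrayExpress|Dryad|Zenodo|figshare|SRA)\b"
AVAILABILITY = r"\b(?:data (?:are|is) available|deposited (?:in|at)|accession (?:number|code))\b"
CODE_SHARING = r"\b(?:code is available|github\.com/\S+)\b"

def detect_sharing(methods_text: str) -> dict[str, bool]:
    """Flag Open Data when an availability phrase and a repository name co-occur."""
    t = methods_text.lower()
    has_repo = bool(re.search(REPOSITORIES, methods_text))  # repo names stay case-sensitive
    has_avail = bool(re.search(AVAILABILITY, t))
    return {
        "open_data": has_repo and has_avail,
        "open_code": bool(re.search(CODE_SHARING, t)),
    }

text = ("RNA-seq data are available in GEO under accession number GSE00000. "
        "Analysis code is available at github.com/example/repo.")
print(detect_sharing(text))
# {'open_data': True, 'open_code': True}
```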
