CONSULTING

Planning, early-stage sample data pulling, and exploration

  1. SpringBoard: A consulting service for investigators seeking insight and input on acquiring, storing, and using medical data for research. SpringBoard provides recommendations on self-management, level of effort, and core services to ensure that investigators are appropriately resourced to complete their study with adequate support and statistical power. Consultations are available throughout an investigator’s entire research pipeline, whenever data issues arise, and include up to 2 hours of free assessment for faculty and staff.
    • Faculty Involvement Procedure: Upon receiving a SpringBoard request, the primary core faculty and senior engineering staff will identify which BMI faculty members should be invited to the project, ensuring the appropriate expertise is engaged early on.
  2. CohortCount: Assessing patient population size at Emory Healthcare and affiliated facilities to determine whether enough patients fulfill the cohort criteria to conduct a study.
  3. GrantGen: Providing researchers with tailored text for grant proposals, IRB protocols and project reports on Biomedical Informatics pipelines, including data acquisition, processing, infrastructure, and relevant statistical summaries.
  4. Project Management: Providing comprehensive project management services, including planning, execution, monitoring, and reporting, to ensure projects are delivered on time and within scope. We plan and coordinate project activities, including core service activities, according to project-specific requirements, constraints, tasks, and timelines. Available throughout the project lifecycle, from conceptualization to completion.



DATA EXTRACTION

Large-scale data extraction and exploration

  1. DataDig (clinical data): Facilitating the extraction of clinical data from electronic health records tailored to the needs of the project. This involves retrospective data extraction using various formats such as physiological data (time-series, images and video), text (e.g. clinical notes), flat files, and snapshots. In addition, DataDig provides services for multimodal data integration from various resources, data normalization, ontology encoding, de-identification, and cleaning.
  2. DataGrab (non-medical sources): Addressing the need for multimodal data from devices or resources beyond standard medical databases, such as wearables, sensors, and mobile phones, as well as social media and public databases (e.g., for studying social determinants of health).
  3. Streaming Data (continuous data feed): Providing continuous data feeds for real-time machine learning models. This service supports projects requiring up-to-the-minute data for dynamic model training and prediction.
  4. Synthetic Data and Digital Twins: Generating synthetic datasets and creating digital twins that resemble real data to support research on model development and data augmentation, and to address the challenges of imbalanced or insufficient data when building and training deep learning models. This service includes simulating real-world conditions and testing scenarios in an AI/ML-friendly virtual environment. It also extends to creating artificially generated datasets based on specified data distributions, suitable for the development and testing of computational methods and as an alternative to de-identification for data sharing (see the sketch below).
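
To make the “specified data distributions” use case concrete, the following is a minimal NumPy sketch. The features (age, heart rate, lactate), their distribution parameters, and the imbalanced outcome are illustrative assumptions, not drawn from any real dataset.

    # Minimal sketch: a synthetic tabular dataset drawn from specified
    # distributions. All feature names and parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(seed=42)
    n = 10_000

    # Draw each feature from a stated distribution.
    age = rng.normal(loc=62, scale=15, size=n).clip(18, 100)         # years
    heart_rate = rng.normal(loc=80, scale=12, size=n).clip(30, 220)  # bpm
    lactate = rng.lognormal(mean=0.4, sigma=0.5, size=n)             # mmol/L

    # Imbalanced binary outcome, a common challenge this service targets;
    # the probability rises weakly with lactate to embed a learnable signal.
    p = 1 / (1 + np.exp(-(lactate - 4.0)))
    outcome = rng.binomial(1, 0.05 + 0.3 * p)

    dataset = np.column_stack([age, heart_rate, lactate, outcome])
    print(dataset.shape, outcome.mean())  # (10000, 4) and the positive rate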



DATA ANALYSIS

Data analysis using machine learning, deep learning, natural language processing and large language models

  1. DataHack: BMI faculty and staff offer state-of-the-art expertise in the analysis of medical data, ranging from signal processing to deep learning. The DataHack service runs feasibility studies and advises on which experts to work with and which techniques to apply to the data.
  2. Visual Analytics – Dashboards: Developing visual analytics dashboards to aid researchers in interpreting complex datasets. These dashboards provide real-time insights and facilitate data-driven decision-making.
  3. ML-based Data Analysis and Processing: Offering advanced data analysis and processing services using state-of-the-art model-based and data-driven techniques from machine learning and deep learning.
  4. NLP-LLM Data Analysis: Our team at BMI has substantial expertise in natural language processing (NLP) and large language models (LLMs). With the growing demand for NLP-LLM tools and services in biomedical informatics applications, we provide services including:
    • LLM Services:
      • Prompting strategies, including but not limited to chain-of-thought prompting and in-context learning. This service includes exploring prompting strategies, identifying optimal ones, and minimizing hallucinations. Both hard and soft prompting options will be available; in the near future, we also aim to provide trainable prompting (optimizing prompts automatically via training). A prompt-construction sketch follows this list.
      • Automated coding/annotation: This service will replicate human coding of text-based data to make the process more scalable.
      • Information extraction and named entity recognition: Employing state-of-the-art information extraction methods, including supervised and unsupervised LLM-based approaches, and benchmarking and optimizing multiple named entity recognition algorithms (both rule-based and machine learning-based).
      • Fine-tuning/customizing LLMs: Fine-tuning existing open-source LLMs with internal Emory Healthcare data, using strategies that include domain-adaptive pretraining, source-adaptive pretraining, and topic-specific pretraining.
      • De-identification of textual health records using customized, high-precision de-identification methods. This includes physicians’ clinical notes, which are invaluable because they capture expert context at the point of care, but often contain patient identifiers that are difficult to de-identify.
      • Lexicon expansion: Most NLP on clinical notes still involves creating lexicons and knowledge bases, which are used to detect concepts; however, not every variant of a concept is encoded in a lexicon. We use large language models and semantic similarity-based strategies to automatically expand existing lexicons with EHR-specific variants (see the sketch after this list).
      • Customized language models: i) training context-free language models from scratch (e.g., word- or phrase-level embeddings); ii) further pretraining existing transformer-based models with EHR (or other) data; iii) fine-tuning existing open-source LLMs.
      • Retrieval Augmented Generation: Constrained text generation with retrieval engines at the back-end.
      • Quantization of models: Task-oriented quantized models that can be deployed in low-resource environments.
      • Feasibility consultation: Provide an assessment of the feasibility of conducting a specific NLP task involving LLMs or otherwise.
    • NLP Services:
      • Regular expression preparation: Customized for detecting complex lexical patterns in clinical notes.
      • Fuzzy matching: Inexact matching with thresholding for detecting concepts even when they are misspelled or expressed in a non-standard form (a combined regex and fuzzy-matching sketch follows this list).
      • Cohort discovery from text: Develop strategies for creating cohorts from clinical notes when ICD codes do not cover the population of interest. Particularly useful for detecting rare cohorts. Methods employed for cohort creation involve rule-based NLP (e.g., fuzzy matching) and supervised classification.
      • Supervised classification: Employ state-of-the-art supervised classification methods. Benchmark and optimize multiple supervised classification algorithms (including traditional and transformer-based) on the same data to identify the best strategy. Can be useful for a variety of tasks including cohort discovery.
      • End-to-end pipelines: Provide solutions (design and implementation) for end-to-end processing of clinical notes involving multiple NLP and machine learning modules.
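
To illustrate the hard-prompting service above, the sketch below assembles a few-shot chain-of-thought prompt. The task, clinical snippets, and labels are invented for illustration; the resulting string would be sent to whichever LLM endpoint a given project uses.

    # Minimal sketch of a few-shot chain-of-thought (hard) prompt.
    # All exemplars below are invented for illustration.
    EXEMPLARS = [
        {
            "note": "Pt reports CP radiating to left arm, diaphoretic.",
            "reasoning": "Chest pain radiating to the arm with diaphoresis "
                         "suggests possible acute coronary syndrome.",
            "label": "cardiac",
        },
        {
            "note": "Productive cough x5 days, afebrile, clear CXR.",
            "reasoning": "Cough without fever or infiltrate points to a "
                         "non-cardiac, likely bronchitic picture.",
            "label": "non-cardiac",
        },
    ]

    def build_cot_prompt(new_note: str) -> str:
        """Assemble an in-context-learning prompt: worked examples first,
        each showing intermediate reasoning, then the new case."""
        parts = ["Classify each note as cardiac or non-cardiac. "
                 "Think step by step."]
        for ex in EXEMPLARS:
            parts.append(f"Note: {ex['note']}\nReasoning: {ex['reasoning']}\n"
                         f"Label: {ex['label']}")
        parts.append(f"Note: {new_note}\nReasoning:")
        return "\n\n".join(parts)

    print(build_cot_prompt("Sharp pleuritic pain, worse on inspiration."))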
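
The lexicon-expansion service can be sketched as follows, assuming the open-source sentence-transformers package and the public all-MiniLM-L6-v2 model. The seed terms, candidate variants, and 0.5 threshold are illustrative; in practice, candidates would be mined from the EHR corpus and vetted manually before entering the lexicon.

    # Minimal sketch of semantic-similarity lexicon expansion.
    from sentence_transformers import SentenceTransformer, util

    seed_lexicon = ["myocardial infarction", "heart attack"]
    candidates = ["MI", "cardiac arrest", "STEMI", "broken leg", "acute MI"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    seed_emb = model.encode(seed_lexicon, convert_to_tensor=True)
    cand_emb = model.encode(candidates, convert_to_tensor=True)

    # Keep candidates whose best similarity to any seed term clears a
    # threshold; 0.5 is illustrative and would be tuned on annotated data.
    sims = util.cos_sim(cand_emb, seed_emb).max(dim=1).values
    expanded = [t for t, s in zip(candidates, sims) if s >= 0.5]
    print(expanded)  # EHR-specific variants to add after manual review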
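
Finally, a combined sketch of the regular-expression and fuzzy-matching services, using only the Python standard library. The dose pattern, example note, and 0.85 threshold are illustrative assumptions.

    # Regex for a complex lexical pattern (drug-dose-unit expressions), with
    # thresholded fuzzy matching as a fallback for misspelled mentions.
    import re
    from difflib import SequenceMatcher

    DOSE_PATTERN = re.compile(
        r"\b(?P<drug>[a-z]+)\s+(?P<dose>\d+(\.\d+)?)\s*(?P<unit>mg|mcg|g)\b",
        re.IGNORECASE,
    )

    def fuzzy_contains(text: str, concept: str, threshold: float = 0.85) -> bool:
        """Slide a concept-sized window over the note and keep the best
        similarity ratio against the target concept."""
        words = re.findall(r"[a-z]+", text.lower())
        k = max(1, len(concept.split()))
        return any(
            SequenceMatcher(None, " ".join(words[i:i + k]), concept).ratio()
            >= threshold
            for i in range(len(words) - k + 1)
        )

    note = "Pt on metfromin 500 mg BID for diabetis."
    m = DOSE_PATTERN.search(note)
    print(m.group("drug"), m.group("dose"), m.group("unit"))  # metfromin 500 mg
    print(fuzzy_contains(note, "metformin"))  # True: the typo still matches
    print(fuzzy_contains(note, "diabetes"))   # True: "diabetis" clears 0.85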



SOFTWARE/MODEL DEVELOPMENT

  1. Application Development (frontend and backend): Developing mobile and web-based applications and the backend systems necessary for research activities. This involves both software development and model creation to support data analysis and processing. These applications will support data collection, analysis, and real-time monitoring, which includes creating user-friendly interfaces and ensuring data security.
  2. Machine Learning Algorithm Development: Creating and refining machine learning algorithms tailored to specific research needs. This includes standardized training, validation and testing of state-of-the-art machine learning and deep learning models.
  3. ML/NLP/LLM Computation Optimization: Machine learning and deep learning models trained on very large datasets require significant computational resources. In cloud-based systems, the computational load translates directly into the cost of cloud credits. In on-premises processing the relationship is indirect, but over the long term the cost equals or even exceeds that of cloud-based computation (factoring in expert human resources, maintenance, repair, replacement, electricity, cooling systems, backup mechanisms, etc.). Therefore, with a team of skilled staff and faculty with formal training and extensive experience in computer science and AI, we provide computational optimization services for these models.
  4. ML/NLP/LLM Model Tuning and Transfer Learning: Transfer learning and model tuning/adaptation are key techniques in big data settings, where retraining models from scratch is computationally unaffordable or infeasible. Our team has significant experience with transfer learning and model tuning and will provide both as a core service (a minimal sketch follows this list).
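
A minimal sketch of the freeze-then-tune pattern behind this service, assuming PyTorch and the Hugging Face transformers package; the base checkpoint (bert-base-uncased) and binary label set are placeholders for project-specific choices, such as a clinical-domain model.

    # Freeze a pretrained backbone and train only the task head, the
    # cheapest form of transfer learning when full retraining is infeasible.
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # placeholder checkpoint and labels
    )

    # Freeze the pretrained BERT backbone; only the classification head
    # keeps requires_grad=True and receives gradient updates.
    for param in model.bert.parameters():
        param.requires_grad = False

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=3e-5)
    print(f"training {sum(p.numel() for p in trainable):,} of "
          f"{sum(p.numel() for p in model.parameters()):,} parameters")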



SOFTWARE/MODEL DEPLOYMENT

  1. Application/Software Hosting: Providing hosting and deployment services for applications and software on the BMI HPC cluster.
  2. Infrastructure Support: Offering comprehensive support for computation, model development and deployment infrastructure, including maintenance and integration with IT services to facilitate smooth operation and data flow on funded projects.
  3. Software/Model Validation: We plan to offer model validation as a service, conducting rigorous validation of software and models to ensure they follow standard best practices for training, validation, and testing on unseen data. This includes testing for accuracy, reliability, and compliance with regulatory requirements. We will provide this service to Emory and external researchers seeking regulatory clearances such as FDA approval. The FDA’s Medical Device Development Tools (MDDT) program aims to streamline the development, evaluation, and innovation of medical devices by qualifying tools and data that sponsors can use to support regulatory submissions. Entities with access to large, standardized datasets or with expertise in model development and validation may apply to contribute to the MDDT program; qualified entities serve as third parties evaluating algorithms and software for companies seeking FDA approval. Leveraging our team’s unique experience organizing standardized biomedical informatics-focused data challenges (including the PhysioNet Challenge and related hackathons), and building on Emory’s data warehouse, our vision for the MIAI core is to join the MDDT program, enabling us to provide software and model validation as a service.
  4. ML Model Deployment: Facilitating the deployment of machine learning models into operational environments and dashboards (a minimal serving sketch follows this list).
  5. NLP/LLM Model Deployment: We will provide a similar service for NLP and LLM models. While many aspects of NLP/LLM deployment mirror ML model deployment, there are also unique challenges, requirements, and opportunities that make this an appealing service for the core’s customers.
  6. Data Management: We provide systematic organization, storage, and maintenance of identified, limited, and de-identified clinical and associated research data to ensure its accuracy, accessibility, and security, using on-premises and cloud resources. Key aspects include:
    • Grant Writing: Assisting PIs in designing and implementing data management plans for their grants.
    • Data Management and Storage: Gathering data from various sources, storing structured data in AI/ML-ready formats in databases and data warehouses, and storing unstructured data in data lakes.
    • Data Security: Protecting data from unauthorized access and breaches through encryption, access controls, and other security measures. 
    • Data Governance: Establishing policies and procedures for data management to ensure compliance with regulations and standards such as existing IRB approvals and data sharing agreements.
    • Data Lifecycle Management: Managing data throughout its lifecycle, from creation and storage to archiving and deletion.
    • Repository Submission: Advising and supporting data upload and submission to public data repositories, based on data management plans.
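
As an illustration of ML model deployment (item 4 above), the sketch below wraps a trained model behind an HTTP endpoint with FastAPI. The model file, feature names, and route are illustrative assumptions; a production deployment on the BMI HPC cluster would add authentication, logging, and monitoring.

    # Minimal model-serving sketch, assuming a scikit-learn pipeline
    # serialized to "model.joblib" (an illustrative path).
    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="Example model endpoint")
    model = joblib.load("model.joblib")  # e.g., a fitted sklearn Pipeline

    class Features(BaseModel):
        age: float
        heart_rate: float
        lactate: float

    @app.post("/predict")
    def predict(x: Features) -> dict:
        # Feature order must match the order used at training time.
        score = model.predict_proba([[x.age, x.heart_rate, x.lactate]])[0, 1]
        return {"risk_score": float(score)}

    # Run with: uvicorn serve:app --port 8000  (assuming this file is serve.py)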