
  • You’re over-egging it a bit. A well-written SOAP note, HPI, etc. should distill to a handful of possibilities, that’s true. That’s the point of them.

    The fact that the LLM can interpret those notes 95% as well as a medically trained individual (per the article) and come up with the correct diagnosis is being a little undersold.

    That’s not nothing. Actually, that’s a big fucking deal™ if you think through the edge-case applications. And remember, these are just general-purpose LLMs - and pretty old ones at that (ChatGPT-4 era). We’re not even talking about a medical domain-specific LLM.

    Yeah; I think there’s more here to think on.


  • Agreed!

    I think (hope) the next application of this tech is in point-of-care testing. I recall a story of someone in Sudan(?) using a small, locally hosted LLM with vision abilities to scan handwritten doctor notes and come up with an immunisation plan for their village. I might be misremembering the story, but the anecdote was along those lines.

    We already have PoC testing for things like ultrasound… but some interpretation workflows rely on a strong net connection, IIRC. It’d be awesome to have something on-device that can be used for imaging interpretation where there is no other infra.

    Maybe someone can finally win that $10 million X Prize for the first viable tricorder (pretty sure that one wrapped up years ago? Too lazy to look)… one that isn’t smoke and mirrors like Theranos.


  • Funny how the hivemind overlooks that bit en route to stunting on LLMs.

    If anything, that 90% result supports the idea that Garbage In = Garbage Out. I imagine a properly used domain-tuned medical model with structured inputs could exceed those results in some diagnostic settings (task-dependent).

    IIRC, the 2024 Nobel Prize in Chemistry was won on the basis of using an ML expert system to investigate protein folding. ML != LLM, but at the same time, let’s not throw the baby out with the bathwater.

    EDIT: for the lulz, I fed my above comment to my locally hosted bespoke LLM. It politely called out my bullshit (AlphaFold is technically not an expert system; I didn’t cite my source for the Med-PaLM 2 claims). Not all hope is lost with these things lol. Its response:


    The statement contains a mix of plausible claims and minor logical inconsistencies. The core idea—that expert systems using ML can outperform simple LLMs in specific tasks—is reasonable.

    However, the claim that “a properly used expert system LLM (Med-PALM-2) is even better than 90% accurate in differentials” is unsupported by the provided context and overreaches from the general “Garbage In = Garbage Out” principle.

    Additionally, the assertion that the 2024 Nobel Prize in Chemistry was won “on the basis of using ML expert system to investigate protein folding” is factually incorrect; the prize was awarded for AI-assisted protein folding prediction, not an ML expert system per se.

    Confidence: medium | Source: Mixed



  • I don’t think it’s their information per se, so much as how the LLMs tend to use said information.

    LLMs are generally tuned to be expressive and lively. A part of that involves “random” (i.e., roll-the-dice) output based on inputs + training data. (I’m skipping over technical details here for the sake of simplicity.)
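    To illustrate the roll-the-dice part, here’s a toy sketch of temperature-scaled sampling. The scores and temperature values are made up for illustration; this isn’t any particular model’s actual decoding code.

    ```python
    import numpy as np

    rng = np.random.default_rng()

    def sample_next_token(logits, temperature=1.0):
        """Pick the next token by rolling weighted dice over the model's scores."""
        if temperature <= 0:
            # Temperature 0 ~ greedy decoding: always take the single most likely token.
            return int(np.argmax(logits))
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())   # softmax, shifted for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    # Toy scores for four made-up candidate next words.
    logits = [2.0, 1.5, 0.3, -1.0]
    print(sample_next_token(logits, temperature=1.2))  # "lively": lower-scoring words still get picked sometimes
    print(sample_next_token(logits, temperature=0.0))  # dialed down: always index 0, the top choice
    ```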

    That’s what the masses have shown they want: friendly, confident-sounding chatbots that can give plausible answers that are mostly right, sometimes.

    But for certain domains (like med) that shit gets people killed.

    TL;DR: they’re made for chitchat engagement, not high-fidelity expert systems. You have to pay $$$$ to access those.



  • Agree.

    I’m sorta kicking myself I didn’t sign up for Google’s Med-PaLM 2 when I had the chance. Last I checked, it passed the USMLE-style exam with 96% and scored 88% on radiology interpretation / report writing.

    I remember looking at the sign-up and seeing it requested credit card details to verify identity (I didn’t have a Google account at the time). I bounced… but gotta admit, it might have been fun to play with.

    Oh well; one door closes, another opens.

    In any case, I believe this article confirms GIGO. The LLMs appear to have been vastly more accurate when fed correct inputs by clinicians versus what laypeople fed them.



  • So, I can speak to this a little bit, as it touches two domains I’m involved in. TL;DR - LLMs bullshit and are unreliable, but there’s a way to use them in this domain as a force multiplier of sorts.

    In one, I’ve created a Python router that takes my (de-identified) clinical notes, extracts and compacts the input (per user-defined rules), creates a summary, then:

    1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database) - a rough sketch of this step is included after the list below.

    2. drops this into my on-device LLM for light editing and polishing to condense it, which I then eyeball, correct, and escalate to my supervisor for review.
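    Here’s roughly what step 1 could look like. This is a hypothetical sketch, not the actual router: the gold_standard.csv filename, the column names, and the 80% coverage threshold are all invented for illustration.

    ```python
    import csv

    def load_gold_standard(path="gold_standard.csv"):
        """Each row: a required element the summary must cover, plus the suggested management step."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))   # e.g. columns: element, management

    def benchmark_summary(summary: str, gold_rows: list[dict], threshold: float = 0.8):
        """Deterministic keyword check: how many gold-standard elements does the summary mention?"""
        text = summary.lower()
        hits = [row for row in gold_rows if row["element"].lower() in text]
        coverage = len(hits) / max(len(gold_rows), 1)
        plan = [row["management"] for row in hits]   # management plan built from the matched elements
        return coverage >= threshold, coverage, plan

    gold = load_gold_standard()
    ok, coverage, plan = benchmark_summary("presenting complaint: chest pain, onset 2h ago ...", gold)
    print(f"passed={ok} coverage={coverage:.0%}")
    ```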

    Additionally, the LLM-generated note can be approved or denied by the Python router, in the first instance, based on certain policy criteria I’ve defined.

    It can also suggest probable DDx based on my databases (which are CSV-based).

    Finally, if the LLM output fails the policy check, the router tells me why it failed and just says, “go look at the prior summary and edit it yourself”.
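    For a sense of what the approve/deny and DDx pieces might look like, here is a minimal sketch. The policy rules, the ddx.csv file, and its column names are invented for illustration; they aren’t my actual policy document or database.

    ```python
    import csv
    import re

    # Hypothetical policy criteria; every rule must pass or the note is rejected with a reason.
    POLICY = [
        ("contains a plan section", lambda note: "plan:" in note.lower()),
        ("no absolute language",    lambda note: not re.search(r"\b(definitely|certainly)\b", note, re.I)),
        ("under 300 words",         lambda note: len(note.split()) < 300),
    ]

    def policy_check(note: str):
        """Deterministic approve/deny: returns (approved, reasons it failed)."""
        failures = [name for name, rule in POLICY if not rule(note)]
        return len(failures) == 0, failures

    def suggest_ddx(summary: str, path="ddx.csv", top_n=3):
        """Score each candidate diagnosis by how many of its keywords appear in the summary."""
        text = summary.lower()
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))  # expected columns: diagnosis, keywords ("chest pain;dyspnoea")
        scored = [(sum(kw.strip() in text for kw in row["keywords"].lower().split(";")), row["diagnosis"])
                  for row in rows]
        return [dx for score, dx in sorted(scored, reverse=True)[:top_n] if score > 0]

    llm_note = "Plan: review bloods, safety-net advice given."
    approved, reasons = policy_check(llm_note)
    if not approved:
        print("Rejected:", reasons, "- go look at the prior summary and edit it yourself.")
    ```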

    This three-step process takes the tedium of paperwork from 15-20 minutes down to ~1 minute of generation plus ~2 minutes of manual editing, which is approximately a 5-7x speed-up.

    The reason why this is interesting:

    All of this runs within the LLM (or, more accurately, it’s invoked from within the LLM: it calls the Python tooling via >> commands, which live outside the LLM’s purview) but is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.
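    One way that kind of invocation can work (a guess at the pattern, not my exact setup): scan the model’s output for lines starting with >> and hand them to plain Python functions, so everything past that point is deterministic. The command names and handlers below are illustrative only.

    ```python
    # Hypothetical dispatcher for ">> command arg" lines emitted by the LLM.
    COMMANDS = {
        "summarise": lambda arg: f"[summary of {arg}]",
        "policy_check": lambda arg: f"[policy verdict for {arg}]",
    }

    def dispatch(llm_output: str):
        """Run every '>> command arg' line through deterministic Python tooling."""
        results = []
        for line in llm_output.splitlines():
            if not line.strip().startswith(">>"):
                continue                                # ordinary prose: ignore
            name, _, arg = line.strip()[2:].strip().partition(" ")
            handler = COMMANDS.get(name)
            results.append(handler(arg) if handler else f"unknown command: {name}")
        return results

    print(dispatch("Here is the note.\n>> summarise note_draft.txt\n>> policy_check note_draft"))
    ```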

    I’ve found that using a fairly “dumb” LLM (Qwen2.5-1.5B), with settings dialed down, produces consistently solid final notes (5 out of 6 are graded as passing on the first run by the router invoking the policy document and checking the output). It’s too dumb to jazz, which is useful in this instance.
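    For reference, “settings dialed down” on a small local model can look something like the sketch below. It assumes the Hugging Face transformers library and the Qwen/Qwen2.5-1.5B-Instruct checkpoint; the prompt and generation parameters are illustrative, not my actual config.

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "Lightly edit and condense the following clinical summary:\n..."
    inputs = tokenizer(prompt, return_tensors="pt")

    # do_sample=False means greedy decoding: no dice-rolling, so the model is "too dumb to jazz".
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,           # deterministic decoding
        repetition_penalty=1.1,    # mild guard against looping
    )
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
    ```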

    Would I trust the LLM end to end? Well, I’d trust my system approximately 80% of the time. I wouldn’t trust ChatGPT… even though it’s been more right than wrong in similar tests.