Mike Vizard: Hello, and welcome to the latest edition of The Techstrong AI video series. I’m your host, Mike Vizard. Today we’re with Gharib Gharibi, who is head of research in AI and privacy for TripleBlind. And we’re gonna be talking about the use of data to train AI models and how that gets applied in health care without necessarily violating everybody’s privacy rights. That’s a tall order right there. Gharib, welcome to the show.
Gharib Gharibi: Thank you, Michael, excited to be here.
Mike Vizard: We’re all experiencing a certain amount of irrational exuberance with all things AI these days. But are we taking into account the privacy issues that go with sharing data with all these fancy new generative AI models that are out there?
Gharib Gharibi: Yes, things are becoming more complicated as these models become more complicated and bigger in size; the underlying model now has a couple of hundred billion parameters. This large number of parameters, trained on almost the entirety of the internet, makes it really complicated to understand these models. How much of the data do they actually memorize verbatim? Do they understand the context of private information and know not to leak it? So it is still early to tell how bad data privacy with these large language models is, and is going to be in the future, because under the hood these models are still black box models. That’s how we describe neural networks, or deep learning, basically: black box models, because we don’t really know how they generalize and whether or not they understand what they’re saying. Therefore, it is difficult to judge how good they are at understanding private information and withholding it from entities that should not know it. So it’s still a very complicated process.
Mike Vizard: Is it the case that this will force the health care industry down a path towards building or relying on large language models that are isolated, or that they build themselves and are domain specific, because otherwise they’ll run afoul of any number of compliance issues?
Gharib Gharibi: Yes, that’s exactly the main reason that AI still cannot really achieve great results in the medical domain the same way it does outside the health care domain, because in the medical domain data is about people. And privacy nowadays is a fundamental human right, and therefore they need to be very careful. A lot of health care providers actually worry and care about patients’ data. So they’re ethical, and they make sure that before they adopt this technology, they are doing the right thing. Other health care providers are forced by regulations to actually follow the rules, such as HIPAA privacy policies here in the United States. So all of these problems actually make it very difficult for them to adapt or fine-tune a model like ChatGPT, for example. This forces them to install GPT on their Microsoft cloud, because mainly Microsoft is who’s contracted with OpenAI, for example, to do these big models. So this means that health care providers might have to de-identify their data and push it up to the cloud so they can access these large language models. Unfortunately, for companies like OpenAI today, these models are proprietary; they’re not going to easily disclose the underlying architecture of these models, the weights of these models and how they work. So it is really tough to adopt these models. And that’s where companies like TripleBlind are trying to solve these issues. We’re trying to enable health care providers to fully utilize their data and have access to these large language models without compromising the privacy of the patients in these datasets. So in short, yes, it is a hurdle; it is challenging for health care to adopt this cutting-edge technology today because of privacy concerns, and security as well.
But there is very promising work around new methods of creating these models. You have probably heard of federated learning, which was created at Google around the end of 2016 or 2017. At TripleBlind, for example, today we have an even better solution to train models on decentralized data without actually having to send the data outside its source. We call that method blind learning because the data owner is blind to the model, which is important for model creators, and the model owner is blind to the data; they don’t get to see the data ever. This way two parties, a data owner and a model owner or model creator, can collaborate on data, and a lot of times on data that’s coming from different sources, so different hospitals, even multinational datasets, without actually compromising the underlying privacy. So there are a lot of advancements in the privacy domain that enable that. But from the privacy perspective, the efficiency, the usability, and the privacy also need to be balanced. A very private system means it learns very little about the training data, which means that the system might not be as efficient and as effective as we would like it to be. So today, we’re trying to balance the usability and the privacy of these methods that we are creating.
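As a rough illustration of training on decentralized data, the following is a minimal sketch of federated averaging, the general idea behind the federated learning mentioned above. It is not TripleBlind’s blind learning method, whose details the interview does not give. The one-parameter model, the function names and the toy data are all assumptions made for the example.

```python
# Minimal federated averaging sketch (illustrative only; not TripleBlind's method).
# Each "hospital" fits y = w * x locally; only the weight w leaves the site,
# never the raw (x, y) records.

def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one site's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(sites, rounds=50):
    """Server loop: broadcast w, collect local weights, average by site size."""
    w = 0.0
    total = sum(len(d) for d in sites)
    for _ in range(rounds):
        local_weights = [(local_update(w, d), len(d)) for d in sites]
        w = sum(lw * n for lw, n in local_weights) / total
    return w

# Hypothetical toy data: every site follows y = 3x, with different x values.
hospital_a = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
hospital_b = [(x, 3.0 * x) for x in (0.5, 1.5, 2.5, 3.5)]

w = federated_average([hospital_a, hospital_b])
print(f"learned weight: {w:.4f}")
```

In this toy setup the shared weight converges toward 3 even though neither site ever sees the other’s records. Real deployments would add protections on the exchanged updates, which this sketch leaves out.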
Mike Vizard: And you look at this whole space. In the European Union, they have this thing called GDPR. I’m sure you’re familiar with it. And part of that says you have the right to be forgotten. So how does one ask an AI model to forget them?
Gharib Gharibi: That’s a great question, and it is a very tough problem. There is an entire domain in machine learning, or AI, called machine unlearning. The goal of this entire domain is to actually make a model forget information about a specific patient. There are many different ways to solve that problem; there is no silver bullet yet. One of the ways is that we can train hierarchies of models, and we know model number 13 was trained on this group of patients. So if one of these patients actually asks to be forgotten, then we remove that entire model; we delete it from the big model, which is made as an ensemble of these small models that were trained on subgroups of the patients. So if somebody has to be forgotten, we’ll delete that part of the model. We re-fine-tune that ensemble model to make sure that it still has the same efficiency and performance. But that’s problematic, right? Every time a patient wants to be forgotten, we have to delete part of our model. So that’s not really practical today. Some other approaches, and this is maybe in the domain of differential privacy, say: well, if we cannot extract information from this specific model about that specific patient who asked to be forgotten, this means that the model has generalized well enough that it has not memorized that specific person, and therefore this model does not actually remember you. And there are actually very well-established methods to measure something like that. So you bring a data set with that person’s record, or patient’s record, and another data set, we call it an adjacent data set, without that specific person’s record.
If whatever computation we are doing on these two data sets gives a very similar answer, or, if it is a trained model, it produces very similar predictions, then we can say that this model cannot actually differentiate your existence or absence in its knowledge, and therefore you’re forgotten. So there are different ways to solve this problem, and they range drastically from theoretical solutions in research papers to more practical solutions that can still provide models that are efficient and general enough that they do not memorize specific people whose data was used to train these models.
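The ensemble approach described in this answer can be sketched roughly as follows. This is a toy under stated assumptions: each sub-model is just the mean of its shard, standing in for a real trained model, and the `ShardedEnsemble` class, its method names and the patient records are hypothetical.

```python
# Sketch of shard-based ("ensemble") unlearning; names and data are hypothetical.
# Each sub-model is just the mean of its shard, standing in for a real model.

def train(records):
    """Toy 'model': the mean of the values in one shard (shard must be non-empty)."""
    values = [value for _, value in records]
    return sum(values) / len(values)

class ShardedEnsemble:
    def __init__(self, records, num_shards):
        # Assign each (patient_id, value) record to exactly one shard.
        self.shards = [records[i::num_shards] for i in range(num_shards)]
        self.models = [train(shard) for shard in self.shards]

    def predict(self):
        """Ensemble output: average of the sub-model outputs."""
        return sum(self.models) / len(self.models)

    def forget(self, patient_id):
        """Drop one patient and retrain only the shard that held their record."""
        for i, shard in enumerate(self.shards):
            if any(pid == patient_id for pid, _ in shard):
                self.shards[i] = [r for r in shard if r[0] != patient_id]
                self.models[i] = train(self.shards[i])
                return i
        raise KeyError(patient_id)

records = [("p1", 10.0), ("p2", 12.0), ("p3", 20.0),
           ("p4", 22.0), ("p5", 30.0), ("p6", 32.0)]
ensemble = ShardedEnsemble(records, num_shards=3)
before = ensemble.predict()
retrained_shard = ensemble.forget("p5")
after = ensemble.predict()
print(before, retrained_shard, after)
```

Only the shard that contained the forgotten patient is retrained, which is what makes the approach cheaper than retraining everything, and also why frequent deletion requests remain a cost, as the answer notes.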
Mike Vizard: Can we just anonymize the data? I mean, how far can we get by giving you data that doesn’t include, you know, personally identifiable information?
Gharib Gharibi: Yes. Cynthia Dwork, one of the great scientists, who used to work for Google and who also established the differential privacy domain, once said that anonymization does not work. And you might have heard about the Netflix Prize, where Netflix a long time ago tried to create a very good recommendation system. What they did is basically de-identify the user names of the people who watch things on Netflix and write reviews. So they stripped the user names from the reviews, put the data out publicly, and created a competition for people to build a very good recommendation system. Well, computer scientists and privacy enthusiasts went and started trying to figure out who the users were that created these comments or reviews about the movies. They were able to extract reviews from social media networks or from other platforms where people can post reviews about these movies, and tried to match what people write, because if I’m someone who’s interested in writing a review for a specific movie, then I’m most likely going to put it on Netflix, on Rotten Tomatoes and on other platforms that show that movie. So just removing my user name from that review doesn’t mean you cannot actually find me. So de-identification, specifically in the era of generative AI, does not really work. I strongly believe that the regulations and the HIPAA rules we have today are not sufficient. If you just remove my age, or you shift it by five years, but you keep my sequence, it’s very easy to identify that person. Or if you know when I went to the hospital and when I left the hospital, and I have a specific disease, I might have tweeted about it. So you can do a record linkage and be able to tell that the person was in the hospital for disease X, or that someone they love died in that period of time.
And then, if you have a de-identified data set, it’s very easy to link these records. So de-identification on its own, by removing direct personal identifiers, is not sufficient. And that’s why we need new methods that are based on secure multi-party computation. Data should never leave its source. These are some of the privacy-by-design approaches that we use at TripleBlind to make sure that there are stronger privacy guarantees than de-identification.
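The record linkage described in this answer can be illustrated with a small sketch. Every record below is invented for the example, and the choice of quasi-identifiers (admission date, discharge date, ZIP code) is an assumption rather than something taken from the interview.

```python
# Toy record-linkage sketch; all records below are invented for illustration.

# "De-identified" hospital data: names removed, quasi-identifiers kept.
deidentified = [
    {"admitted": "2023-03-01", "discharged": "2023-03-05",
     "zip": "64108", "diagnosis": "disease X"},
    {"admitted": "2023-03-02", "discharged": "2023-03-09",
     "zip": "64111", "diagnosis": "disease Y"},
]

# Auxiliary public data, e.g. what people posted about a hospital stay.
public_posts = [
    {"name": "Alice", "admitted": "2023-03-01",
     "discharged": "2023-03-05", "zip": "64108"},
    {"name": "Bob", "admitted": "2023-04-10",
     "discharged": "2023-04-12", "zip": "64108"},
]

QUASI_IDENTIFIERS = ("admitted", "discharged", "zip")

def link_records(private_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Re-identify private rows whose quasi-identifiers match exactly one public row."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    linked = []
    for row in private_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match pins the record to one person
            linked.append((names[0], row["diagnosis"]))
    return linked

matches = link_records(deidentified, public_posts)
print(matches)
```

Even though the hospital rows carry no names, the one row whose dates and ZIP code match a single public post is tied back to a person and a diagnosis.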
Mike Vizard: One of the upsides of health care in AI is that we should be able to even out the care because you can go to one doctor in Washington, and they might totally diagnose what your issue is. And the other one will take three or four tries at it. So will the overall improvements in health care warrant a lot of these investments, because we’ll be able to, I don’t know, for example, get to the root cause of why there are clusters of diseases in specific areas?
Gharib Gharibi: Yes, I think it will lead to that, I guess; it’s a two-sided issue, right? It might actually widen the gap in favor of the health care providers that today have tons of data and enough expertise and resources to train these AI models. But on the other hand, if we actually bring together regulators, academia and industry, and make sure that we are building these systems in the future to reduce inequity and give access to everyone, it might actually help a lot. AI models, at least today, are data centric: the more data we have, the better these models are going to perform. Therefore we need data from different geographical locations and from different populations. This means that we need to access data from different health care sources. And there are some other economic models that will have us also distribute these trained models across all the data providers that helped train that model. So it might actually level out these decisions and make decision-making in the health care domain easier.
Mike Vizard: Do we need to worry about biases being introduced into these AI models? Because the people building them have an agenda? They have cost control issues? There are racial factors that go into all this stuff. I mean, how do we need to think through this whole bias equation?
Gharib Gharibi: That’s a great question, and we have to worry about that. There’s what you described, the malicious agenda, for example, so intended bias. There’s also unintended bias. Again, as I just mentioned earlier, the systems we have today are data centric. AI models are nothing but programs that automatically generalize from the data we have. So if the data we have is already biased, then the models we have are most likely going to be biased as well. So think of a hospital that serves a rich area, mostly maybe old white men. Okay, so the model that will be generated will not really perform well on other demographics. And that might be unintended bias. There is also malicious bias. Health care providers might want to actually reduce the cost on a specific demographic of people; health insurers might use this information to actually increase the premiums if they know that you’re more likely to have a specific disease. Again, the problem of bias exists already today. We have seen several examples of it in actual applications of AI, whether it’s in decision support systems for hiring, or systems that were used in courts, etc. So bias already exists in AI systems, and we need to address that. Today, these systems are still data centric. So we have to make sure that when we are training these models, we are using data that has high quality and actually covers all possible outcomes and demographics of people, etc. And that’s only possible, again, with these methods that enable training AI models on decentralized data while preserving the privacy of the people. So again, regulators, industry and businesses need to come together; a lot of systems need to be rigorously validated before they are put in use. So we need validation, we need rigorous testing, we need regulations.
We need open source systems, we need proprietary systems to address all the potential risks of AI, not only bias.
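One simple form of the validation called for in this answer is to compare a model’s accuracy across demographic groups before it is put in use. The sketch below assumes hypothetical group labels and made-up evaluation results; it shows one basic check, not a complete fairness audit.

```python
from collections import defaultdict

def accuracy_by_group(examples):
    """examples: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in examples:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation results for a model trained mostly on group_a data.
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]

per_group = accuracy_by_group(results)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

A large gap between groups is a signal that the training data did not cover one population well, which is the unintended bias described above.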
Mike Vizard: So do you think that the regulators are up to speed on the implications of AI? Or has this whole thing left them in the dust and they’re still trying to figure out how to spell it?
Gharib Gharibi: They are very behind; I doubt they really have a grasp on what’s going on. I mean, even if you look at the AI community today, the scientists that are building these AI systems, there are different camps and different groups with different opinions. Some of them think AI has a super bright future that will help us solve a lot of problems, introduce environmental solutions, etc. And then there are other scientists and other groups within the AI community calling to halt progress on building these AI systems, saying that AI might result in human extinction. So if the creators of the AI systems still struggle to understand these systems and their implications, what do you think about regulators? I don’t think they really have answers today, but they need to be part of the solution in the future.
Mike Vizard: So what’s your best advice to folks in the health care sector right now? What should they be doing? Because clearly messing around with the public brute force, general purpose AI might not be up to the mission. So what should they be thinking about?
Gharib Gharibi: They should be getting in touch with me to tell them how they can build AI systems in a private and secure way. We have solutions today that actually enable that. They are not as efficient: if you write a prompt in a fully secure ChatGPT version, you will not see the answer right away on your screen. It might take a couple of minutes instead of a fraction of a second or a thousandth of a second. ChatGPT is blazingly fast, GPT-3 specifically; the secure version is going to be a little bit slower, but that efficiency hit comes with the reward of privacy. So there are a lot of solutions out there today that enable you to build systems that are very good for your specific problem. And these AI systems do not have to be super large, like GPT-4 with hundreds of billions of parameters. There’s evidence today that you can train a narrow AI in a specific field with a small model, even something like NanoGPT, which is a toy example that was built by one of the leaders in the AI domain just to show how GPT is built under the hood; that model is available on GitHub. Anyone can download it locally and fine-tune it on their own data in a private and secure way. That model’s size is about a couple of hundred, or maybe 240, million parameters. So it’s nowhere near the size of GPT-4. But for a narrow AI task, it proved that it can do very well, so we can train it to predict the ICD codes of patients from doctors’ notes, or give answers to patients about their genomes, or things like that. So I encourage the health care domain to continue and accelerate AI research in their domain while paying great attention to the privacy, security and other ethical implications of this usage. So don’t slow down; just make sure that you are following ethical, privacy-preserving methods for training these systems and serving them.
Mike Vizard: Alright, folks, the more things change, the more they stay the same. An AI model that is a Jack-of-all-trades is, as always, master of none. Gharib, thanks for being on the show.
Gharib Gharibi: Thank you very much. I appreciate the opportunity of being here.
Mike Vizard: And thank you all for watching the latest episode of Techstrong.ai. You can find this episode and others on the Techstrong.ai site. We invite you to check them all out, and until then, we’ll see you all next time.