Data-friendly AI systems

//by Klaus Meffert//

Artificial intelligence (AI) has been on the rise since the launch of ChatGPT. The possibilities of these new methods are enormous. Thanks to powerful hardware, especially graphics cards, computationally intensive electronic brains can now provide sophisticated answers in a short period of time.

AI systems are based on mass data. The more data available for training an AI, the better. This first training phase is called pre-training.

The core of an AI system is the so-called model. Many people are now familiar with the term “language model” or “large language model”. The adjective “large” stands for the enormous amount of training data and the immense size of the electronic brain.

Pre-trained AI models are ready-made electronic brains that can be used for predetermined tasks. However, they can also be fine-tuned for specific tasks.

Fine-tuning does not require as much data as pre-training an AI model. Often, even ten examples are sufficient to achieve a noticeable improvement, as the sketch below illustrates.
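As a rough illustration, the following minimal sketch fine-tunes a small pre-trained model on ten toy examples. It assumes the open-source Hugging Face transformers and datasets libraries are installed; the model name, the example sentences, and the training settings are hypothetical placeholders, not a recommendation.

  # Minimal fine-tuning sketch (assumes: pip install transformers datasets torch accelerate).
  # The ten example sentences and all settings are illustrative placeholders.
  from datasets import Dataset
  from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                            Trainer, TrainingArguments)

  texts = ["Great product!", "Terrible service."] * 5   # ten toy examples
  labels = [1, 0] * 5                                   # 1 = positive, 0 = negative

  tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained(
      "distilbert-base-uncased", num_labels=2)

  # Tokenize the examples so the model can process them.
  dataset = Dataset.from_dict({"text": texts, "label": labels}).map(
      lambda e: tokenizer(e["text"], truncation=True, padding="max_length",
                          max_length=32))

  # A few passes over ten examples already shift the model's behavior noticeably.
  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                             per_device_train_batch_size=2),
      train_dataset=dataset,
  )
  trainer.train()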

When a language model is queried, the AI system draws on its training data. This data is stored in encoded (compressed) form, comparable to the human brain and its neurons. However, it is easily possible, and by design even intended, that exact quotes from the training data appear in AI responses. The same applies to AI systems that generate images or music.

Anyone who has access to an AI system can potentially extract all input data. In addition, the data contained in the questions users pose to a language model can be stored by the operator of the AI system. Both the training data and the input data can therefore potentially be disclosed in AI responses.

That means, firstly, that the user has to check in advance whether he is bound by any legal restriction covering the use of the input data. Secondly, the user can restrict the handling of his data through contractual clauses agreed with the AI provider.

However, such legal restrictions will not eliminate the risk that the user de facto loses control over his data. Even if the data is covered by the GDPR, trade secret law, or copyright law, the user gives up sovereignty over his data and over how his data will be processed by the AI system.

In addition, and regardless of any legal prohibitions, a transfer of data to third parties is often simply undesirable for the data owner.

The only way to truly ensure the protection of data in the AI environment is to use your own AI system. I call such systems self-sufficient AI systems or local AI systems because they are operated on separate local servers.

To summarize the core intent, I call such AI systems data-friendly. The main characteristic of a data-friendly system is that the flow of data remains completely under the user's control. Data-friendly AI systems do not require an internet connection because they do not send data to third parties.

Data-friendly AI systems

The question is: how do you create data-friendly AI systems that work without ChatGPT or similar services?

The good news is that there are now extremely powerful open source components for AI systems.

These components are essentially:

– program libraries

– AI models

Program libraries are frameworks that allow you to create your own AI applications in the shortest possible time. The Python programming language provides the technological basis.
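As a first impression, the following minimal sketch runs a language model entirely on your own machine. It assumes the open-source Hugging Face transformers library and PyTorch are installed; the model name is merely one example of a small, freely downloadable model.

  # Minimal local text-generation sketch (assumes: pip install transformers torch).
  from transformers import pipeline

  # Downloads an open-source model once, then runs it locally;
  # neither the prompt nor the answer leaves your own machine.
  generator = pipeline("text-generation", model="gpt2")
  result = generator("Data-friendly AI systems are", max_new_tokens=30)
  print(result[0]["generated_text"])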

AI models are prefabricated electronic brains. An AI model consists of files, typically main files and configuration files. The main files form the electronic brain; smaller models consist of just one such file, while larger models are split across several files, each several gigabytes in size. Configuration files tell the program library which architecture a model is based on. All these files can be downloaded to your computer and then used in AI applications.
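To make this concrete, the following sketch downloads all files of a model into a local folder. It assumes the open-source huggingface_hub library; the repository name is just an example.

  # Download all model files for local use (assumes: pip install huggingface_hub).
  from huggingface_hub import snapshot_download

  # Fetches the main weight files and the configuration files of the model
  # into a local directory; afterwards the model can be used fully offline.
  local_dir = snapshot_download(repo_id="gpt2")
  print(f"Model files stored at: {local_dir}")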

By combining program libraries and AI models, high-performance AI applications can be programmed in a short period of time.

Typical use cases are:

  • Question-answer system (chatbot)

  • Translation (now between over 100 languages, in every conceivable combination; see the sketch after this list)

  • Automatic reasoning

  • Audio transcription

  • Image generation

  • Object recognition

  • Video generation

  • Data analysis
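As an example of the translation use case mentioned above, the following sketch translates German text into English entirely locally. It again assumes the transformers library; Helsinki-NLP/opus-mt-de-en is one example of a freely downloadable translation model for the German-to-English direction.

  # Minimal local translation sketch
  # (assumes: pip install transformers torch sentencepiece).
  from transformers import pipeline

  translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
  result = translator("Datenfreundliche KI-Systeme senden keine Daten an Dritte.")
  print(result[0]["translation_text"])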

Developments over the last two years reveal several new aspects of AI applications.

1. Open source

With a few exceptions, open source is driving the relevant developments in the AI sector. Every day, new program libraries and AI models become available for download.

Research papers, mostly published on arxiv.org, often result in program libraries. In the past, a research paper was typically published without any accompanying implementation. Today, the associated program code is published at the same time, whether by the authors of the paper or by other programmers.

This means that extremely powerful options are available for every organization.

2. Hardware

Essentially, AI calculations take place on the many thousands of processor cores of graphics cards. A graphics card (GPU, Graphics Processing Unit) is repurposed to perform a large number of floating-point calculations extremely quickly; GPUs are often 100 times faster than CPUs at this. A CPU is the conventional processor in a computer. While CPUs offer around 24 cores depending on the model, GPUs deliver 15,000 cores or more.
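The following minimal sketch shows how such code typically selects the GPU when one is available and falls back to the CPU otherwise. It assumes PyTorch is installed; the matrix size is an arbitrary illustration of a floating-point workload.

  # GPU vs. CPU sketch (assumes: pip install torch).
  import torch

  # Use the graphics card if present; otherwise fall back to the CPU.
  device = "cuda" if torch.cuda.is_available() else "cpu"

  x = torch.rand(4096, 4096, device=device)  # large matrix of random floats
  y = x @ x                                  # a heavy floating-point workload
  print(f"Computed a 4096 x 4096 matrix product on: {device}")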

It is these capabilities that make AI calculations feasible in the first place.

The first calculation is the pre-training of an AI model. This typically takes several hundred thousand GPU hours, even on high-performance GPUs.
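To put this into perspective: assuming, for illustration, 300,000 GPU hours spread across 1,000 GPUs running in parallel, pre-training would still occupy 300,000 / 1,000 = 300 hours, i.e. roughly twelve and a half days of wall-clock time.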

Further calculations are necessary when an AI model is queried and an answer is to be generated. This process typically takes just a few seconds.

3. New methods and algorithms

With the help of increasingly sophisticated mathematical methods, AI models can be trained and queried faster and better. One example is Stable Diffusion Turbo: a new model architecture makes it possible to generate an image from a text input in less than a second on affordable hardware. A few months earlier, this operation took around five times as long.
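As a sketch of what this looks like in practice, the following code generates an image with SD Turbo in a single denoising step. It assumes the open-source diffusers library and a CUDA-capable GPU; the prompt and file name are placeholders.

  # Single-step image generation with SD Turbo
  # (assumes: pip install diffusers transformers accelerate torch, plus a CUDA GPU).
  import torch
  from diffusers import AutoPipelineForText2Image

  pipe = AutoPipelineForText2Image.from_pretrained(
      "stabilityai/sd-turbo", torch_dtype=torch.float16).to("cuda")

  # SD Turbo is designed for one inference step and no classifier-free guidance.
  image = pipe("a lighthouse at sunrise",
               num_inference_steps=1, guidance_scale=0.0).images[0]
  image.save("lighthouse.png")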

Under these conditions, there are various possibilities for creating your own AI systems. Depending on the requirements, such projects can be implemented very quickly and reach a level of performance that was barely imaginable a few years ago, or was reserved for a few powerful companies.

Conclusion

Hardly any other type of system processes as much data as an AI system. Data has long been considered a valuable raw material, and the value of mass data has increased even further now that it serves as the breeding ground for electronic brains.

Anyone who enters data into an AI system can be sure that this data will be processed by the AI system (for personal data, see Art. 4 (1) GDPR); the AI system could not operate otherwise. Besides personal data, trade secrets and other confidential data are likely to be the most critical categories of data for many companies. This data is particularly worth protecting.

Every owner of valuable data should check carefully to whom the data is entrusted. Misuse cannot be prevented by formal legal agreements alone. Transferring data to a third party always carries the risk of losing control over that data.

These risks do not exist when a company uses its own AI system. In that case, the data gold is stored securely in your own systems, and both third-party access to the data and technological dependence on third parties are eliminated.

Let the AI age rise by using your own data-friendly AI systems. A good starting point is an AI-powered search function that can search your internal documents or e-mails, as sketched below.
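To close with something tangible, the following minimal sketch builds such an internal search over a handful of documents using semantic embeddings. It assumes the open-source sentence-transformers library; the model name and the example documents are placeholders.

  # Local AI-powered document search (assumes: pip install sentence-transformers).
  from sentence_transformers import SentenceTransformer, util

  # Example stand-ins for internal documents or e-mails.
  documents = [
      "Invoice 2023-17 for the server hardware order.",
      "Meeting notes: rollout plan for the new intranet search.",
      "Reminder: data protection training on Friday.",
  ]

  # A small embedding model that runs entirely on local hardware.
  model = SentenceTransformer("all-MiniLM-L6-v2")
  doc_embeddings = model.encode(documents, convert_to_tensor=True)

  # Embed the query and find the most similar documents; nothing leaves the machine.
  query_embedding = model.encode("When is the privacy training?", convert_to_tensor=True)
  hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]

  for hit in hits:
      print(f"score={hit['score']:.2f}  {documents[hit['corpus_id']]}")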