AI-enabled large language model speeds up wells data retrieval – but must be used with care
Use of the tool can streamline workflows, but model-generated responses still need to be considered in context with domain expertise
By Jessica Whiteside, Contributor
Data management and analytics aided by AI could open a new frontier of discovery for the oil and gas sector, uncovering performance-enhancing insights from the vast and intricate data generated by today’s digitally enabled operations. That is the promise driving companies to invest in the smart tools, sensors and cloud technologies of the industry’s unfolding digital transformation.
For many companies, however, the gusher of current and historical data being generated is overwhelming. Much of this information remains untapped and siloed in various databases across an organization, perhaps stored in rig equipment, in years of well logging records, in presentation slides, or in PDF reports.
Improving data access and usability was a key driver behind an information retrieval study launched by technology company Intellicess and Apache, which tested the use of AI to get the right information to users at the right time, with the goal of optimizing well construction planning. The study deployed a generative pre-trained transformer (GPT) large language model (LLM), paired with a retrieval-augmented generation (RAG) model, to pull information from a dataset associated with more than 200 wells in a region where Apache was actively drilling.
This dual-model approach made it much easier for users to access data relevant for well planning and led to significant time savings, said Michael Yi, Chief Data Scientist with Intellicess, at this year’s IADC/SPE International Drilling Conference in Galveston, Texas.
“In this industry, we have a lot of data, but because this data is so hard to reach, it makes it very difficult for people to use it,” he said. “Having a tool like this makes that process much more streamlined and can get people starting to look into the data if it’s as simple as just asking a question to a chatbot.”
Large language models
LLM-based AI can sift through extensive datasets and swiftly provide direct, context-aware responses to users’ requests. GPT is a family of LLMs developed by OpenAI that provides the foundation for the well-known ChatGPT chatbot and other AI applications requiring the generation of human-like text. Instead of spending hours hunting for a data source or studying lengthy documents, a user can simply pose a question through the GPT LLM interface and receive a response with the requested information.
“You get very personalized responses,” Mr Yi said. “The responses are also very conversational, so it’s almost like you’re texting a person. You can say, ‘Hey, what is stick-slip?’
“It does make it very natural to actually understand the information that it’s providing.”
Despite their linguistic and data-processing prowess, GPT LLMs have limitations when it comes to information retrieval. Because their learning process requires massive amounts of data, GPT LLMs are primarily pre-trained on publicly available data from the internet. This means they have a general understanding of the data on which they trained but may lack knowledge of recent developments and may not have access to the domain-specific or internal data that a wells professional might require.
There is also a risk that GPT LLMs could generate inaccurate or nonsensical information, and the veracity of their responses can be challenging to verify because the models do not cite the sources behind their answers.
Retrieval-augmented generation
The study aimed to address these concerns in a well-planning context by integrating GPT LLM with RAG. RAG expands the data sources used to inform the model’s responses, enabling the inclusion of more current information or internal data. RAG features a retriever module that acts as a go-between to fetch domain-relevant information, such as an operator’s wells data, from a separate knowledge database.
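The article does not show the study’s implementation, but the retriever-as-go-between idea can be illustrated with a toy sketch. The version below ranks documents by simple term overlap, standing in for the embedding similarity a real vector database would use; the document text and function names are invented for illustration only.

```python
# Toy sketch of the retriever step in a RAG pipeline. A production
# system would use embedding similarity over a vector database rather
# than term overlap, but the flow is the same: retrieve relevant
# records, then hand them to the LLM as grounding context.

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many query terms they share."""
    q_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved context with the user's question for the LLM."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

docs = [
    "Monthly report: Well A reached TD in 9 days, fastest this month.",
    "Drilling program for Well B, lateral section 8000 ft.",
]
context = retrieve("fastest well drilled this month", docs)
prompt = build_prompt("Which was the fastest well drilled this month?", context)
```

With the internal monthly report retrieved as context, the LLM can answer the “fastest well” question it could never answer from public pre-training data alone.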
When GPT LLM is enhanced with RAG, a user can ask the system specific questions about a set of operations as long as the relevant data is available, Mr Yi said. For instance, a user could ask which was the fastest well drilled by their organization in the past month.
“If it’s just the large language model itself, then of course it’s not going to know the answer. It doesn’t have any data sources to pull from — it may have some public information, but that’s likely not relevant to your operation,” Mr Yi said. But with RAG providing access to a user’s own data, the system may be able to respond, for example: “According to this monthly report that I pulled here, the fastest well drilled is Well A.”
Data curation and retrieval
The study used both structured and unstructured data for RAG, consolidating data from drilling programs, time logs covering operational details and nonproductive time (NPT) events, presentations on well construction, information on drilling practices and technical papers. This diverse information was added to a cloud indexing and querying platform that served as a vector database to enable advanced search functions. The indexing done by this platform is important because it ensures the retrieval process is fast when users ask questions, Mr Yi said.
Curating this data to ensure it included only high-quality, well-organized information was pivotal to the effectiveness of the RAG framework but was also time-consuming, he added. A vital step was the application of metadata, tags and categorization to help RAG filter the data during retrieval, so that responses would be as accurate as possible based on the criteria in the user prompt (e.g., requesting data on wells in specific locations or over a specific time frame).
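That filtering step can be sketched in a few lines. The record fields (region, well, date) and example events below are assumptions made for illustration, not the study’s actual schema; the point is that metadata lets the retriever narrow the candidate set before any similarity search runs.

```python
from datetime import date

# Hypothetical records tagged with metadata so retrieval can be
# restricted to match the criteria in a user's prompt (e.g. a
# specific region and time frame). Field names are illustrative.
records = [
    {"text": "Stuck-pipe event during tripping.", "region": "Permian",
     "well": "Well A", "date": date(2023, 11, 2)},
    {"text": "Record ROP in lateral section.", "region": "Permian",
     "well": "Well B", "date": date(2024, 1, 15)},
    {"text": "Casing run without incident.", "region": "Gulf Coast",
     "well": "Well C", "date": date(2024, 1, 20)},
]

def filter_records(records, region=None, since=None):
    """Keep only records whose metadata matches the query criteria."""
    out = []
    for r in records:
        if region is not None and r["region"] != region:
            continue
        if since is not None and r["date"] < since:
            continue
        out.append(r)
    return out

# "What happened recently in the Permian?" narrows three records to one.
recent_permian = filter_records(records, region="Permian", since=date(2024, 1, 1))
```

Filtering first also keeps the similarity search cheap and the answers on-topic, which is why the curation and tagging effort the study describes pays off at query time.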
Quality control to ensure that information added to the database was relevant, reliable and up to date was also critical to generating effective responses, as was the use of standardized formats and units to support consistency and dataset comparisons. The interface that allowed users to communicate with the GPT LLM included a page to upload data, along with associated metadata and contextual information, to the RAG model.
“It’s extremely important to make sure that the data you give it is clean and as good as possible,” Mr Yi said.
Transparency and reliability
Users can also help improve the usefulness and accuracy of the tool by crafting query prompts that clearly state the context or level of detail they expect to receive in a response. For example, this could be the difference between asking, “What is rotary stick-slip?” and “Can you describe the causes of stick-slip and how it is detected using surface data?” Questions posed to the platform during testing included “What happened the last time there was a stuck pipe in this region?” and “What is the best ROP that could be attained in the lateral section?”
Regardless of the question posed, users should not take the responses at face value, Mr Yi cautioned. The tool is not a definitive source of verified information, so the user’s own domain expertise remains essential.
“Look at the answer and say, ‘Is this realistic or is it not realistic?’ ” he said. “You do need to have some basic knowledge on the questions you’re asking.”
A benefit of adding RAG to a GPT LLM is that the answer returned by the model can include citations for the source data. That way, the user will know where the information came from and can verify any uncertainties. The model can also be prompted to say, “I don’t know,” if it doesn’t have the requested information.
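One common way to obtain citations and an “I don’t know” fallback is to instruct the model in the prompt and then lightly validate the answer. The sketch below illustrates that pattern; the source IDs, prompt wording and helper names are invented for illustration, not the study’s actual prompts.

```python
# Illustrative pattern: the prompt sent with each query tells the model
# to cite retrieved sources by ID and to abstain when the context does
# not contain the answer. A simple post-check looks for a citation.

def grounded_prompt(question: str, sources: dict[str, str]) -> str:
    """Build a prompt that demands citations and permits abstention."""
    ctx = "\n".join(f"[{sid}] {text}" for sid, text in sources.items())
    return (
        "Answer using only the sources below. Cite the source id in "
        "brackets after each claim. If the sources do not contain the "
        "answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{ctx}\n\nQuestion: {question}"
    )

def has_citation(answer: str, sources: dict[str, str]) -> bool:
    """Light validation: did the answer reference a known source?"""
    return any(f"[{sid}]" in answer for sid in sources)

sources = {"rpt-03": "Monthly report: Well A was the fastest well drilled."}
prompt = grounded_prompt("Which was the fastest well?", sources)
```

An answer such as “Well A was fastest [rpt-03].” passes the check and points the user to the exact document to verify, which is the transparency benefit Mr Yi describes.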
“It helps prevent this language model from just being a black box so you can see what information it’s using to communicate the answer to you. It helps validate the response,” Mr Yi said.
Results and future considerations
The study used the combined GPT LLM and RAG approach to provide engineers and other users with answers to questions on real-time drilling practices; the information also helped them understand historical wells for well-planning purposes. Testing assessed the speed and accuracy of the model’s responses and found that the combined use of GPT LLM and RAG gave users faster access to data and associated insights, while the availability of citations increased their understanding of the information’s validity.
This streamlined access to traditionally fragmented data holds a lot of potential for optimizing data-driven decision making in well construction, Mr Yi said. However, more work is needed to streamline the process and improve reliability, such as automating data collection and processing to minimize manual curation and adding feedback mechanisms so users can correct errors in responses. At this point, it’s still a technology that must be used carefully, he said.
“It’s crucial to understand that there are risks of using it. There needs to be an understanding that this is a tool, not always the 100% answer. But as long as you understand that, I think it can be very, very, very useful in making your workflow much more efficient and clean.” DC
For more information, please see IADC/SPE 217700, “Applications of Large Language Models in Well Construction Planning and Real-Time Operation,” presented at the 2024 IADC/SPE International Drilling Conference, 5-7 March 2024, Galveston, Texas.