Is it possible to influence the output of large language models (LLMs)? Or to put it another way: is there a way to ‘optimize’ your content so it becomes more visible in the outputs of various LLMs?

This is a question many marketers have been asking with concern recently.

In a nutshell, the answer is: yes, you can. However, the process isn’t quite what you might initially envision.

Over 40% of AI professionals are currently exploring ways to optimize generative AI outputs

source: verbit.ai

And …

Around 60% of AI researchers believe influencing generative AI outputs is possible and necessary

source: Salesforce

Current Limitations / Disclaimers

Before we delve into the subject and discuss how to optimize your content’s visibility on LLMs, here are a few disclaimers:

  • Unlike traditional SEO, where optimization rules are relatively well-known and stable, AI algorithms are subject to change and adaptation. This makes it more challenging for brands to consistently optimize their content.
  • Usefulness always prevails, be it on SEO or LLMs. This implies that merely stuffing keywords or employing outdated SEO tactics is unlikely to be effective. Brands must produce genuinely useful, original content, which can be resource-intensive.
  • Models have fixed knowledge cutoff dates, and these vary by model version. At the time of writing:
    • GPT-4’s knowledge cutoff is April 2023.
    • GPT-3.5’s cutoff is January 2022.
    • Google Gemini’s cutoff is in early 2023.

What is LLM optimization or generative AI optimization (GAIO)?

Generative AI optimization, often referred to as GAIO, aims to help businesses position their offerings and brands within the generated outputs of leading LLMs, such as GPT and Google SGE.

These models command considerable attention because they can sway many future purchasing decisions.

Consider this scenario: ask Bing Chat for the best running shoes for a person weighing 96 kilograms and running 20 kilometers per week, and you will get recommendations for brands like Brooks, Saucony, Hoka, and New Balance.

How are LLMs trained?

As fascinating as it may be, the training of large language models (LLMs) isn’t as daunting as it might sound. So what principles actually govern these training processes?

Firstly, it’s crucial to understand that LLMs follow deep learning methodologies, which are a category of machine learning methods based on artificial neural networks. From this perspective, LLM training primarily revolves around preparing the data – the raw material of this voyage. This involves transforming data into a specific format called JSONL, where each line is a unique prompt-completion pair, i.e., one training example. Remember, the finer the data, the smoother the journey.
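
To make the format concrete, here is a minimal Python sketch of writing prompt-completion pairs as JSONL, one training example per line. The file name and example pairs are purely illustrative.

```python
import json

# Each training example is a prompt-completion pair (illustrative content).
examples = [
    {"prompt": "What is generative AI optimization?",
     "completion": "A set of practices for improving brand visibility in LLM outputs."},
    {"prompt": "Name a quality signal used in GPT-3's training data.",
     "completion": "Reddit submissions with 3 or more karma points (WebText2)."},
]

# JSONL: one JSON object per line, newline-separated.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```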

Moreover, weight selection plays a significant role, especially with techniques such as L2-SP(G), Freeze-D, and Freeze-G. Customarily, weights of 0.1, 1, and 10 are tried, with regularization weights all set to 1. It’s a fine balance that influences training speed, which, incidentally, is faster with partial fine-tuning, based on the results of various experiments.
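
As a rough illustration of partial fine-tuning, here is a minimal PyTorch sketch using the Hugging Face transformers library. The "gpt2" checkpoint and the choice to unfreeze only the last two transformer blocks are assumptions made for this example, not a recipe from the techniques named above.

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze every parameter first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the last two transformer blocks and the final
# layer norm, so gradient updates (and training time) stay small.
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():
    param.requires_grad = True
```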

Additionally, generative models can also be optimized using a small auxiliary network, such as a 2-layer Multi-Layer Perceptron (MLP) with ReLU activation functions. The choice of hyperparameters, for instance in the GLO model, is guided by the values proposed by the model’s authors. Various methods are used to tweak these models and influence their outputs; MineGAN, for example, is one method that helps achieve this goal.
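
For reference, such a small auxiliary network is simple to express. Here is a minimal, hypothetical 2-layer MLP with ReLU activation in PyTorch; the layer dimensions are illustrative, not taken from GLO or MineGAN.

```python
import torch.nn as nn

# A 2-layer MLP with a ReLU activation between the layers.
mlp = nn.Sequential(
    nn.Linear(128, 256),  # layer 1: input -> hidden
    nn.ReLU(),
    nn.Linear(256, 128),  # layer 2: hidden -> output
)
```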

However, it’s key to remember that noise or mislabeled examples in the training data can negatively impact the supervised learning loss. It’s always important to maintain the integrity of the data to avoid such pitfalls and to keep LLM optimization on track.

Equipped with these insights, let’s delve deeper into the fascinating world of LLMs and their multilayered training processes.

GPT dataset

You may wonder, “What is the GPT dataset used for GPT-3.5 and GPT-4?” Well, let’s dive into that. At the heart of these cutting-edge language models lies a robust and diverse dataset. This dataset is the fuel that propels the intricate workings of these AI machines, enabling them to deliver stellar performance in text generation, completion, and understanding tasks.

Primarily, the GPT dataset employed for GPT-3.5 and its successor, GPT-4, serves the following purposes:

  • Train the model: The dataset serves as the learning resource, helping the model comprehend text semantics and structures, enhancing its competency to generate coherent and contextually relevant content.
  • Improve prediction capability: By exposing the model to a wide range of text data, it learns to predict subsequent words in a sentence more accurately, refining its text generation skills (see the next-token prediction sketch after this list).
  • Facilitate fine-tuning: The dataset allows for model customization using techniques like supervised learning, enabling it to produce specialized outputs as per user needs.
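
To make the prediction objective concrete, here is a minimal sketch of next-token prediction in Python. It uses the publicly available "gpt2" checkpoint from the Hugging Face transformers library as a stand-in, since GPT-3.5 and GPT-4 themselves are not open.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Score every vocabulary entry as a candidate next token.
inputs = tokenizer("The best running shoes for heavy runners are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The model's single most likely continuation of the prompt.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```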

These datasets, brimming with information from a vast range of sources (books, articles, websites, and more), provide a broad context for the LLM to learn from, allowing it to understand and emulate the idiosyncrasies of human language with increasing precision.

In essence, the GPT dataset for GPT-3.5 and GPT-4 serves not just as a training resource, but as the foundation on which these models build their language understanding and generation capabilities.

GPT-3’s data was aggregated from the following different sources:

Dataset        # of tokens   Proportion   Boosted
Common Crawl   410 billion   60%          –
WebText2       19 billion    22%          5x
Books1         12 billion    8%           –
Books2         55 billion    8%           –
Wikipedia      3 billion     3%           5x

(sources: https://en.wikipedia.org/wiki/GPT-3 and https://arxiv.org/pdf/2005.14165.pdf)
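
To see what “boosted” means in practice, here is a quick Python sketch that derives the approximate per-token sampling rate of each source, relative to Common Crawl, from the token counts and mix proportions in the table above. (The table’s “5x” figures are rough; this calculation yields values in the same general range.)

```python
# (tokens in billions, share of the training mix), per the table above
datasets = {
    "Common Crawl": (410, 0.60),
    "WebText2":     (19,  0.22),
    "Books1":       (12,  0.08),
    "Books2":       (55,  0.08),
    "Wikipedia":    (3,   0.03),
}

# Per-token sampling weight, normalized so that Common Crawl = 1x.
baseline = datasets["Common Crawl"][1] / datasets["Common Crawl"][0]
for name, (tokens, share) in datasets.items():
    boost = (share / tokens) / baseline
    print(f"{name}: ~{boost:.1f}x per token vs. Common Crawl")
```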

  1. Common Crawl: This is essentially a replica of the web index. Common Crawl scans the web and freely provides its dataset and archive, which includes all sorts of content such as images, video assets, and links. It even contains different versions of the same website, similar to the Wayback Machine. For content optimization, note that the crawler adheres to nofollow and robots.txt policies (see the robots.txt sketch after this list). As of June 2023, Common Crawl comprises about 3.1 billion pages and is estimated to cover around 60 million different domains (sources: https://en.wikipedia.org/wiki/Common_Crawl and https://commoncrawl.org/overview).
  2. Books1 and Books2: These are akin to a vast library, referring to publicly available books, primarily published in English, totaling about 200,000 titles. They were used to train the model, and it’s worth noting that content from Books1 and Books2 is slightly prioritized over Common Crawl (websites), as the per-token weighting in the table above reflects.
  3. Wikipedia (English-only): While Wikipedia’s size is only about 1% of Common Crawl by token count, its content is boosted roughly 5x relative to Common Crawl, so its overall influence is around 3%. Given that Wikipedia’s content is typically of higher quality than the average website in Common Crawl, it’s understandable that its content was boosted.
  4. WebText2: Used by OpenAI as a quality signal, WebText2 includes the content of URLs shared in Reddit posts/submissions that received 3 or more karma points. In other words, it covers website URLs from Reddit posts that garnered at least 3 votes. Since not all 3.1 billion Common Crawl pages are of equal quality, WebText2 adds a layer of quality scoring. These Reddit-sourced pages are boosted about 5x compared to Common Crawl, contributing about 22% of the model’s total training mix (source: https://openwebtext2.readthedocs.io/en/latest/background/).
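
Since Common Crawl’s crawler (user agent CCBot) respects robots.txt, whether your pages can enter this dataset at all is partly under your control. Here is a hedged robots.txt sketch; the "/private/" path is purely illustrative. For LLM visibility, you would normally leave CCBot allowed.

```
# Rules for Common Crawl's crawler
User-agent: CCBot
Disallow: /private/
Allow: /
```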

Can the outputs of generative AI be influenced proactively?

Assuming that future AI models continue to follow a similar pattern, we propose the following four primary strategies to improve your brand and content visibility for Large Language Models (LLMs). These strategies are listed in order of importance:

  1. Reddit SEO/Content: This strategy specifically targets content posted within the Reddit platform. Reddit SEO focuses on including relevant topics within the title and body of your Reddit posts while simultaneously fostering engagement. The engagement is evaluated by the number of upvotes, comments, and shares a post receives. These posts are considered more valuable by Reddit’s internal algorithms, which in turn boosts their performance in search engine rankings and LLMs. Creating high-quality content that resonates with the Reddit community can lead to increased engagement, thus improving your brand’s visibility on the platform and potentially influencing the AI’s perception of your brand.

Here are a few things to keep in mind when getting started with Reddit SEO: 

1. Understand the Community: Each subreddit has its own unique culture and norms. Spend time understanding these norms and the type of content that resonates with the community. 

2. Create Valuable Content: Reddit users value high-quality, original content that contributes to the discussion. Create posts that offer unique insights or perspectives. 

3. Engagement is Key: Reddit SEO is not just about posting content but also about engaging with the community. Respond to comments on your posts and participate in other discussions. 

4. Optimize Your Titles: The title of your Reddit post is one of the most important factors for Reddit SEO. Make sure your title is descriptive and contains relevant keywords. 

5. Post at the Right Time: Reddit users are more active at certain times of the day. Research the best times to post to increase the visibility of your content. 

6. Use Links Wisely: While it’s allowed to include links in your posts, excessive self-promotion can lead to downvotes or even bans. Make sure any links you include are relevant and add value to your post.

  2. Wikipedia: Another effective strategy involves influencing Wikipedia pages related to your brand or industry. This can be accomplished by positioning your brand on various relevant pages or even creating dedicated product Wikipedia pages. This can significantly enhance your brand’s visibility on AI-powered searches. However, it’s worth noting that manipulating Wikipedia content can be a tricky and delicate process. We strongly recommend partnering with experienced Wikipedia authors or agencies to successfully boost your visibility on the platform. Undertaking this task without any Wikipedia history or expertise can potentially harm rather than enhance your efforts.

    To improve visibility on Wikipedia, brands can consider the following strategies:

    • Create a Brand Page: If your brand meets Wikipedia’s notability requirements, consider creating a dedicated Wikipedia page that details your brand’s history, products/services, and notable achievements. Ensure the information is verifiable, unbiased, and written in an encyclopedic style.
    • Edit Relevant Pages: Contribute to existing Wikipedia pages that are relevant to your industry or brand. This could include adding your brand as an example in a particular category or updating outdated information.
    • Cite Reliable Sources: Wikipedia values references that can verify the information provided. Ensure to cite reliable, third-party sources wherever possible to improve the credibility of the information related to your brand.
    • Follow Wikipedia Guidelines: Ensure to adhere strictly to Wikipedia’s guidelines for content addition and editing. This includes avoiding promotional language, respecting the neutrality of content, and not engaging in edit wars.
    • Engage Experienced Wikipedia Authors: Considering the platform’s strict guidelines and the potential backlash against perceived self-promotion, it might be beneficial to engage experienced Wikipedia authors or agencies to create or edit pages related to your brand.
  3. Books: Although creating and publishing your own books might seem like a daunting task, it can be a worthwhile long-term investment. By publishing your books via open-source licenses on the internet, you can effectively disseminate your brand’s message and values. However, it’s worth noting that this strategy requires a substantial amount of resources and time compared to strategies involving Reddit and Wikipedia. Therefore, while writing and publishing books can contribute to your brand visibility in the long run, we do not recommend it as the first step in your LLM optimization journey.

  4. Overall Web Content: The largest source of data for LLMs like GPT-3 is the Common Crawl, which is essentially a snapshot of the entire web. By creating more engaging content on the internet, you can ultimately impact your brand’s AI ranking. This strategy is similar to improving your Google search rankings, meaning that optimizing your brand for AI-powered searches isn’t drastically different from traditional SEO practices. However, it’s important to remember that your competition is also creating content daily. Therefore, companies that consistently produce high-quality content and implement effective marketing strategies will generally have better visibility on LLMs.

In summary, while the landscape of SEO is evolving with the advent of AI and LLMs, the core principles remain the same. Creating high-quality, engaging content that provides value to its audience is key to improving your brand’s visibility, whether it’s on traditional search engines or AI-powered platforms.
