In Artificial Intelligence, the advent of Large Language Models (LLMs) has ushered in a wave of innovation, empowering users to unleash their productivity and creativity. However, their significant size often translates to substantial computational demands. Over time, Small Language Models (SLMs) have emerged, expanding our ability to engage with diverse natural and programming languages. Nevertheless, certain user inquiries demand greater precision and specialized knowledge beyond the scope of generalized language models.

Consequently, there's a growing need for tailored Small Language Models capable of rivalling LLM performance while mitigating runtime costs and ensuring a secure and easily manageable framework. While LLMs remain paramount for addressing complex tasks, Microsoft has spearheaded the development of a series of SLMs that retain many LLM capabilities but boast smaller sizes and are trained on more compact datasets.

26 Apr 2024

    Interested in the article or the service offering? Get in touch with us:

    What is an SLM?

    A Small Language Model (SLM) is tailored to excel in simpler tasks, offering boosted accessibility and user-friendliness for organizations operating with limited resources. Besides, they can be readily fine-tuned to align with specific requirements. Small language models are particularly well-suited for organizations aiming to develop applications capable of operating local devices instead of relying on cloud infrastructure. They are especially beneficial for tasks that do not necessitate extensive reasoning or immediate responses.

    Reasons to use SLMs

    Given the growing popularity and applicability of SLMs across various domains, particularly in areas like sustainability and the volume of data required for training, there are multiple reasons for employing them.

  • From a hardware perspective, utilizing SLMs is more cost-effective, requiring less computational power and memory. Moreover, their suitability for on-premises and on-device deployments enhances security measures.
  • Regarding usage, SLMs are compact language models, either trained or fine-tuned for specific domains or tasks. Consequently, they can incorporate specialized terminology and knowledge, from legal jargon to medical diagnoses, safeguarding intellectual property. Depending on the context, SLMs offer a more economical and efficient solution.
  • SLMs find applications across diverse sectors, encompassing healthcare, technology, and beyond. Their utility extends to text summarization, content generation, sentiment analysis, chatbot development, named entity recognition, spell checking, machine translation, and code generation.
  • What is Phi-3?

    Microsoft has a suite of small language models (SLMs) known as ‘Phi,’ demonstrating outstanding performance across various benchmarks. Microsoft’s recent release is Phi-3, a series of open AI models. The Phi-3 models represent a prototype of capability and cost-effectiveness among small language models (SLMs), exceeding models of equivalent and larger sizes across the spectrum of coding, language, reasoning, and mathematical standards. This launch broadens the array of high-calibre models accessible to customers, providing them with more practical options as they craft and construct generative AI applications.

    Phi-3-mini, a 3.8B language model, is accessible through Microsoft Azure AI Studio, Hugging Face, and Ollama. It is offered in two context-length variations—4K and 128K tokens. Notably, it is the first model within its category to support a context window of up to 128K tokens with minimal impact on quality. Furthermore, it is instruction-tuned, implying that it has been trained to comprehend and adhere to diverse instructions, mirroring natural human communication patterns. This ensures that the model is readily deployable straight out of the box. Phi-3-mini is available on Azure AI to leverage the deploy-eval-finetune toolchain, and it is also accessible on Ollama for developers to execute locally on their laptops.

    Features of Phi-3

    Phi-3 models exhibit distinctive superiority over language models of comparable and larger dimensions on key benchmarks, showcasing the following features:

  • Suitability for resource-constrained environments, including on-device and offline inference scenarios.
  • Performance optimization for latency-bound scenarios where rapid response times are paramount.
  • Tailored for use cases with simpler tasks.
  • Utilization of high-quality training data to ensure model accuracy.
  • Enhancement through comprehensive safety measures post-training, including reinforcement learning from human feedback (RLHF), automated testing and evaluations across numerous harm categories, and manual red-teaming.
  • Adherence to the Microsoft Responsible AI Standard, embodying a company-wide adherence to six fundamental principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness.
  • Snowflake meets Phi-3: Advantages  

    The key pain point about LLMs is the computing required to host and run them. Setting up a dozen GPUs to run models can be expensive and complex. There’s where Snowflake steps up. Snowflake’s compute pool option enables users to easily and quickly set up and manage compute clusters. Phi-3 comes into the picture because of its cost-effective GPU utilization.

    Can you imagine a situation where your language model only requires less than 3GB of GPU memory for inference? Well, now it’s possible, all thanks to Phi-3. It’s a state-of-the-art SLM that produces excellent results over GP3.5 and Mistral 8x7B, which are much bigger models. This opens the door for more cost-effective solutions to be brought up in the AI space. Add Snowflake for hosting; you have an excellent setup to host, test, and build AI applications. Read below how Beinex managed to run Phi-3 on Day 0 in Snowflake. 
    Figure 1: DocAI running on Phi-3

    Implementing Phi-3 on Snowflake: What Beinex Did and How Beinex Did it?

    Beinex has seamlessly integrated Phi-3 into Snowflake to help enterprises unlock their data’s full potential through advanced language processing capabilities and enhance decision-making with deeper insights. The integration facilitates Snowflake users to:

  • Capitalize on the efficiency of Phi-3 in environments with limited resources.
  • Ensure swift responses to queries and data requests, improving operational efficiency.
  • Achieve cost savings without compromising on model performance, optimizing resource allocation.
  • Boost the precision and dependability of their data-driven insights and predictions.
  • Leverage state-of-the-art AI capabilities while strictly adhering to data governance and compliance standards.
  • Here’s a detailed guide on implementing Phi-3 on Snowflake:

    Step 1: Create Necessary Objects  

    — Run by ACCOUNTADMIN to allow connecting to Hugging Face to download the model  
    — Stage to store LLM models
    CREATE STAGE <stagename> IF NOT EXISTS models

    — Stage to store YAML specs
    CREATE STAGE <stagename> IF NOT EXISTS specs

    — Image repository  

    — Compute pool to run containers  
      MIN_NODES = 1
      MAX_NODES = 1

    Step 2: Docker Image Code – ollama  

    FROM ollama/ollama

    RUN $(ollama serve > output.log 2>&1 &) && sleep 10 && ollama pull phi3 && pkill ollama && rm output.log

    ENTRYPOINT [“ollama”]
    CMD [“serve”] 

    Step 3: Tag and Push the Docker Image  

    docker tag ollama <SNOW_ORG-SNOW_ACCOUNT> respository /ollama
    docker push <SNOW_ORG-SNOW_ACCOUNT> db/schema/image repository /ollama 

    Step 4: Docker Image – UDF  

    FROM python:3.11

    WORKDIR /app
    ADD ./requirements.txt /app/

    RUN pip install –no-cache-dir -r requirements.txt

    ADD ./ /app

    EXPOSE 5000

    CMD [“flask”, “run”, “–host=”] content is given below : 

    from flask import Flask, request, Response, jsonify
    import logging
    import re
    import os
    from openai import OpenAI

    client = OpenAI(

    model = “phi3”

    app = Flask(__name__)

    def extract_json_from_string(s):”Extracting JSON from string: {s}”)
        # Use a regular expression to find a JSON-like string  
        matches = re.findall(r”\{[^{}]*\}”, s)

        if matches:
            # Return the first match (assuming there’s only one JSON object embedded)  
            return matches[0]

        # Return the original string if no JSON object is found  
        return s

    @app.route(“/”, methods=[“POST”])
    def udf():
            request_data: dict = request.get_json(force=True)  # type: ignore  
            return_data = []

            for index, col1 in request_data[“data”]:
                completion =
                            “role”: “system”,  
                            “content”: “You are a bot to help extract data and should give professional responses”,
                        {“role”: “user”, “content”: col1},
                    [index, extract_json_from_string(completion.choices[0].message.content)]

            return jsonify({“data”: return_data})
        except Exception as e:
            return jsonify(str(e)), 500  

    Step 6: YAML File  

        name: ollama  
          image: <SNOW_ORG-SNOW_ACCOUNT> db/schema/image respository /Phi3  
            NUM_GPU: 1  
            MAX_GPU_MEMORY: 24Gib  
            name: llm-workspace  
              mountPath: /<stage name>  
       name: udf  
          image: <SNOW_ORG-SNOW_ACCOUNT> db/schema/image respository /ollama_udf
        name: chat  
          port: 5000  
          public: false
        name: llm  
          port: 11434  
          public: false
       name: llm-workspace  
          source:“@<llm stage_name> 

    Step 7: Upload YAML File and Create Service 

    Upload the YAML file to the created stage, where the stage name in the YAML file should match the stage created in Step 2. 

    — Create service  
    create service phi3
    IN COMPUTE POOL <name of compute pool created>
    FROM @dash_stage
    SPECIFICATION_FILE = ‘<name of yaml file uploaded>’; 

    Step 8: Create Service Function  

    Create a service function on the service (after it starts). 

    create or replace function phi3chat(prompt text)
    returns text
    service= phi3

    Check Service Status  

    Use the following command to check the status of the service: 

      v.value:containerName::varchar container_name, 
      v.value:status::varchar status, 
      v.value:message::varchar message 
    FROM ( 
    SELECT parse_json(system$get_service_status(‘<service name>’)) 
    ) t,  
    LATERAL FLATTEN(input => t.$1) v;  

    Benefits of Running Phi-3 on Snowflake

    1. Cost-Effectiveness and Efficiency:

  • Phi-3’s efficacy as a model, coupled with Snowflake’s cloud-based infrastructure, ensures cost-effective operations.
  • Running Phi-3 on Snowflake minimizes operational expenses by optimizing resources and maximizing performance while delivering exceptional results.
  • 2. Compatibility with Smaller GPUs:

  • Phi-3’s versatility allows it to run efficiently on smaller GPUs, making it accessible to a wider range of users.
  • This flexibility enables organizations to leverage existing hardware infrastructure without needing extensive upgrades.
  • 3. Exceptional Performance:

  • Despite being an SLM, Phi-3 outperforms both SLMs and Large Language Models (LLMs) available in the market.
  • This superior performance translates to more accurate and reliable user outcomes, enhancing productivity and decision-making.
  • 4. Faster Response Times:

  • Even in resource-constrained environments, Phi-3’s capability to deliver faster responses is particularly advantageous within Snowflake’s ecosystem.
  • This integration ensures swift data processing and analysis by minimizing latency and maximizing throughput, enabling users to make informed real-time decisions.
  • SLM vs LLM

    The choice between small and large language models hinges on organizational needs, task complexity, and resource availability.

    LLMs excel in applications requiring the orchestration of intricate tasks, encompassing advanced reasoning, data analysis, and contextual comprehension.

    On the other hand, SLMs present viable options for regulated industries and sectors facing scenarios necessitating top-tier results while maintaining data within their premises.

    Both large and small language models possess distinct strengths and applications. While large language models thrive in managing complex workflows, small language models deliver impressive performance despite their compact size.

    While some customers may exclusively require small models, others may favour larger models, with many seeking to integrate both types in various configurations. Ultimately, the optimal selection depends on the unique context and objectives of the organization. Besides transitioning from large to small models, the trend is evolving towards a diversified portfolio of models. This means that instead of relying on a single model, customers can choose from various models with different sizes, capabilities, and resource requirements. This empowers customers to decide the best model for their scenario, balancing performance and resource constraints.