Setting up an Intel Arc GPU on Arch Linux for local LLM use (without ReBAR)
TL;DR: I was able to run local LLMs on my Intel Arc A750 GPU, in a system without ReBAR, by using the Vulkan backend and disabling FP16.
Since the launch of Intel’s Arc GPUs I’ve been interested in seeing how they perform when running local LLMs. I picked up an Arc A750 for testing and installed it in an older X99 system (which means no Resizable BAR, something that will come back to haunt me) running Arch Linux—this machine also doubles as my NAS. Getting this particular setup to run a local LLM ended up being quite an adventure. This blog details what I currently have working, as well as the things I tried along the way. I admit the use case is pretty niche: running local LLMs on an Intel Arc GPU without ReBAR. But it has been fun nevertheless.
What local LLM offerings are there for Arc?
From my research, as of November 2025, there are several ways to get local LLMs running on Intel Arc cards (these are the ones I am aware of or have tried; there may well be others):
- IPEX-LLM: A project maintained by Intel, built on their IPEX extension for PyTorch, which allows llama.cpp, Ollama, vLLM, et al. to run on Arc cards. They provide a Docker image with all libraries and packages pre-configured. However, as of October 2025, Intel seems to have retired IPEX, and presumably IPEX-LLM as well (no official confirmation, but updates have slowed significantly).
- llama-cpp: It is possible to compile llama-cpp to support Arc cards using two different backends: SYCL & Vulkan.
- LM Studio: A GUI app that uses llama-cpp under the hood and seems to have been compiled with Arc support. I wasn't able to test this, however, as my Arc is running in a headless server.
- vLLM: vLLM also has support for Arc; however, this appears to be based on IPEX, so it again might have an uncertain future.
- ollama: Ollama has just enabled support for a Vulkan backend as well! At the moment, though, you still need to turn it on separately with an environment variable, as it is not enabled by default (see the sketch below).
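For completeness, turning that on looks something like the following. The variable name is my reading of the release notes at the time of writing, and may change while the feature is experimental, so double-check the current Ollama docs:
# Experimental: start the Ollama server with the Vulkan backend enabled.
# OLLAMA_VULKAN=1 is an assumption based on the release notes - verify it.
OLLAMA_VULKAN=1 ollama serve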
What didn’t work for me: IPEX-LLM and SYCL
IPEX-LLM
I initially tried the IPEX-LLM Docker container. Despite a promising start, I never managed to run any model. Some of my issues are detailed in this issue on the IPEX-LLM repo. Essentially, despite the GPU being detected, the system would segfault when attempting to load the model weights onto the card. I was ultimately told by someone from Intel that my CPU (an i7 6950X from 2016) was too old (fair enough, I suppose!), and I gave up on this route.
llama-cpp with SYCL
I eventually decided to try building llama-cpp from source instead. After consulting the documentation, I decided to first try compiling with SYCL support. This was my first hands-on experience with SYCL and Intel's oneAPI ecosystem (it honestly took me quite a while to get up to speed with all of the nomenclature, having historically only dealt with Nvidia cards - I might do a blog post diving into just that topic), so the first step was installing dependencies.
Installing dependencies on Arch for SYCL
The main Intel Arc driver is open source and is included in an up-to-date Mesa install, making the setup on Arch quite easy. Many of the required SYCL/oneAPI/Level Zero dependencies are nicely wrapped up in one AUR package, intel-oneapi-basekit, so I installed that along with a few extra packages: intel-compute-runtime (the GPU runtime) and intel-graphics-compiler (needed for shader compilation). I also installed intel-gpu-tools to get the useful intel_gpu_top command (a tool similar to nvidia-smi, but for the Intel world).
This can all be installed as follows (I use yay, but any AUR helper would work):
yay -S intel-oneapi-basekit intel-compute-runtime intel-graphics-compiler intel-gpu-tools
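Before going further, it is also worth a quick sanity check on which kernel driver has bound to the card. This is an aside of mine rather than a required step; the PCI address here matches the one that appears in the sycl-ls warnings below, and depending on your kernel the driver will be i915 or the newer xe:
# Show which kernel driver is bound to the Arc card
# ("Kernel driver in use: i915", or xe on newer kernels)
lspci -k -s 03:00.0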
Once this has all been installed, we can check that SYCL can detect the GPU correctly. To do this, we must first source an included shell script to set up our environment, and then we can call sycl-ls.
source /opt/intel/oneapi/setvars.sh
sycl-ls
On my system I got the following output (notice the complaints about the small BAR already):
WARNING: Small BAR detected for device 0000:03:00.0
WARNING: Small BAR detected for device 0000:03:00.0
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A750 Graphics 12.55.8 [1.13.35563]
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.18.12.0.05_160000]
[opencl:cpu][opencl:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO [25.40.35563]
We can see that the A750 has been correctly identified as a level_zero:gpu. This means we can progress to actually installing and trying to use llama-cpp.
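One aside on that sycl-ls output: it lists the OpenCL CPU and FPGA-emulation devices alongside the GPU. If you ever need to pin SYCL programs to just the Level Zero GPU, the oneAPI runtime honours a device-selector environment variable (a small sketch; level_zero:0 matches the index shown above):
# Restrict SYCL programs to the first Level Zero (GPU) device only
export ONEAPI_DEVICE_SELECTOR=level_zero:0
sycl-ls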
Compiling and testing llama-cpp on SYCL
llama-cpp has a good set of instructions to help compile with SYCL support. I compiled with the following two commands (run from the root of a copy of the llama-cpp repo):
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j -v
And with that, llama-cpp should be compiled, with the binaries available in the newly created ./build/bin/ directory.
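As a quick sanity check that the SYCL backend actually made it into the build, recent llama-cpp builds include a flag to list the devices the binary can see (to the best of my knowledge; older builds may not have it):
# Should list the A750 as a SYCL device if the backend compiled correctly
./build/bin/llama-cli --list-devices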
We can then test running a model. You can attempt to run Qwen3-8B using the following command (this will pull the official GGUF from Hugging Face and try to offload all of the layers to the GPU):
./build/bin/llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -ngl 999
However, this is where it all fell down for me. No matter what model, quantisation, or parameters I used, llama-cpp would crash as it attempted to load the model weights onto the card, with error messages like this:
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 333.84 MiB
load_tensors: SYCL0 model buffer size = 4455.34 MiB
.[1] 35822 bus error (core dumped)
This error is very similar to the one I was getting when attempting to use the IPEX-LLM docker image originally.
SYCL seems to require ReBAR in order to work
After doing a bit of research (there is really not much out there for people attempting to run SYCL + Arc on non-ReBAR systems), it seems that, as I feared, SYCL essentially requires ReBAR to be enabled so that the CPU can address more than 256 MB of VRAM - and this was likely the source of my issues here. Short of flashing a patched BIOS onto my ASUS X99 motherboard to enable ReBAR, this seems to be the end of the line for SYCL.
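If you want to confirm what BAR size your card actually got, the kernel's PCI view shows it (the address matches the sycl-ls warnings above); without ReBAR the VRAM aperture is capped at 256M:
# List the memory BARs assigned to the Arc card; with ReBAR disabled the
# prefetchable VRAM BAR appears as [size=256M]
lspci -v -s 03:00.0 | grep -i "memory at"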
What has worked (sort of) for me
Vulkan
Not wanting to give in, I had seen on the llama-cpp repo that there is support for the Vulkan API, so I decided to try this route instead.
I also want to acknowledge the various write-ups from others that I read as research before starting.
Installing dependencies on Arch for Vulkan
As before, I will detail the packages I needed to install to get this to work. I would say Vulkan was easier to set up than SYCL. I installed packages for the Vulkan runtime, headers, and tools: vulkan-icd-loader, vulkan-headers & vulkan-tools. Then Mesa and the Intel Vulkan driver: mesa & vulkan-intel. Finally, the tools needed for compiling the Vulkan compute shaders: shaderc, spirv-tools & glslang.
All of these can be installed as follows:
yay -S vulkan-icd-loader vulkan-headers vulkan-tools mesa vulkan-intel shaderc spirv-tools glslang
Once all of this has been installed, we can see if Vulkan can detect the Arc GPU using the vulkaninfo command:
vulkaninfo | grep -i "intel"
Which should return something like this:
GPU id = 0 (Intel(R) Arc(tm) A750 Graphics (DG2))
GPU id = 0 (Intel(R) Arc(tm) A750 Graphics (DG2))
deviceName = Intel(R) Arc(tm) A750 Graphics (DG2)
driverID = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
driverName = Intel open-source Mesa driver
VK_INTEL_shader_integer_functions2 : extension revision 1
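As an aside, vulkaninfo also has a summary mode, which gives a more digestible per-GPU overview than grepping the full (very long) dump:
# Compact per-GPU summary: device name, driver, and API version
vulkaninfo --summary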
From here it is onto rebuilding llama-cpp again but this time with Vulkan support.
Compiling and testing llama-cpp on Vulkan
As before, we can compile llama-cpp with two commands, swapping some of the arguments for Vulkan-based ones:
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
Again this is detailed in the llama-cpp Vulkan instructions.
So, after all of this work, can we finally run a model that will use the Arc GPU? Well, let's test it:
./build/bin/llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -ngl 999
We initially see some promising signs: the model appears to load, and I am presented with the normal prompt for inputting text to the LLM. We can see the Vulkan build has got further than the SYCL one:
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(tm) A750 Graphics (DG2)) (0000:03:00.0) - 6868 MiB free
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 333.84 MiB
load_tensors: Vulkan0 model buffer size = 4455.34 MiB
So let’s write a prompt and give it a test run:
> Write me a story about using an Intel Arc GPU for running local LLMs
/>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.>.
Ah. The model is only able to output random tokens in its response - how disappointing. I thought this might be a Qwen artefact, but no matter what model I tried (Gemma 3, Llama 3, etc.) the problem persisted. To see if it was the GPU usage that was causing this, I moved the model onto the CPU (by setting -ngl 0) and was able to see real answers being generated - so something about running on the GPU was making the models fail.
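For reference, that CPU-only sanity check was just the same command with no layers offloaded:
# Same model, but keep all layers on the CPU to rule the GPU in or out
./build/bin/llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -ngl 0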
FP16, or lack thereof, to the rescue
At this point, I was willing to try anything to finally get this to work and scoured the llama-cpp GitHub for any hints of what the problem could be. Eventually, I came across this issue, which described a similar problem to mine. In it, someone mentioned that disabling the use of FP16 fixed their issue (although presumably with some performance penalty). So I fired up llama-cpp again, but this time with FP16 disabled:
GGML_VK_DISABLE_F16=1 ./build/bin/llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -ngl 999
And finally—sensible output!
> Write me a story about using an Intel Arc GPU for running local LLMs
Okay, here's a story about using an Intel Arc GPU for running local LLMs, aiming for a balance of evocative imagery and a slightly melancholic tone. It leans into the challenges and rewards of this particular setup.
---
The story is actually quite a good read, but that could just be me not thinking straight after finally getting this to work. I am not sure why FP16 cannot currently be used on the Arc card, or whether it is related to the older platform I am running. However, several other people in the thread could reproduce the issue (as of two weeks ago), so there is some hope it is a bug that will be patched on the Intel driver side. I also have no sense of how much performance this costs.
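In the meantime, rather than prefixing every run with the variable, the workaround can live in your shell profile:
# ~/.bashrc (or your shell's equivalent): always disable FP16 in
# llama-cpp's Vulkan backend on this machine
export GGML_VK_DISABLE_F16=1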
I was able to confirm that the GPU was really being used by running intel_gpu_top:
intel-gpu-top: Intel Dg2 (Gen12) @ /dev/dri/card0 - 2400/2401 MHz; 0% RC6; 1044 irqs/s
ENGINES BUSY MI_SEMA MI_WAIT
Render/3D 83.26% |████████████████████████████████████████████████▍ | 0% 0%
Blitter 0.05% |▏ | 0% 0%
Video 0.00% | | 0% 0%
VideoEnhance 0.00% | | 0% 0%
Compute 0.00% | | 0% 0%
PID MEM RSS Render/3D Blitter Video VideoEnhance Compute NAME
9512 1717988K 1711844K |█████████▍ || || || || | llama-cli
Interestingly, all of the activity whilst running a model shows up under the "Render/3D" category rather than "Compute". I suppose this makes sense, as llama-cpp's Vulkan backend runs its work as compute shaders, which on this hardware are dispatched on the render engine rather than the dedicated compute engine.
Wrap up
Overall, I am glad that I was finally able to get models running on my Intel Arc A750, and I look forward to testing it further. I also hope the FP16 bug gets sorted out at some point (if it does, I will update this page). I hope Intel keeps producing new GPUs and that the software ecosystem around them continues to mature. Next, I will benchmark this card with a range of models and compare it to the Nvidia & Apple M-series chips I have access to.
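For those benchmarks, the plan is to use the llama-bench tool that ships in the same build directory. A minimal sketch of a run (the GGUF path here is a placeholder - llama-bench takes a local model file via -m rather than pulling from Hugging Face):
# Benchmark prompt processing and token generation with all layers offloaded;
# the model path is hypothetical - point -m at any local GGUF
GGML_VK_DISABLE_F16=1 ./build/bin/llama-bench -m ./models/Qwen3-8B-Q4_K_M.gguf -ngl 999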