Today, the Khronos Group
released the 1.4 specification of Vulkan, the standard graphics API. The
Asahi Linux project is proud to
announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp
driver has been recognized by Khronos
as conformant to the new version since day one.
That driver is already available in our official repositories. After
installing Fedora Asahi Remix, run dnf
upgrade --refresh to get the latest drivers.
Vulkan 1.4 standardizes several important features, including
timestamps and dynamic rendering local read. The industry expects that
these features will become more common, and we are prepared.
Releasing a conformant driver reflects our commitment to graphics
standards and software freedom. Asahi Linux is also compatible with
OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the
relevant specifications. For that matter, ours are the only conformant
drivers on Apple hardware for any graphics standard.
Although the driver is released, you still need to build an
experimental version of Vulkan-Loader to access the new Vulkan version.
Nevertheless, you can immediately use all the new features as extensions
in our Vulkan 1.3 driver.
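Until then, an application can simply probe for the promoted functionality as a
device extension. A minimal sketch in C (the extension name is the real one
promoted to core in 1.4; instance creation and physical device selection are
omitted):
#include <stdlib.h>
#include <string.h>
#include <vulkan/vulkan.h>

/* Sketch: check whether a Vulkan 1.3 device exposes dynamic rendering
 * local read as an extension. Error handling is omitted for brevity. */
static int has_local_read(VkPhysicalDevice phys)
{
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(phys, NULL, &count, NULL);

    VkExtensionProperties *props = calloc(count, sizeof(*props));
    vkEnumerateDeviceExtensionProperties(phys, NULL, &count, props);

    int found = 0;
    for (uint32_t i = 0; i < count; i++)
        if (!strcmp(props[i].extensionName,
                    VK_KHR_DYNAMIC_RENDERING_LOCAL_READ_EXTENSION_NAME))
            found = 1;

    free(props);
    return found;
}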
For more information, see
the Khronos blog post.
Hi all!
This month I’ve spent a lot of time triaging Sway and wlroots issues following
the Sway 1.10 release. There are a few regressions, some of which are already
fixed (thanks to all contributors for sending patches!). Kenny has added
support for software-only secondary KMS devices such as GUD and DisplayLink.
David Turner from Raspberry Pi has contributed crop and scale support for
output buffers, so that video players are more likely to hit direct scan-out.
I’ve added support for explicit sync in the Wayland backend for nested
compositors.
I’ve worked a bit on the Goguma mobile IRC client. The auto-complete dropdown
now shows user display names, channel topics and command descriptions.
Additionally, commands which don’t make sense given the current context are
hidden (for instance, /part is not displayed in a conversation with a single
user).
The gamja Web IRC client should now reconnect more quickly after regaining
connectivity. For instance, after resume from suspend, gamja now reconnects
immediately instead of waiting 10 seconds. Thanks to Matteo, soju-containers
now ships arm64 images.
The NPotM is sogogi, a simple
WebDAV file server. It’s quite minimal for now: a list of directories to
serve is defined in the configuration file, as well as users and access lists.
In the future, I’d like to add external authentication (e.g. via PAM or via
another HTTP server), HTML directory listings and configuration file reload.
That’s all for now! Once again, that’s a pretty short status update. A lot of
my time goes into more boring maintenance tasks and reviews. See you next month!
XDC 2024 in Montreal was another fantastic gathering for the Linux Graphics
community. It was again a great time to immerse in the world of graphics
development, engage in stimulating conversations, and learn from inspiring
developers.
Many Igalia colleagues and I participated in the
conference again, delivering multiple talks about our work on the Linux
Graphics stack and also organizing the Display/KMS meeting. This blog post is a
detailed report on the Display/KMS meeting held during this XDC edition.
Short on Time?
Catch the lightning talk summarizing the meeting here (you can even speed it up
2x):
For a quick written summary, scroll down to the TL;DR section.
TL;DR
This meeting took 3 hours and tackled a variety of topics related to DRM/KMS
(Linux/DRM Kernel Modesetting):
Sharing Drivers Between V4L2 and KMS: Brainstorming solutions for using a
single driver for devices used in both camera capture and display pipelines.
Real-Time Scheduling: Addressing issues with non-blocking page flips
encountering sigkills under real-time scheduling.
HDR/Color Management: Agreement on merging the current proposal, with
NVIDIA implementing its special cases on VKMS and adding missing parts on top
of Harry Wentland’s (AMD) changes.
Display Mux: Collaborative design discussions focusing on compositor
control and cross-sync considerations.
Better Commit Failure Feedback: Exploring ways to equip compositors with
more detailed information for failure analysis.
Bringing together Linux display developers at XDC 2024
While I didn’t present a talk this year, I co-organized a Display/KMS meeting (with
Rodrigo Siqueira of AMD) to build upon the momentum from the 2024 Linux
Display Next hackfest.
The meeting was attended by around 30 people in person and 4 remote participants.
Speakers: Melissa Wen (Igalia) and Rodrigo Siqueira (AMD)
Link: https://indico.freedesktop.org/event/6/contributions/383/
Topics: Similar to the hackfest, the meeting agenda was built over the
first two days of the conference and mixed talk follow-ups with new ideas and
ongoing community efforts.
The final agenda covered five topics in the scheduled order:
How to share drivers between V4L2 and DRM for bridge-like components (new
topic);
Real-time Scheduling (problems encountered after the Display Next hackfest);
HDR/Color Management (ofc);
Display Mux (from Display hackfest and XDC 2024 talk, bringing AMD and
NVIDIA together);
(Better) Commit Failure Feedback (continuing the last minute topic of the
Display Next hackfest).
Unpacking the Topics
Similar to the hackfest, the meeting agenda evolved over the conference.
During the 3 hours of meeting, I coordinated the room and discussion rounds,
and Rodrigo Siqueira took notes and also contacted key developers to provide a
detailed report of the many topics discussed.
From his notes, let’s dive into the key discussions!
How to share drivers between V4L2 and KMS for bridge-like components.
Led by Laurent Pinchart, we delved into the challenge of creating a unified
driver for hardware devices (like scalers) that are used in both camera capture
pipelines and display pipelines.
Problem Statement: How can we design a single kernel driver to handle
devices that serve dual purposes in both V4L2 and DRM subsystems?
Potential Solutions:
Multiple Compatible Strings: We could assign different compatible
strings to the device tree node based on its usage in either the camera or
display pipeline. However, this approach might raise concerns from device tree
maintainers as it could be seen as a layer violation.
Separate Abstractions: A single driver could expose the device to both
DRM and V4L2 through separate abstractions: drm-bridge for DRM and V4L2 subdev
for video. While simple, this approach requires maintaining two different
abstractions for the same underlying device.
Unified Kernel Abstraction: We could create a new, unified kernel
abstraction that combines the best aspects of drm-bridge and V4L2 subdev. This
approach offers a more elegant solution but requires significant design effort
and potential migration challenges for existing hardware.
Real-Time Scheduling Challenges
We have discussed real-time scheduling during this year’s Linux Display Next
hackfest and,
during the XDC 2024, Jonas Adahl brought up issues uncovered while
progressing on this front.
Context: Non-blocking page flips can, on rare occasions, take a long time
and, for that reason, receive a SIGKILL if the thread doing the atomic commit
has a real-time scheduling policy.
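For background (an illustrative sketch, not something presented at the
meeting): real-time threads set up by tools like RTKit run under an
RLIMIT_RTTIME CPU-time budget, and a busy wait that exceeds that budget without
blocking is what draws the kernel’s SIGKILL:
#include <sched.h>
#include <sys/resource.h>

/* Sketch: a thread with a real-time policy and an RLIMIT_RTTIME budget.
 * If it burns this much CPU time without a blocking syscall (e.g. busy
 * waiting inside an atomic commit), the kernel raises SIGXCPU at the
 * soft limit and SIGKILL at the hard limit. */
static void enter_realtime(void)
{
    struct sched_param param = { .sched_priority = 1 };
    sched_setscheduler(0, SCHED_RR, &param);

    struct rlimit budget = { .rlim_cur = 200000, .rlim_max = 200000 }; /* µs */
    setrlimit(RLIMIT_RTTIME, &budget);
}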
Action items:
Explore alternative backtraces during the busy wait (e.g., ftrace).
Investigate the maximum thread time in busy wait to reproduce issues faced
by compositors. Tools like RTKit (mutter) can be used for better control
(Michel Dänzer can help with this setup).
HDR/Color Management
This is a well-known topic with ongoing effort on all layers of the Linux
Display stack and has been discussed online and in-person in conferences and
meetings over the last years.
Here’s a breakdown of the key points raised at this meeting:
Talk: Color operations for Linux color pipeline on AMD
devices: On the previous day,
Alex Hung (AMD) presented the implementation of this API in the AMD display driver.
NVIDIA Integration: While they agree with the overall proposal, NVIDIA
needs to add some missing parts. Importantly, they will implement these on
top of Harry Wentland’s (AMD) proposal. Their specific requirements will be
implemented on VKMS (Virtual Kernel Mode Setting driver) for further
discussion. This VKMS implementation can benefit compositor developers by
providing insights into NVIDIA’s specific needs.
Other vendors: There is a version of the KMS API applied to the Intel color
pipeline. Apart from that, other vendors appear to be comfortable with the
current proposal but lack the bandwidth to implement it right now.
Upstream Patches: The relevant upstream patches can be found
here.
[As was humorously noted, this series is eagerly awaiting your “Acked-by”
(approval).]
Compositor Side: The compositor developers have also made significant
progress.
KDE has already implemented and validated the API through an experimental
implementation in
Kwin.
Gamescope currently uses a driver-specific implementation but has
a draft that utilizes
the generic version. However, some work is still required to fully transition
away from the driver-specific approach.
AP: work on porting gamescope to KMS generic API
Weston has also begun exploring implementation, and we might see something
from them by the end of the year.
Kernel and Testing: The kernel API proposal is well-refined and meets the
DRM subsystem requirements. Thanks to Harry Wentland’s efforts, we already
have the API attached to two hardware vendors
and IGT tests, and,
thanks to Xaver Hugl, a compositor implementation in place.
Finally, there was a strong sense of agreement that the current proposal for
HDR/Color Management is ready to be merged. In simpler terms, everything seems
to be working well on the technical side - all signs point to merging and
“shipping” the DRM/KMS plane color management API!
Display Mux
During the meeting, Daniel Dadap led a brainstorming session on the
design of the display mux switching sequence, in which the compositor would arm
the switch via sysfs, then send a modeset to the outgoing driver, followed by a
modeset to the incoming driver.
Context:
During this year’s Linux Display Next hackfest, Mario Limonciello (AMD)
introduced the topic and led a discussion on Display Mux.
Daniel Dadap (NVIDIA) retook this discussion with the
XDC 2024 talk: Dynamic Switching of Display Muxes on Hybrid GPU Systems.
Key Considerations:
HPD Handling: There was a general consensus that disabling HPD can be
part of the sequence for internal panels and we don’t need to focus on it
here.
Cross-Sync: Ensuring synchronization between the compositor and the
drivers is crucial. The compositor should act as the “drm-master” to
coordinate the entire sequence, but how can this be ensured?
Future-Proofing: The design should not assume the presence of a mux. In
future scenarios, direct sharing over DP might be possible.
Action points:
Sharing DP AUX: Explore the idea of sharing DP AUX and its implications.
Backlight: The backlight definition represents a problem in the mux
switch context, so we should explore some of the current specs available
for that.
Towards Better Commit Failure Feedback
In the last part of the meeting, Xaver Hugl asked for better commit failure
feedback.
Problem description: Compositors currently face challenges in collecting
detailed information from the kernel about commit failures. This lack of
granular data hinders their ability to understand and address the root causes
of these failures.
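To illustrate (a sketch with libdrm, not code from the meeting): today a failed
test-only atomic commit hands the compositor nothing more than an errno:
#include <stdio.h>
#include <string.h>
#include <xf86drmMode.h>

/* Sketch: probe a configuration. On failure all we get back is a bare
 * error code (e.g. -EINVAL), with no hint of which object or property
 * was rejected; any details end up in dmesg instead. */
static int try_config(int drm_fd, drmModeAtomicReq *req)
{
    int ret = drmModeAtomicCommit(drm_fd, req,
                                  DRM_MODE_ATOMIC_TEST_ONLY, NULL);
    if (ret < 0)
        fprintf(stderr, "atomic commit failed: %s\n", strerror(-ret));
    return ret;
}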
To address this issue, we discussed several potential improvements:
Direct Kernel Log Access: One idea is to directly load relevant kernel logs
into the compositor. This would provide more detailed information about the
failure and potentially aid in debugging.
Finer-Grained Failure Reporting: We also explored the possibility of
separating atomic failures into more specific categories. Not all failures
are critical, and understanding the nature of the failure can help compositors
take appropriate action.
Enhanced Logging: Currently, the dmesg log doesn’t provide enough
information for user-space validation. Raising the log level to capture more
detailed information during failures could be a viable solution.
By implementing these improvements, we aim to equip compositors with the
necessary tools to better understand and resolve commit failures, leading to a
more robust and stable display system.
A Big Thank You!
Huge thanks to Rodrigo Siqueira for these detailed meeting notes. Also, Laurent
Pinchart, Jonas Adahl, Daniel Dadap, Xaver Hugl, and Harry Wentland for
bringing up interesting topics and leading discussions. Finally, thanks to all
the participants who enriched the discussions with their experience, ideas, and
inputs, especially Alex Goins, Antonino Maniscalco, Austin Shafer, Daniel
Stone, Demi Obenour, Jessica Zhang, Joan Torres, Leo Li, Liviu Dudau, Mario
Limonciello, Michel Dänzer, Rob Clark, Simon Ser and Teddy Li.
This collaborative effort will undoubtedly contribute to the continued
development of the Linux display stack.
Stay tuned for future updates!
A while ago I was looking at Rust-based parsing of HID reports but, surprisingly, outside of C wrappers and the usual cratesquatting I couldn't find anything ready to use. So I figured, why not write my own, NIH style. Yay! Gave me a good excuse to learn API design for Rust and whatnot. Anyway, the result of this effort is the hidutils collection of repositories which includes commandline tools like hid-recorder and hid-replay but, more importantly, the hidreport (documentation) and hut (documentation) crates. Let's have a look at the latter two.
Both crates were intentionally written with minimal dependencies, they currently only depend on thiserror and arguably even that dependency can be removed.
HID Usage Tables (HUT)
As you know, HID Fields have a so-called "Usage" which is divided into a Usage Page (like a chapter) and a Usage ID. The HID Usage tells us what a sequence of bits in a HID Report represents, e.g. "this is the X axis" or "this is button number 5". These usages are specified in the HID Usage Tables (HUT) (currently at version 1.5 (PDF)). The hut crate is generated from the official HUT json file and contains all current HID Usages together with the various conversions you will need to get from a numeric value in a report descriptor to the named usage and vice versa. Which means you can do things like this:
let gd_x = GenericDesktop::X;
let usage_page = gd_x.usage_page();
assert!(matches!(usage_page, UsagePage::GenericDesktop));
Or the more likely need: convert from a numeric page/id tuple to a named usage.
let usage = Usage::new_from_page_and_id(0x1, 0x30); // GenericDesktop / X
println!("Usage is {}", usage.name());
90% of this crate is the various conversions from a named usage to the numeric value and vice versa. It's a huge crate in that there are lots of enum values, but the actual functionality is relatively simple.
hidreport - Report Descriptor parsing
The hidreport crate is the one that can take a set of HID Report Descriptor bytes obtained from a device and parse the contents, or extract the value of a HID Field from a HID Report, given the HID Report Descriptor. So let's assume we have a bunch of bytes that are a HID report descriptor read from the device (or sysfs); we can do this:
let rdesc: ReportDescriptor = ReportDescriptor::try_from(bytes).unwrap();
I'm not going to copy/paste the code to run through this report descriptor but suffice to say it will give us access to the input, output and feature reports on the device together with every field inside those reports. Now let's read from the device and parse the data for whatever the first field is in the report (this is obviously device-specific, could be a button, a coordinate, anything):
let input_report_bytes = read_from_device();
let report = rdesc.find_input_report(&input_report_bytes).unwrap();
let field = report.fields().first().unwrap();
match field {
    Field::Variable(var) => {
        let val: u32 = var.extract(&input_report_bytes).unwrap().into();
        println!("Field {:?} is of value {}", field, val);
    },
    _ => {}
}
The full documentation is of course on docs.rs and I'd be happy to take suggestions on how to improve the API and/or add features not currently present.
hid-recorder
The hidreport and hut crates are still quite new but we have an existing test bed that we use regularly. The venerable hid-recorder tool has been rewritten twice already. Benjamin Tissoires' first version was in C, then a Python version of it became part of hid-tools, and now we have the third version written in Rust. It has a few nice features over the Python version and we're using it heavily for e.g. udev-hid-bpf debugging and development. An example output is below, and it shows that you can get all the information out of the device via the hidreport and hut crates.
$ sudo hid-recorder /dev/hidraw1
# Microsoft Microsoft® 2.4GHz Transceiver v9.0
# Report descriptor length: 223 bytes
# 0x05, 0x01, // Usage Page (Generic Desktop) 0
# 0x09, 0x02, // Usage (Mouse) 2
# 0xa1, 0x01, // Collection (Application) 4
# 0x05, 0x01, // Usage Page (Generic Desktop) 6
# 0x09, 0x02, // Usage (Mouse) 8
# 0xa1, 0x02, // Collection (Logical) 10
# 0x85, 0x1a, // Report ID (26) 12
# 0x09, 0x01, // Usage (Pointer) 14
# 0xa1, 0x00, // Collection (Physical) 16
# 0x05, 0x09, // Usage Page (Button) 18
# 0x19, 0x01, // UsageMinimum (1) 20
# 0x29, 0x05, // UsageMaximum (5) 22
# 0x95, 0x05, // Report Count (5) 24
# 0x75, 0x01, // Report Size (1) 26
... omitted for brevity
# 0x75, 0x01, // Report Size (1) 213
# 0xb1, 0x02, // Feature (Data,Var,Abs) 215
# 0x75, 0x03, // Report Size (3) 217
# 0xb1, 0x01, // Feature (Cnst,Arr,Abs) 219
# 0xc0, // End Collection 221
# 0xc0, // End Collection 222
R: 223 05 01 09 02 a1 01 05 01 09 02 a1 02 85 1a 09 ... omitted for brevity
N: Microsoft Microsoft® 2.4GHz Transceiver v9.0
I: 3 45e 7a5
# Report descriptor:
# ------- Input Report -------
# Report ID: 26
# Report size: 80 bits
# | Bit: 8 | Usage: 0009/0001: Button / Button 1 | Logical Range: 0..=1 |
# | Bit: 9 | Usage: 0009/0002: Button / Button 2 | Logical Range: 0..=1 |
# | Bit: 10 | Usage: 0009/0003: Button / Button 3 | Logical Range: 0..=1 |
# | Bit: 11 | Usage: 0009/0004: Button / Button 4 | Logical Range: 0..=1 |
# | Bit: 12 | Usage: 0009/0005: Button / Button 5 | Logical Range: 0..=1 |
# | Bits: 13..=15 | ######### Padding |
# | Bits: 16..=31 | Usage: 0001/0030: Generic Desktop / X | Logical Range: -32767..=32767 |
# | Bits: 32..=47 | Usage: 0001/0031: Generic Desktop / Y | Logical Range: -32767..=32767 |
# | Bits: 48..=63 | Usage: 0001/0038: Generic Desktop / Wheel | Logical Range: -32767..=32767 | Physical Range: 0..=0 |
# | Bits: 64..=79 | Usage: 000c/0238: Consumer / AC Pan | Logical Range: -32767..=32767 | Physical Range: 0..=0 |
# ------- Input Report -------
# Report ID: 31
# Report size: 24 bits
# | Bits: 8..=23 | Usage: 000c/0238: Consumer / AC Pan | Logical Range: -32767..=32767 | Physical Range: 0..=0 |
# ------- Feature Report -------
# Report ID: 18
# Report size: 16 bits
# | Bits: 8..=9 | Usage: 0001/0048: Generic Desktop / Resolution Multiplier | Logical Range: 0..=1 | Physical Range: 1..=12 |
# | Bits: 10..=11 | Usage: 0001/0048: Generic Desktop / Resolution Multiplier | Logical Range: 0..=1 | Physical Range: 1..=12 |
# | Bits: 12..=15 | ######### Padding |
# ------- Feature Report -------
# Report ID: 23
# Report size: 16 bits
# | Bits: 8..=9 | Usage: ff00/ff06: Vendor Defined Page 0xFF00 / Vendor Usage 0xff06 | Logical Range: 0..=1 | Physical Range: 1..=12 |
# | Bits: 10..=11 | Usage: ff00/ff0f: Vendor Defined Page 0xFF00 / Vendor Usage 0xff0f | Logical Range: 0..=1 | Physical Range: 1..=12 |
# | Bit: 12 | Usage: ff00/ff04: Vendor Defined Page 0xFF00 / Vendor Usage 0xff04 | Logical Range: 0..=1 | Physical Range: 0..=0 |
# | Bits: 13..=15 | ######### Padding |
##############################################################################
# Recorded events below in format:
# E: . [bytes ...]
#
# Current time: 11:31:20
# Report ID: 26 /
# Button 1: 0 | Button 2: 0 | Button 3: 0 | Button 4: 0 | Button 5: 0 | X: 5 | Y: 0 |
# Wheel: 0 |
# AC Pan: 0 |
E: 000000.000124 10 1a 00 05 00 00 00 00 00 00 00
Some days ago I wrote about the new VK_EXT_device_generated_commands Vulkan extension that had just been made public.
Soon after that, I presented a talk at XDC 2024 with a brief introduction to it.
It’s a lightning talk that lasts just about 7 minutes and you can find the embedded video below, as well as the slides and the talk transcription if you prefer written formats.
Truth be told, the topic deserves a longer presentation, for sure.
However, when I submitted my talk proposal for XDC I wasn’t sure if the extension was going to be public by the time XDC would take place.
If I had submitted a half-slot talk and the extension was not public by then, I would have needed to talk for 15 minutes about some general concepts and a couple of NVIDIA vendor-specific extensions: VK_NV_device_generated_commands and VK_NV_device_generated_commands_compute.
That would be awkward, so I went with a lightning talk where I could cover those general concepts and, maybe, some VK_EXT_device_generated_commands specifics if the extension was public by then, which is exactly what happened.
Fortunately, I will talk again about the extension at Vulkanised 2025.
It will be a longer talk and I will cover the topic in more depth.
See you in Cambridge in February and, for those not attending, stay tuned because Vulkanised talks are recorded and later uploaded to YouTube.
I’ll post the link here and in social media once it’s available.
XDC 2024 recording
Talk slides and transcription
Hello, I’m Ricardo from Igalia and I’m going to talk about Device-Generated Commands in Vulkan.
This is a new extension that was released a couple of weeks ago.
I wrote CTS tests for it, I helped with the spec, and I worked with some actual heroes, some of them present in this room, who managed to get this implemented in a driver.
Device-Generated Commands is an extension that allows apps to go one step further in GPU-driven rendering because it makes it possible to write commands to a storage buffer from the GPU and later execute the contents of the buffer without needing to go through the CPU to record those commands, like you typically do by calling vkCmd functions working with regular command buffers.
It’s one step ahead of indirect draws and dispatches, and one step behind work graphs.
Getting away from Vulkan momentarily, if you want to store commands in a storage buffer there are many possible ways to do it.
A naïve approach we can think of is creating the buffer as you see in the slide.
We assign a number to each Vulkan command and store it in the buffer.
Then, depending on the command, more or less data follows.
For example, let’s take the sequence of commands in the slide: (1) push constants followed by (2) dispatch.
We can store a token number (or command id, or whatever you want to call it) to indicate push constants, then we follow with metadata about the command (the section in green) containing the layout, stage flags, offset and size of the push constants.
Finally, depending on the size, we store the push constant values, which is the first chunk of data in blue.
For the dispatch it’s similar, only that it doesn’t need metadata because we only want the dispatch dimensions.
But this is not how GPUs work.
A GPU would have a very hard time processing this.
Also, Vulkan doesn’t work like this either.
We want to make it possible to process things in parallel and provide as much information in advance as possible to the driver.
So in Vulkan things are different.
The buffer will not contain an arbitrary sequence of commands where you don’t know which one comes next.
What we do is to create an Indirect Commands Layout.
This is the main concept.
The layout is like a template for a short sequence of commands.
We create this layout using the tokens and meta-data that we saw colored red and green in the previous slide.
We specify the layout we will use in advance and, in the buffer, we only store the actual data for each command.
The result is that the buffer containing commands (let’s call it the DGC buffer) is divided into small chunks, called sequences in the spec, and the buffer can contain many such sequences, but all of them follow the layout we specified in advance.
In the example, we have push constant values of a known size followed by the dispatch dimensions. Push constant values, dispatch. Push constant values, dispatch. Etc.
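To picture it in C (my illustration, not a structure from the spec; a real layout also has to respect the extension’s alignment rules), one sequence from the example could look like this:
#include <vulkan/vulkan.h>

enum { MAX_SEQUENCES = 1024 }; /* made-up capacity */

/* One sequence as fixed by the layout: the token order and sizes were
 * declared up front, so only the per-command data stays in the buffer. */
struct sequence {
    uint32_t push_values[4];            /* push constant data, size fixed by the layout */
    VkDispatchIndirectCommand dispatch; /* x, y, z workgroup counts */
};

/* The DGC buffer is then just an array of identically-shaped sequences. */
struct sequence dgc_data[MAX_SEQUENCES];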
The second thing Vulkan does is to severely limit the selection of available commands.
You can’t just start render passes or bind descriptor sets or do anything you can do in a regular command buffer.
You can only do a few things, and they’re all in this slide.
There’s general stuff like push constants, stuff related to graphics like draw commands and binding vertex and index buffers, and stuff to dispatch compute or ray tracing work.
That’s it.
Moreover, each layout must have one token that dispatches work (draw, compute, trace rays) but you can only have one and it must be the last one in the layout.
Something that’s optional (not every implementation is going to support this) is being able to switch pipelines or shaders on the fly for each sequence.
Summing up, in implementations that allow you to do it, you have to create something new called Indirect Execution Sets, which are groups or arrays of pipelines that are more or less identical in state and, basically, only differ in the shaders they include.
Inside each set, each pipeline gets an index and you can change the pipeline used for each sequence by (1) specifying the Execution Set in advance (2) using an execution set token in the layout, and (3) storing a pipeline index in the DGC buffer as the token data.
The summary of how to use it would be:
First, create the commands layout and, optionally, create the indirect execution set if you’ll switch pipelines and the driver supports that.
Then, get a rough idea of the maximum number of sequences that you’ll run in a single batch.
With that, create the DGC buffer, query the required preprocess buffer size (an auxiliary buffer used by some implementations), and allocate both.
Then, you record the regular command buffer normally and specify the state you’ll use for DGC.
This also includes some commands that dispatch work that fills the DGC buffer somehow.
Finally, you dispatch indirect work by calling vkCmdExecuteGeneratedCommandsEXT.
Note you need a barrier to synchronize previous writes to the DGC buffer with reads from it.
You can also do explicit preprocessing but I won’t go into detail here.
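As a rough sketch in C (the entry points, structures and flags come from VK_EXT_device_generated_commands and synchronization2; the handles, addresses and sizes are placeholders passed in by the caller):
#include <vulkan/vulkan.h>

/* Sketch: synchronize previous writes to the DGC buffer, then execute.
 * Assumes the DGC buffer was filled from a compute shader. */
void execute_dgc(VkCommandBuffer cmd, VkIndirectCommandsLayoutEXT layout,
                 VkDeviceAddress dgc_addr, VkDeviceSize dgc_size,
                 VkDeviceAddress preprocess_addr, VkDeviceSize preprocess_size,
                 uint32_t max_sequences)
{
    VkMemoryBarrier2 barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2,
        .srcStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
        .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
        .dstStageMask = VK_PIPELINE_STAGE_2_COMMAND_PREPROCESS_BIT_EXT,
        .dstAccessMask = VK_ACCESS_2_COMMAND_PREPROCESS_READ_BIT_EXT,
    };
    VkDependencyInfo dep = {
        .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
        .memoryBarrierCount = 1,
        .pMemoryBarriers = &barrier,
    };
    vkCmdPipelineBarrier2(cmd, &dep);

    VkGeneratedCommandsInfoEXT info = {
        .sType = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
        .shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
        .indirectExecutionSet = VK_NULL_HANDLE, /* not switching pipelines here */
        .indirectCommandsLayout = layout,
        .indirectAddress = dgc_addr,
        .indirectAddressSize = dgc_size,
        .preprocessAddress = preprocess_addr, /* sized via vkGetGeneratedCommandsMemoryRequirementsEXT */
        .preprocessSize = preprocess_size,
        .maxSequenceCount = max_sequences,
    };
    vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE /* no explicit preprocessing */, &info);
}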
That’s it.
Thanks for watching, thanks Valve for funding a big chunk of the work involved in shipping this, and thanks to everyone who contributed!
Several months have passed since the last update. This has been in part due to the summer holidays and a gig doing some non-upstream work, but I have also had the opportunity to continue my work on the NPU driver for the VeriSilicon NPU in the NXP i.MX 8M Plus SoC, thanks to my friends at Ideas on Board.
I'm very happy with what has been accomplished so far, with the first concrete result being the merge in Mesa of the support for NXP's SoC. Thanks to Philipp Zabel and Christian Gmeiner for helping with their ideas and code reviews.
With this, as of yesterday, one can accelerate models such as SSDLite MobileDet on that SoC with only open source software, with the support being provided directly by projects that are already ubiquitous in today's products, such as the Linux kernel and Mesa3D. We can expect this functionality to reach distributions such as Debian in due time, for seamless installation and integration in products.
With this milestone reached, I will be working on expanding support for more models, with a first goal of enabling YOLO-like models, starting with YOLOX. I will also be working on performance, as currently we are not fully using the capabilities of this hardware.
I’m a big supporter of finding problems before they get into the code base. The earlier you catch issues, the easier they are to fix. One of the main tools that helps with this is a Continuous Integration (CI) farm. A CI farm allows you to run extensive tests like deqp or piglit on a merge request or even on a private git branch before any code is merged, which significantly helps catch problems early.
I’m not the first one at Igalia to think this is really important. We already have a large Raspberry Pi board farm available on freedesktop’s GitLab instance that serves as a powerful tool for validating changes before they hit the main branch.
For a while, however, the etnaviv board farm has been offline. The main reason? I needed to clean up the setup: re-house it in a proper rack, redo all the wiring, and add more devices. What initially seemed like a few days’ worth of work spiraled into months of delay, mostly because I wanted to transition to using ci-tron.
Getting Familiar with the Ci-Tron Setup
Before diving into my journey, let’s quickly cover what makes up a ci-tron board farm.
Ci-Tron Gateway: This component is the central hub that manages devices.
PDU (Power Distribution Unit): A PDU is a device that manages the electrical power distribution to all the components in the CI farm. It allows you to remotely control the power, including power cycling devices, which is crucial for automating device management.
DUT (Device Under Test): The heart of the CI farm—these are the devices where the actual testing happens.
The Long Road to a Working Farm
Over the past few months, I’ve been slowly preparing for the big ci-tron transition. The first step was ensuring my PDU was compatible. It wasn’t initially supported, but after some hacking, I got it working and submitted a merge request (MR). After a few rounds of revisions, it was merged, expanding ci-tron’s PDU support significantly.
The next and most critical step was getting a DUT to boot up correctly. Initially, ci-tron only supported iPXE as a boot method, but my devices are using U-Boot. I tried to make it work anyway, but the network initialization failed too often, and I found myself sinking hours into debugging.
Thankfully, rudimentary support for a U-Boot based boot flow was eventually added. After some tweaks, I managed to get my DUTs booting — but not without complications. A major problem was getting the correct Device Tree Blob (DTB) to load, which was needed for ci-tron’s training rounds. A DTB is a binary representation of the hardware layout of a device; the Linux kernel uses it to understand the hardware configuration, including components like the CPU, memory, and peripherals. In my case, ensuring that the correct DTB was provided was crucial for the DUT to boot and be correctly managed by ci-tron. While integrating the DTB into U-Boot was suggested, it wasn’t ideal: updating the bootloader just to change a DTB is cumbersome, especially with multiple devices in the farm.
With the booting issue taking up too much time, I decided to put it on hold and focus on something else: gfxinfo.
Gfxinfo Integration Challenges
gfxinfo is a neat feature that automatically tags a DUT based on the GPU model in the system, avoiding the need for manually assigning tags like gc2000. In theory, it’s very convenient—but in practice, there were hurdles.
gfxinfo tags Vivante GPUs using the device tree node information. However, since Vivante GPUs are quite generic, they don’t have a specific model property that uniquely identifies them. The plan was to pull this information using ioctl() calls to the etnaviv kernel driver. It took a lot of back and forth in review due to the internal gfxinfo API being under-documented, but after a lot of effort, I finally got the necessary code merged. You can find all of it in this MR.
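For illustration, the query boils down to one ioctl (a sketch, not gfxinfo’s actual code; the render node path is an example):
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <libdrm/etnaviv_drm.h>

/* Sketch: ask the etnaviv kernel driver for the GPU model, e.g. 0x2000
 * for a GC2000. Error handling is omitted. */
int main(void)
{
    int fd = open("/dev/dri/renderD128", O_RDWR);
    struct drm_etnaviv_param req = {
        .pipe = 0,
        .param = ETNAVIV_PARAM_GPU_MODEL,
    };

    if (ioctl(fd, DRM_IOCTL_ETNAVIV_GET_PARAM, &req) == 0)
        printf("GPU model: GC%llx\n", (unsigned long long)req.value);
    return 0;
}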
Final Push: Getting Everything to Boot
There was still one major obstacle — getting the DUT to boot reliably. Luckily, mupuf was already working on it and made a significant MR with over 80 patches to address the boot issues, introducing “boots db,” a feature designed to decouple the boot process by granting each job full control over its DHCP, TFTP, and HTTP servers. This is paired with YAML configurations to flexibly define the boot specifics for each board.
As of a few days ago, the latest official ci-tron gateway image contains everything needed to get an etnaviv DUT up and running successfully.
I have to say, I’m very impressed with the end result. It took a lot longer than I had anticipated, but we finally have a plug-and-play CI farm solution for etnaviv. There are still a few missing features—like Network Block Device (NBD) support and some advanced statistics—but the ci-tron team is doing an excellent job, and I’m optimistic about what’s coming next.
Conclusion: A Long Road, but Worth It
The journey to get the etnaviv board farm back online was longer than expected, full of unexpected challenges and technical hurdles. But it was worth it. The result is a robust, automated solution that makes CI testing easier and more reliable for everyone. With ci-tron, it’s easier to find and fix problems before they ever make it into the code base, which is exactly what a good CI setup should be all about. There is still some work to be done on the GitLab side to switch all etnaviv jobs to the new board farm.
If you’re thinking about setting up your own CI farm or migrating to ci-tron, I hope my experience helps smooth the road for you a bit. It might be a long journey, but the end results are absolutely worth it.
Unleashing the power of 3D graphics in the Raspberry Pi is a key commitment for
Igalia through its collaboration with Raspberry
Pi. The introduction of Super Pages for the
Raspberry Pi 4 and 5 marks another step in this journey, offering some
performance enhancements and more efficient memory usage. In this post, we’ll
dive deep into the technical details of Super Pages, discuss the challenges we
faced during implementation, and illustrate the benefits this feature brings to
the Raspberry Pi ecosystem.
What are Super Pages?
A Memory Management Unit (MMU) is a hardware component responsible for handling
memory access at the system level. It translates virtual addresses used by
programs into physical addresses in main memory, enabling efficient memory
management and protection. The MMU allows the operating system to allocate
memory dynamically, isolating processes from one another to prevent them from
interfering with each other’s memory.
Recommendation: 📚 Structured computer organization by Andrew Tanenbaum
The V3D MMU, which is part of the Broadcom GPU found in the Raspberry Pi 4 and
5, is responsible for translating 32-bit virtual addresses (VA) used by V3D into
40-bit physical addresses used externally to V3D. The MMU relies on a page
table, stored in physical memory, which maps virtual addresses to their
corresponding physical addresses. The operating system manages this page table,
and the MMU uses it to perform address translation during memory access.
A fundamental principle of modern operating systems is that memory is not stored
contiguously. Instead, a contiguous block of memory is divided into smaller
blocks, called “pages”, which are scattered across the entire address space.
These pages are typically 4KB in size. This approach enables more efficient
memory management and allows for features like virtual memory and memory
protection.
Over the years, the amount of available memory in computers has increased
dramatically. An early IBM PC had up to 640 KiB of RAM, whereas the ThinkPad I’m
typing on right now has 32 GB of RAM. Naturally, memory demands have grown
alongside this increase. Today, it’s common for web browsers to consume several
gigabytes of RAM, and a single shader can take up multiple megabytes.
As memory usage grows, a 4KB page size may become inefficient for managing large
memory blocks. Handling a large number of small pages for a single block means
the MMU must perform multiple address translations, which increases overhead.
This can reduce the effectiveness of the Translation Lookaside Buffer (TLB), as
it must store and handle more entries, potentially leading to more cache misses
and reduced overall performance.
This is why many CPU manufacturers have introduced support for larger page
sizes. For instance, x86 CPUs typically support 4KB and 2MB pages, with 1GB
pages available if supported by the hardware. Similarly, ARM64 CPUs can support
4KB, 16KB, and 64KB page sizes. These larger page sizes help reduce the number
of pages the MMU needs to manage, improving performance by reducing the overhead
of address translation and making more efficient use of the TLB.
So, if CPUs are using bigger sizes, why shouldn’t GPUs do the same?
By default, V3D supports 4KB pages. However, by setting specific bits in the
page table entry, it is possible to create 64KB “Big Pages” and 1MB “Super
Pages.” The issue is that the current V3D driver available in Linux does not
enable the use of Big or Super Pages, meaning this hardware feature is currently
unused.
The advantage of enabling Big and Super Pages is that once an entry for any page
within a Big or Super Page is cached in the MMU, it can be used to translate all
virtual addresses within that page’s range without needing to fetch additional
entries. In theory, this should result in improved performance, especially for
applications with high memory demands, such as those using multiple large buffer
objects (BOs).
As Igalia continually strives to enhance the experience for Raspberry Pi users,
we decided to implement this feature in the upstream kernel. But before diving
into the implementation details, let’s take a look at the real-world results and
see if the theoretical benefits of Super Pages have translated into measurable
improvements for Raspberry Pi users.
What Does This Feature Mean for RPi Users?
With Super Pages implemented, let’s now explore the actual performance
improvements observed on the Raspberry Pi and see how impactful this feature is
for users.
Benchmarking Super Pages: Traces and FPS Improvements
To measure the impact of Super Pages, we tested a variety of games and demos
traces on the Raspberry Pi 4 and 5, covering genres from action to racing. On
average, we observed a +1.40% FPS improvement on the Raspberry Pi 4 and a +1.30%
improvement on the Raspberry Pi 5.
For instance, on the Raspberry Pi 4, Warzone 2100 saw an 8.36% FPS increase,
and on the Raspberry Pi 5, Quake II enjoyed a 3.62% boost. These examples
demonstrate the benefits of Super Pages in resource-demanding applications,
where optimized memory handling becomes critical.
Raspberry Pi 4 FPS Improvements
| Trace | Before Super Pages | After Super Pages | Improvement |
|---|---|---|---|
| warzone2100.30secs.1024x768.trace | 56.39 | 61.10 | +8.36% |
| ue4_shooter_game_shooting_low_quality_640x480.gfxr | 20.71 | 21.47 | +3.65% |
| quake3e_capture_frames_1800_through_2400_1920x1080.gfxr | 60.88 | 62.50 | +2.67% |
| supertuxkart-menus_1024x768.trace | 112.62 | 115.61 | +2.65% |
| ue4_shooter_game_shooting_high_quality_640x480.gfxr | 20.45 | 20.88 | +2.10% |
| quake2-gles3-1280x720.trace | 59.76 | 60.84 | +1.82% |
| ue4_sun_temple_640x480.gfxr | 27.60 | 28.03 | +1.54% |
| vkQuake_capture_frames_1_through_1200_1280x720.gfxr | 54.59 | 55.30 | +1.29% |
| ue4_shooter_game_low_quality_640x480.gfxr | 32.75 | 33.08 | +1.00% |
| sponza_demo02_800x600.gfxr | 20.90 | 21.03 | +0.61% |
| supertuxkart-racing_1024x768.trace | 8.58 | 8.63 | +0.60% |
| ue4_shooter_game_high_quality_640x480.gfxr | 19.62 | 19.74 | +0.59% |
| serious_sam_trace02_1280x720.gfxr | 44.00 | 44.21 | +0.50% |
| ue4_vehicle_game-2_640x480.gfxr | 12.59 | 12.65 | +0.49% |
| sponza_demo01_800x600.gfxr | 21.42 | 21.46 | +0.19% |
| quake3e-1280x720.trace | 84.45 | 84.52 | +0.09% |
Raspberry Pi 5 FPS Improvements
| Trace | Before Super Pages | After Super Pages | Improvement |
|---|---|---|---|
| quake2-gles3-1280x720.trace | 151.77 | 157.26 | +3.62% |
| supertuxkart-menus_1024x768.trace | 306.79 | 313.88 | +2.31% |
| warzone2100.30secs.1024x768.trace | 140.92 | 144.03 | +2.21% |
| vkQuake_capture_frames_1_through_1200_1280x720.gfxr | 131.45 | 134.20 | +2.10% |
| ue4_vehicle_game-2_640x480.gfxr | 24.42 | 24.88 | +1.89% |
| ue4_shooter_game_high_quality_640x480.gfxr | 32.12 | 32.53 | +1.29% |
| ue4_sun_temple_640x480.gfxr | 42.05 | 42.55 | +1.20% |
| ue4_shooter_game_shooting_high_quality_640x480.gfxr | 52.77 | 53.31 | +1.04% |
| quake3e-1280x720.trace | 238.31 | 240.53 | +0.93% |
| warzone2100.70secs.1024x768.trace | 151.09 | 151.81 | +0.48% |
| sponza_demo02_800x600.gfxr | 50.81 | 51.05 | +0.46% |
| supertuxkart-racing_1024x768.trace | 20.91 | 20.98 | +0.33% |
| ue4_shooter_game_low_quality_640x480.gfxr | 59.68 | 59.86 | +0.29% |
| quake3e_capture_frames_1_through_1800_1920x1080.gfxr | 167.70 | 168.17 | +0.29% |
| ue4_shooter_game_shooting_low_quality_640x480.gfxr | 53.40 | 53.51 | +0.22% |
| quake3e_capture_frames_1800_through_2400_1920x1080.gfxr | 163.37 | 163.64 | +0.17% |
| serious_sam_trace02_1280x720.gfxr | 60.00 | 60.03 | +0.06% |
| sponza_demo01_800x600.gfxr | 45.04 | 45.04 | <0.01% |
While an average +1% FPS improvement might seem modest, Super Pages can deliver
more noticeable gains in memory-intensive 3D applications and when the GPU is
under heavy usage. Let’s see how the Super Pages perform on Mesa CI.
Benchmarking Super Pages: Mesa CI Job Duration
To avoid introducing regressions in user-space, I usually test my custom kernels
with Mesa CI, focusing on the “broadcom-postmerge” stage to verify that all
Piglit and CTS tests ran smoothly. For Super Pages, I was pleasantly surprised
by the job duration results, as some job durations were reduced by several
minutes.
Mesa CI Jobs Duration Improvements
| Job | Before Super Pages | After Super Pages |
|---|---|---|
| v3d-rpi4-traces:arm64 | ~4m30s | ~3m40s |
| v3d-rpi5-traces:arm64 | ~3m30s | ~2m45s |
| v3d-rpi4-gl-full:arm64 */6 | ~24-25 minutes | ~22-23 minutes |
| v3d-rpi5-gl-full:arm64 | ~48 minutes | ~48 minutes |
| v3dv-rpi4-vk-full:arm64 */6 | ~44 minutes | ~41 minutes |
| v3dv-rpi5-vk-full:arm64 | ~102 minutes | ~92 minutes |
Seeing these reductions is especially rewarding. For example, the
“v3dv-rpi5-vk-full:arm64” job duration decreased by 10 minutes, meaning more FPS
for users and shorter wait times for Mesa developers.
Benchmarking Super Pages: PS2 Emulation
After sharing a couple of tables, I’ll admit that showcasing performance
improvements solely through numbers doesn’t always convey the real impact.
Personally, I find it more satisfying to see performance gains in action with
real-world applications.
This led me to explore PlayStation 2 (PS2) emulation on the RPi 5. From watching
YouTube videos, I noticed that PS2 is a popular console for the RPi 5. While the
PlayStation (PS1) emulates well even on the RPi 4, and Nintendo 64 and Sega
Saturn struggle across most hardware, PS2 hits a sweet spot for testing the RPi
5’s limits.
Fortunately, I still have my childhood PS2 — my second console after the
Nintendo GameCube, and one of the most successful consoles worldwide, including
in Brazil. With a library packed with titles like Metal Gear Solid, Resident
Evil, Tomb Raider, and Shadow of the Colossus, the PS2 remains a great
system for collectors and retro gamers alike.
I selected a few games from my collection to benchmark on the RPi 5 using a PS2
emulator. My emulator of choice was AetherSX2 with Vulkan support.
Although AetherSX2 is no longer in development, it still performs well on the
RPi.
Initially, many games were barely playable, especially those with large buffer
objects, like Shadow of the Colossus and Gran Turismo 4. However, after enabling
Super Pages support, I noticed immediate improvements. For example, Shadow of
the Colossus wouldn’t even open before Super Pages, and while it’s not fully
playable yet, it does load now. This isn’t a silver bullet, but it’s a step
forward in improving the driver one piece at a time.
I ended up selecting four games for a video comparison: Burnout 3: Takedown,
Metal Gear Solid 3: Snake Eater, Resident Evil 4, and Tekken 4.
Disclaimer: The BIOS used in the emulator was extracted from my own PS2,
and I played only games I own, with ROMs I personally extracted. Neither I nor
Igalia encourage using downloaded BIOS or ROM files from the internet.
From the video, we can see noticeable improvements in all four games. Although
they aren’t perfectly playable yet, the performance gains are evident,
particularly in Resident Evil 4, where the gameplay saw a solid 5 FPS boost. I
realize 18 FPS might not satisfy most players, but I still had a lot of fun
playing Resident Evil 4 on the RPi 5.
When tracking the FPS for these games, it’s clear that the performance gains go
well beyond the average 1% seen in other benchmarks. Super Pages show their true
potential in high-memory applications like PS2 emulation.
Having seen the performance gains Super Pages can bring to the Raspberry Pi,
let’s now dive into the technical aspects of the feature.
Implementing Super Pages
The first challenge was figuring out how to allocate a contiguous block of
memory using shmem. The Shared Memory Virtual Filesystem (shmem) is used as
a flexible memory mechanism that allows the GPU and CPU to share access to BOs
through the system’s temporary filesystem, tmpfs. tmpfs is a volatile
filesystem that stores files in RAM, making it ideal for temporary or high-speed
data that doesn’t need to persist across reboots.
For example, to allocate a 256KB BO across four 64KB pages, we need four
contiguous 64KB memory blocks. However, by default, tmpfs only allocates
memory in PAGE_SIZE chunks (as seen in shmem_file_setup()), whereas
PAGE_SIZE is 4KB on the Raspberry Pi 4 and 16KB on the Raspberry Pi 5. Since
the function drm_gem_object_init() — which initializes an allocated
shmem-backed GEM object — relies on shmem_file_setup() to back these objects
in memory, we had to consider alternatives, as the default PAGE_SIZE would
divide memory into increments that are too small to ensure the large, contiguous
blocks needed by the GPU.
The solution we proposed was to create drm_gem_object_init_with_mnt(), which
allows us to specify the tmpfs mountpoint where the GEM object will be
created. This enables us to allocate our BOs in a mountpoint that supports
larger page sizes. Additionally, to ensure that our BOs are allocated in the
correct mountpoint, we introduced drm_gem_shmem_create_with_mnt(), which
allows the mountpoint to be specified when creating a new DRM GEM shmem object.
[PATCH v6 04/11] drm/gem: Create a drm_gem_object_init_with_mnt() function
[PATCH v6 06/11] drm/gem: Create shmem GEM object in a given mountpoint
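Put together, the allocation path can be pictured like this (a condensed sketch around the signatures the series introduces, not the verbatim v3d code; the gemfs field stands for the driver-private mount set up below):
#include <drm/drm_gem_shmem_helper.h>

/* Sketch: back the GEM object with v3d's own tmpfs mount (which can
 * hand out huge pages) when it exists, otherwise fall back to the
 * default shmem mount. */
static struct drm_gem_shmem_object *
v3d_bo_alloc(struct v3d_dev *v3d, size_t size)
{
	if (v3d->gemfs)
		return drm_gem_shmem_create_with_mnt(&v3d->drm, size,
						     v3d->gemfs);

	return drm_gem_shmem_create(&v3d->drm, size);
}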
The next challenge was figuring out how to create a new mountpoint that would
allow for different page sizes based on the allocation. Simply creating a new
tmpfs mountpoint with a fixed bigger page size wouldn’t suffice, as we needed
flexibility for various allocations. Inspired by the i915 driver, we decided to
use a tmpfs mountpoint with the “huge=within_size” flag. This flag, which
requires the kernel to be configured with CONFIG_TRANSPARENT_HUGEPAGE, enables
the allocation of huge pages.
Transparent Huge Pages
(THP) is a kernel
feature that automatically manages large memory pages to improve performance
without needing changes from applications. THP dynamically combines smaller
pages into larger ones, typically 2MB, reducing memory management overhead and
improving cache efficiency.
To support our new allocation strategy, we created a dedicated tmpfs
mountpoint for V3D, called gemfs, which provides us an ideal space for managing
these larger allocations.
[PATCH v6 05/11] drm/v3d: Introduce gemfs
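A condensed sketch of such a mount helper, modeled on the i915 approach mentioned above (the exact v3d code differs; error handling is trimmed):
#include <linux/fs.h>
#include <linux/mount.h>

/* Sketch: mount a private tmpfs instance with huge=within_size so that
 * shmem transparently backs large objects with huge pages. Requires
 * CONFIG_TRANSPARENT_HUGEPAGE. */
static struct vfsmount *gemfs_init(void)
{
	static char huge_opt[] = "huge=within_size";
	struct file_system_type *type = get_fs_type("tmpfs");

	if (!type)
		return NULL;

	return vfs_kern_mount(type, SB_KERNMOUNT, type->name, huge_opt);
}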
With everything in place for contiguous allocations, the next step was configuring V3D to enable Big/Super Page support.
We began by addressing a major source of memory pressure on the Raspberry Pi:
the current 128KB alignment for allocations in the virtual memory space. This
alignment wastes space when handling small BO allocations, especially since the
userspace driver performs a large number of these small allocations.
As a result, we can’t fully utilize the 4GB address space available for the GPU
on the Raspberry Pi 4 or 5. For example, we can currently allocate up to 32,000
BOs of 4KB (~140MB) and 3,000 BOs of 400KB (~1.3GB). This becomes a limitation
for memory-intensive applications. By reducing the page alignment to 4KB, we can
significantly increase the number of BOs, allowing up to 1,000,000 BOs of 4KB
(~4GB) and 10,000 BOs of 400KB (~4GB).
Therefore, the first change I made was reducing the VA alignment of all
allocations to 4KB.
[PATCH v6 07/11] drm/v3d: Reduce the alignment of the node allocation
With the alignment issue resolved, we can now implement the code to properly set
the flags on the Page Table Entries (PTE) for Big/Super Pages. Setting these
flags is straightforward — a simple bitwise operation. The challenge lies in
determining which BOs can be allocated in Super Pages. For a BO to be eligible
for a Big Page, its virtual address must be aligned to 64KB, and the same
applies to its physical address. Same thing for Super Pages, but now the
addresses must be aligned to 1MB.
If the BO qualifies for a Big/Super Page, we need to iterate over 16 4KB pages
(for Big Pages) or 256 4KB pages (for Super Pages) and insert the appropriate
PTE.
Additionally, we modified the way we iterate through the BO’s memory. This was
necessary because the THP may not always allocate the entire BO contiguously.
For example, it might only allocate contiguously 1MB of a 2MB block. To handle
this, we now iterate over the blocks of contiguous memory scattered across the
scatterlist, ensuring that each segment is properly handled during the
allocation process.
What is a scatterlist? It is a Linux Kernel data structure that manages
non-contiguous memory as if it were contiguous. It organizes separate memory
blocks into a single logical buffer, allowing efficient data handling,
especially in Direct Memory Access (DMA) operations, without needing a
physically contiguous memory allocation.
[PATCH v6 08/11] drm/v3d: Support Big/Super Pages when writing out PTEs
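Condensed, the PTE write-out looks roughly like this (a simplified sketch, not the verbatim kernel code; the V3D_PTE_* bit names follow the series):
/* Sketch: walk one contiguous DMA run and emit PTEs. Every 4KB slot
 * still maps its own page; the BIGPAGE/SUPERPAGE bit tells the MMU the
 * surrounding 64KB/1MB is contiguous, so one cached entry covers the
 * whole range. */
static void write_ptes(u32 *pt, u32 page, u32 prot,
		       dma_addr_t dma_addr, size_t len)
{
	while (len) {
		u32 pte = prot | (dma_addr >> V3D_MMU_PAGE_SHIFT);
		unsigned int slots = 1, i;

		if (len >= SZ_1M && IS_ALIGNED(dma_addr, SZ_1M) &&
		    IS_ALIGNED(page << V3D_MMU_PAGE_SHIFT, SZ_1M)) {
			pte |= V3D_PTE_SUPERPAGE;
			slots = SZ_1M >> V3D_MMU_PAGE_SHIFT; /* 256 x 4KB */
		} else if (len >= SZ_64K && IS_ALIGNED(dma_addr, SZ_64K) &&
			   IS_ALIGNED(page << V3D_MMU_PAGE_SHIFT, SZ_64K)) {
			pte |= V3D_PTE_BIGPAGE;
			slots = SZ_64K >> V3D_MMU_PAGE_SHIFT; /* 16 x 4KB */
		}

		for (i = 0; i < slots; i++)
			pt[page + i] = pte + i;

		page += slots;
		dma_addr += (dma_addr_t)slots << V3D_MMU_PAGE_SHIFT;
		len -= (size_t)slots << V3D_MMU_PAGE_SHIFT;
	}
}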
However, the last few patches alone don’t fully enable the use of Super Pages.
While PATCH 08/11 technically allows for Super Pages, we’re still relying on DRM
GEM shmem objects, meaning allocations are still happening in PAGE_SIZE
chunks. Although Big/Super Pages could potentially be used if the system
naturally allocated 1MB or 64KB contiguously, this is quite rare and not our
intended outcome. Our goal is to actively use Big/Super Pages as much as
possible.
To achieve this, we’ll utilize the V3D-specific mountpoint we created earlier
for BO allocation whenever possible. By creating BOs through
drm_gem_shmem_create_with_mnt(), we can ensure that large pages are allocated
contiguously when possible, enabling the consistent use of Big/Super Pages.
[PATCH v6 09/11] drm/v3d: Use gemfs/THP in BO creation if available
And there you have it — Big/Super Pages are now fully enabled in V3D. The only
requirement to activate this feature in any given kernel is ensuring that
CONFIG_TRANSPARENT_HUGEPAGE is enabled.
Final Words
You can learn more about ongoing enhancements to the Raspberry Pi driver stack
in this XDC 2024 talk by
José María “Chema” Casanova Crespo. In the talk, Chema discusses the Super
Pages work I developed, along with other advancements in the driver stack.
Of course, there are still plenty of improvements on the horizon at Igalia. I’m
currently experimenting with 64KB CLE allocations in user-space, and I hope to
share more good news soon.
Finally, I’d like to express my gratitude to Iago
Toral and Tvrtko
Ursulin for their invaluable support in
developing Super Pages for the V3D kernel driver. Thank you both for sharing
your experience with me!
(I worked on this feature last year, before being moved off desktop related projects, but I never saw it documented anywhere other than in the original commit messages, so here's the opportunity to shine a little light on a feature that could probably see more use.)
The new usb_set_wireless_status() driver API function can be used by drivers of USB devices to export whether the wireless device associated with that USB dongle is turned on or not. To quote the commit message:
This will be used by user-space OS components to determine whether the
battery-powered part of the device is wirelessly connected or not,
allowing, for example:
- upower to hide the battery for devices where the device is turned off
but the receiver plugged in, rather than showing 0%, or other values
that could be confusing to users
- Pipewire to hide a headset from the list of possible inputs or outputs
or route audio appropriately if the headset is suddenly turned off, or
turned on
- libinput to determine whether a keyboard or mouse is present when its
receiver is plugged in.
This is not an attribute that is meant to replace protocol specific
APIs [...] but solely for wireless devices with
an ad-hoc “lose it and your device is e-waste” receiver dongle.
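From the driver side, reporting the status is a single call. A sketch (the function and enum values are the real kernel API; the surrounding driver context is hypothetical):
#include <linux/usb.h>

/* Sketch: a driver for a wireless dongle reporting whether the remote,
 * battery-powered device is currently connected. */
static void headset_report_link(struct usb_interface *intf, bool connected)
{
	usb_set_wireless_status(intf, connected ?
				USB_WIRELESS_STATUS_CONNECTED :
				USB_WIRELESS_STATUS_DISCONNECTED);
}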
Currently, the only two drivers to use this are the ones for the Logitech G935 headset and the Steelseries Arctis 1 headset. Adding support for other Logitech headsets would be possible if they export battery information (the protocols are usually well documented), and support for more Steelseries headsets should be feasible if the protocol has already been reverse-engineered.
As for consumers of this sysfs attribute, I filed a bug against Pipewire (link) to use it to not consider the receiver dongle as good as unplugged if the headset is turned off, which would avoid audio being sent to headsets that won't hear it. UPower supports this feature since version 1.90.1 (although it had a bug that makes 1.90.2 the first viable release to include it), and batteries will appear and disappear when the device is turned on/off.
Hi!
This month XDC 2024 took place in Montreal. I wasn’t there in-person, but
thanks to the organizers I could still ask questions and attend workshops
remotely (thanks!). As usual, XDC has been a great reminder of many things I
wanted to do but which got buried under a pile of emails. We’ve discussed the
upcoming KMS color management uAPI again, I’ve taken a bit of time to send more
comments and it looks like this one is getting close to completion (famous last
words). We’ve also discussed display muxing (switching a connector from
one GPU to another one); it’s quite fun how surprisingly tricky this process
is. Another topic was better multi-GPU support, in particular how to avoid
going through the main GPU when an application is rendered and displayed on
a secondary GPU. I’ve sent a proposal to improve the kernel DMA-BUF uAPI.
New this year was the Wayland workshop organized by Mike Blumenkrantz,
Daniel Stone and Jonas Ådahl. We’ve discussed the governance change proposals
sent earlier this month. Various changes are being discussed, all have the goal
to lower the barrier to entry when contributing a protocol and preventing
patches from getting stuck. I’m excited to see how this turns out!
We’ve finally started the release candidate cycle for Sway 1.10. I’ve released
Sway 1.10-rc4 this weekend with a bunch more fixes, I’m hoping the final
release can go out soon! I’ve also released the long overdue cage 0.2.0,
which fast forwards wlroots to version 0.18 and adds primary selection support.
I’ve sent a patch to add a udmabuf allocator to wlroots. This is useful for
running the wlroots GLES2 and Vulkan renderers with software rendering
(e.g. llvmpipe and lavapipe), which is handy for CI and exercises the same
codepaths as real hardware instead of the seldom used Pixman renderer.
wlroots-rs has been updated to wlroots v0.18, and I’ve revamped the way the
compositor state is managed. Previously the library forced the use of
Rc<RefCell<T>> to hold the state, which caused issues with double mutable
borrows at runtime when compositor callbacks were nested (wlroots invokes
compositor callback which borrows state and calls into wlroots which invokes
another compositor callback which borrows state). With the new design the
compositor must pass its state as an argument to all wlroots functions which
may emit signals and call back into the compositor.
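To make the difference concrete, here's a minimal self-contained sketch of the two designs (not the actual wlroots-rs API, just the state-management pattern):
// Old design: state lives in Rc<RefCell<State>>. If a callback
// re-enters the library while a borrow is live, the second
// borrow_mut() panics at runtime ("already mutably borrowed").
use std::cell::RefCell;
use std::rc::Rc;

struct State { frames: u32 }

fn old_style(state: Rc<RefCell<State>>) {
    let mut s = state.borrow_mut(); // first mutable borrow
    s.frames += 1;
    // state.borrow_mut(); // re-entry here would panic at runtime
}

// New design: the compositor passes &mut State into every call that
// may emit signals, so nested callbacks reuse the same borrow and the
// compiler rules out double mutable borrows entirely.
fn new_style(state: &mut State, emit_signal: impl FnOnce(&mut State)) {
    state.frames += 1;
    emit_signal(state); // nested "callback" gets the same &mut
}

fn main() {
    old_style(Rc::new(RefCell::new(State { frames: 0 })));
    let mut state = State { frames: 0 };
    new_style(&mut state, |s| s.frames += 1);
    println!("{} frames", state.frames);
}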
delthas has contributed a whole bunch of soju patches used by his new hosted
bouncer service, IRC Today. Uploaded videos and PDF files can now be viewed
inline in Web browsers, a new HTTP basic authentication backend has been added,
file uploads can now be delegated to a separate HTTP backend, a new
soju.im/SAFERATE specification indicates when clients don’t need to
rate-limit their messages, and a bunch of various smaller improvements and
fixes. A bunch of exciting new features are in the pipeline as well (but I
won’t spoil them just yet)!
Matthew Hague has contributed TLS certificate pinning to Goguma. When hitting
an invalid certificate, Goguma will now offer the user a choice to trust this
specific certificate (trust on first use). gamja now supports drag-and-drop for
file uploads thanks to xse. Both gamja and Goguma have moved to Codeberg, I
hope this lowers the barrier to entry for contributing. A tiny
NPotM is soju-containers, a
repository containing Dockerfiles for soju and gamja, for easy deployment and
testing.
Both hottub and yojo now have support for build secrets. For hottub,
secrets are only enabled when the owner pushes commits (and enables the feature
at setup time). For yojo, the owner needs to enable the feature at setup time
and can then select specific secrets to expose on specific repositories. All of
this is locked down to prevent collaborators from gaining access to arbitrary
secrets when pushing to a repository.
That’s all for now, see you next month!
Struggling
Last week was XDC. I did too much Wayland, and now I’ve been stricken with a plague for my hubris.
I have some updates, but I lack the ability to fully capture the exploits of Mesa’s most sane developer in the ten minutes I’m awake every day. In the meanwhile, let’s take a look at another potential example of great hubris.
Hm.
Have you ever made a decision that seemed great at the time but then you realized later it was actually maybe not that great? Like, maybe it was actually really, uh, well, not dumb since nobody reading this blog would do something like that, but not…smart. And everyone else was kinda going along with your decision and trusting that you knew what you were talking about because let’s face it, you’re smart. Everyone knows how smart you are. That’s why they trust you to make these decisions.
Long-time SGC readers know I’m not one to make decisions of any kind, but we all remember that time Microsoft famously introduced Work Graphs to D3D and also (quietly) deprecated ExecuteIndirect. The argument was compelling: why not just move all the work to the GPU?
Haters described Work Graphs as just another attempt by the driver cartel to blame bugs on app developers by making tooling impossible. The rest of us were all in—we jumped on that bandwagon like it was the last triangle in the pipe before a crash. It wasn’t long before the high-powered players were aboard:
NVIDIA
AMD
Details were light at this stage. There were no benchmarks, no performance numbers, no games or applications using Work Graphs, but everyone trusted Microsoft. Everyone knew the idea of this tech was sound, that it had to be faster.
Microsoft doubled down: Work Graphs would support mesh nodes for drawing!
Other graphics wizards began to get involved. The developerverse was in a tizzy. Everyone wanted in on the action.
The hype train had departed the station.
Hm?
Six months after GDC, the first notable performance figures for Work Graphs were blogged about by AAA graphics rockstar, Kostas Anagnostou. I was at a Khronos F2F when it happened, and the number of laptop screens open to the post when it dropped was nonzero. Very nonzero.
At best, the figures were whelming.
Still there was no real analysis of Work Graph performance in comparison to alternative solutions. Haters will say I’m biased after recently shipping Vulkan’s device generated commands extension, but this was going to ship regardless since vkd3d-proton requires cross-vendor compatibility for ExecuteIndirect functionality used in games like Halo Infinite and Starfield. I’m all about the numbers. Show me the graphs. The perf graphs, that is.
Fortunately, friend of the blog and veteran vertex wrangler, Hans-Kristian Arntzen, always has my back. He’s spent the past few months heroically writing vkd3d-proton emulation for Work Graphs, and he has recently posted his findings to an obscure README in that repository.
READ IT. SERIOUSLY. YES, THIS IS A FULL PAGE-WIDTH LINK SO YOU CAN’T POSSIBLY MISS IT.
If you’re just here for the quick summary (which you shouldn’t be considering how much time he has spent making charts and graphs, and taking screenshots, and summing everything up in bite-sized morsels for easy consumption):
Across the board, Work Graph performance is not very exciting
Emulation with core Vulkan compute shader features is up to 3x faster
Comparison test cases against ExecuteIndirect (which show EI being worse) do not effectively leverage that functionality, as noted by Hans-Kristian nearly six months ago
The principle of charity requires taking serious claims in the best possible light. This should have yielded robust, powerful ExecuteIndirect benchmark usage (and even base compute/mesh shader usage) to provide competitive benchmarks against Work Graph functionality. At the time of writing, those benchmarks have yet to materialize, and the only test cases are closer to strawmen that can be held up for an easy victory.
I’m not saying that Work Graphs are inherently bad.
Yet.
At this point, however, I haven’t seen compelling evidence which validates the hype surrounding the tech. I haven’t seen great benchmarks and demos. Maybe it’s a combination of that and still-improving driver support. Maybe it’s as-yet available functionality awaiting future hardware. In any case, I haven’t seen a strong, fact-based technical argument which proves, beyond a doubt, that this is the future of graphics.
Before anyone else tries to jump on the Work Graph hype train, I think we owe it to ourselves to thoroughly interrogate this new paradigm and make sure it provides the value that everyone expects.
Gaming on Linux on M1 is here! We’re thrilled to release our Asahi
game playing toolkit, which integrates our Vulkan 1.3 drivers with x86
emulation and Windows compatibility. Plus a bonus: conformant OpenCL
3.0.
Asahi Linux now ships the only conformant OpenGL®,
OpenCL™,
and Vulkan®
drivers for this hardware. As for gaming… while today’s release is an
alpha, Control
runs well!
Installation
First, install Fedora Asahi
Remix. Once installed, get the latest drivers with
dnf upgrade --refresh &&
reboot. Then just dnf install steam and play. While all
M1/M2-series systems work, most games require 16GB of memory due to
emulation overhead.
The stack
Games are typically x86 Windows binaries rendering with DirectX,
while our target is Arm Linux with Vulkan. We need to handle each
difference:
FEX emulates x86 on Arm.
Wine translates Windows to
Linux.
DXVK and vkd3d-proton
translate DirectX to Vulkan.
There’s one curveball: page size. Operating systems allocate memory
in fixed-size “pages”. If an application expects smaller pages than the
system uses, it will break due to insufficient alignment of
allocations. That’s a problem: x86 expects 4K pages but Apple systems
use 16K pages.
While Linux can’t mix page sizes between processes, it can
virtualize another Arm Linux kernel with a different page size. So we
run games inside a tiny virtual machine using muvm, passing through
devices like the GPU and game controllers. The hardware is happy because
the system is 16K, the game is happy because the virtual machine is 4K,
and you’re happy because you can play Fallout
4.
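A quick way to see which world you're in is to ask the kernel for its page size; under the stack described here, bare metal should report 16384 and the muvm guest 4096 (a trivial hedged check, not part of the stack itself):
/* Print the page size the running kernel uses. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}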
Vulkan
The final piece is an adult-level Vulkan driver, since translating
DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote
Honeykrisp,
the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK
support. Let’s look at some new features.
Tessellation
Tessellation enables games like The
Witcher 3 to generate geometry. The M1 has hardware
tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We
must instead tessellate with arcane compute shaders, as detailed in today’s talk at
XDC2024.
Geometry shaders
Geometry shaders are an older, cruder method to generate geometry.
As with tessellation, the M1 lacks geometry shader hardware, so we emulate
with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs.
They don’t need to be fast – just fast enough for games like Ghostrunner.
Enhanced robustness
“Robustness” permits an application’s shaders to access buffers
out-of-bounds without crashing the hardware. In OpenGL and Vulkan,
out-of-bounds loads may return arbitrary elements, and out-of-bounds
stores may corrupt the buffer. Our OpenGL driver exploits
this definition for efficient robustness on the M1.
Some games require stronger guarantees. In DirectX, out-of-bounds
loads return zero, and out-of-bounds stores are ignored. DXVK therefore
requires VK_EXT_robustness2,
a Vulkan extension strengthening robustness.
Like before, we implement robustness with compare-and-select
instructions. A naïve implementation would compare a loaded
index with the buffer size and select a zero result if
out-of-bounds. However, our GPU loads are vector while arithmetic is
scalar. Even if we disabled page faults, we would need up to four
compare-and-selects per load.
load R, buffer, index * 16
ulesel R[0], index, size, R[0], 0
ulesel R[1], index, size, R[1], 0
ulesel R[2], index, size, R[2], 0
ulesel R[3], index, size, R[3], 0
There’s a trick: reserve 64 gigabytes of zeroes using
virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in
64 gigabytes, any index into this region loads zeroes. For out-of-bounds
loads, we simply replace the buffer address with the reserved address
while preserving the index. Replacing a 64-bit address costs just two
32-bit compare-and-selects.
ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
load R, buffer, index * 16
Two instructions, not four.
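The real reservation happens in the GPU's virtual address space, but a hedged userspace analogue shows why it's cheap. An anonymous read-only mapping commits no physical memory up front and reads back zeroes:
/* Hedged userspace analogue of the trick: reserve a huge read-only
 * region whose pages all read back as zeroes. The driver does the
 * equivalent in GPU virtual memory, not with mmap. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 64ULL << 30; /* 64 GiB of virtual zeroes */

    /* Anonymous PROT_READ pages are copy-on-write views of the zero
     * page; MAP_NORESERVE commits no swap, so the reservation is
     * essentially free until touched. */
    void *zeroes = mmap(NULL, size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (zeroes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("64 GiB of zeroes reserved at %p\n", zeroes);
    return 0;
}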
Next steps
Sparse texturing is next for Honeykrisp, which will unlock more DX12
games. The alpha already runs DX12 games that don’t require sparse, like
Cyberpunk
2077.
While many games are playable, newer AAA titles don’t hit 60fps
yet. Correctness comes first. Performance improves next. Indie
games like Hollow
Knight do run full speed.
Beyond gaming, we’re adding general purpose x86 emulation based on
this stack. For more information, see
the FAQ.
Today’s alpha is a taste of what’s to come. Not the final form, but
enough to enjoy Portal
2 while we work towards “1.0”.
Acknowledgements
This work has been years in the making with major contributions
from…
Alyssa Rosenzweig
Asahi Lina
chaos_princess
Davide Cavalca
Dougall Johnson
Ella Stanforth
Faith Ekstrand
Janne Grunau
Karol Herbst
marcan
Mary Guillemard
Neal Gompa
Sergio López
TellowKrinkle
Teoh Han Hui
Rob Clark
Ryan Houdek
… Plus hundreds of developers whose work we build upon, spanning the
Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the
magic of open source.
We hope you enjoy the magic.
Happy gaming.
A few years ago Mike and I discussed adding video support to zink, so that we could provide vaapi on top of vulkan video implementations.
This of course got onto a long TODO list and we nerdsniped each other into moving it along, and this past couple of weeks we finally dragged it over the line.
This MR adds initial support for zink video decode on top of Vulkan Video. It provides vaapi support. Currently it only supports H264 decode, but I've implemented AV1 decode and I've played around a bit with H264 encode. I think adding H265 decode shouldn't be too horrible.
I've tested this mainly on radv, and a bit on anv (but there are some problems I should dig into).
TLDR: if you know what EVIOCREVOKE does, the same now works for hidraw devices via HIDIOCREVOKE.
The HID standard is the most common hardware protocol for input devices. In the Linux kernel HID is typically translated to the evdev protocol which is what libinput and all Xorg input drivers use. evdev is the kernel's input API and used for all devices, not just HID ones.
evdev is mostly compatible with HID but there are quite a few niche cases where they differ a fair bit. And some cases where evdev doesn't work well because of different assumptions, e.g. it's near-impossible to correctly express a device with 40 generic buttons (as opposed to named buttons like "left", "right", ...[0]). In particular for gaming devices it's quite common to access the HID device directly via the /dev/hidraw nodes. And of course for configuration of devices accessing the hidraw node is a must too (see Solaar, openrazer, libratbag, etc.). Alas, /dev/hidraw nodes are only accessible as root - right now applications work around this by either "run as root" or shipping udev rules tagging the device with uaccess.
evdev too can only be accessed as root (or the input group) but many many moons ago when dinosaurs still roamed the earth (version 3.12 to be precise), David Rheinsberg merged the EVIOCREVOKE ioctl. When called the file descriptor immediately becomes invalid, any further reads/writes will fail with ENODEV. This is a cornerstone for systemd-logind: it hands out a file descriptor via DBus to Xorg or the Wayland compositor but keeps a copy. On VT switch it calls the ioctl, thus preventing any events from reaching said X server/compositor. In turn this means that a) X no longer needs to run as root[1] since it can get input devices from logind and b) X loses access to those input devices at logind's leisure so we don't have to worry about leaking passwords.
Real-time forward to 2024 and kernel 6.12 has now gained the HIDIOCREVOKE for /dev/hidraw nodes. The corresponding logind support has also been merged. The principle is the same: logind can hand out an fd to a hidraw node and can revoke it at will, so we don't have to worry about data leakage to processes that should no longer receive events. This is the first of many steps towards more general HID support in userspace. It's not immediately usable since logind will only hand out those fds to the session leader (read: compositor or Xorg), so if you as an application want that fd you need to convince your display server to give it to you. For that we may have something like the inputfd Wayland protocol (or maybe a portal, but right now it seems a Wayland protocol is more likely).
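In code, the pattern looks roughly like this. This is a hedged sketch: the dup()/revoke flow matches how EVIOCREVOKE has long been used, while the HIDIOCREVOKE spelling assumes the new 6.12 uapi in <linux/hidraw.h>:
/* Hedged sketch of the logind revoke pattern. dup()ed fds share the
 * open file description, so revoking the kept copy also kills the
 * copy handed to the client. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/hidraw.h> /* assumed to carry HIDIOCREVOKE on >= 6.12 */

int hand_out_hidraw(const char *path, int *client_fd)
{
    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return -1;
    *client_fd = dup(fd); /* this copy goes to the application */
    return fd;            /* logind keeps this one */
}

void revoke_hidraw(int kept_fd)
{
    /* From now on, reads/writes on any duplicate fail with ENODEV. */
    ioctl(kept_fd, HIDIOCREVOKE, NULL);
    close(kept_fd);
}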
But that aside, let's hooray nonetheless. One step down, many more to go.
One of the other side-effects of this is that logind now has an fd to any device opened by a user-space process. With HID-BPF this means we can eventually "firewall" these devices from malicious applications: we could e.g. allow libratbag to configure your mouse's buttons but block any attempts to upload new firmware. This is very much just an idea for now; there's a lot of code that needs to be written to get there. But getting there we can now, so full of optimism we go[2].
[0] to illustrate: the button that goes back in your browser is actually evdev's BTN_SIDE and BTN_BACK is ... just another button assigned to nothing particular by default.
[1] and c) I have to care less about X server CVEs.
[2] mind you, optimism is just another word for naïveté
I'm happy to announce that the last tweaks have landed and that the fully FOSS libcamera software ISP based IPU6 camera support in Fedora 41 now has no known bugs left. See the Changes page for testing instructions.
Supported hardware
Unlike USB UVC cameras, where all cameras work with a single kernel driver, MIPI cameras like the Intel IPU6 cameras require multiple drivers. The IPU6 input-system CSI receiver driver is common to all laptops with an IPU6 camera, but different laptops use different camera sensors and each sensor needs its own driver. Then there are glue ICs like the LJCA USB IO-expander and the iVSC (Intel Visual Sensing Controller), and there is also the ipu-bridge code which translates Windows-oriented ACPI tables with sensor info into the fwnodes which the Linux drivers expect.
This means that even though IPU6 support has landed in Fedora 41, not all laptops with an IPU6 camera will work. Currently the IPU6 integrated in the following CPU models works if the sensor + glue hw/sw is also supported:
Tiger Lake
Alder Lake
Raptor Lake
Jasper Lake and Meteor Lake also have an IPU6 but there is some more integration work necessary to get things to work there. Getting Meteor Lake IPU6 cameras to work is high on my TODO list.
The mainline kernel IPU6 CSI receiver + libcamera software ISP has been successfully tested on the following models:
Various Lenovo ThinkPad models with ov2740 (INT3474) sensor (1)
Various Dell models with ov01a10 (OVTI01A0) sensor
Dell XPS 13 Plus with ov13b10 (OVTIDB10/OVTI13B1)
Some HP laptops with hi556 sensor (INT3537)
To see which sensor your laptop has, run "ls /sys/bus/i2c/devices". This will show e.g. "i2c-INT3474:00" if you have an ov2740, with INT3474 being the ACPI Hardware ID (HID) for the sensor. See here for a list of currently known HID to sensor mappings. Note that not all of these have upstream drivers yet. In that case, chances are that there is a sensor driver for your sensor here.
We could really use help with getting drivers from there submitted upstream. So if you have a laptop with a sensor which is not in the mainline but is available there, you know a bit of C programming and you are willing to help, then please drop me an email so that we can work together to get the driver upstream.
1) on some ThinkPads the ov2740 sensor fails to start streaming most of the time. I plan to look into this next week and hopefully I can come up with a fix.
MIPI camera integration work done for Fedora 41
After landing the kernel IPU6 CSI receiver and libcamera software ISP support upstream early in the Fedora 41 cycle, there still was a lot of work to do with regards to integrating this into the rest of the stack so that the cameras can actually be used outside of the qcam test app.
The whole stack looks like this: "kernel → libcamera → pipewire | pipewire-camera-consuming-app". The two currently supported pipewire-camera consuming apps are Firefox and GNOME Snapshot.
Once this was all up and running, testing found quite a few bugs which have all been fixed now:
Firefox showing 13 different cameras in its camera selection pulldown for a single IPU6 camera (fix).
Installing pipewire-plugin-libcamera leads to UVC cameras being powered on all the time, causing significant battery drain (bug, bug, discussion, fix).
Pipewire does not always recognize cameras on login (bug, bug, bug, fix).
Pipewire fails to show cameras with relative controls (fix).
spa_libcamera_buffer_recycle sometimes fails, causing the stream to freeze on the first frame (bug, fix).
Firefox chooses a bad default resolution of 640x480. I worked with Jan Grulich to get this fixed and this is fixed as of firefox-130.0.1-3.fc41. Thank you Jan!
Snapshot prefers 4:3 mode, e.g. 1280x1080 on 16:9 camera sensors capable of 1920x1080 (pending fix).
Added intel-vsc-firmware, pipewire-plugin-libcamera and libcamera-ipa to the Fedora 41 Workstation default package-set (pull, pull, pull).
Finally!
Yesterday Khronos published Vulkan 1.3.296 including VK_EXT_device_generated_commands.
Thousands of engineering hours seeing the light of day, and awesome news for Linux gaming.
Device-Generated Commands, or DGC for short, are Vulkan’s equivalent to ExecuteIndirect in Direct3D 12.
Thanks to this extension, originally based on a couple of NVIDIA vendor extensions, it will be possible to prepare sequences of commands to run directly from the GPU, and to execute those sequences without any data going through the CPU.
Also, Proton now has a much-more official leg to stand on when it has to translate ExecuteIndirect from D3D12 to Vulkan while you run games such as Starfield.
The extension not only provides functionality equivalent to ExecuteIndirect.
It goes beyond that and offers more fine-grained control like explicit preprocessing of command sequences, or switching shaders and pipelines with each sequence thanks to something called Indirect Execution Sets, or IES for short, that potentially work with ray tracing, compute and graphics (both regular and mesh shading).
As part of my job at Igalia, I’ve implemented CTS tests for this extension and I had the chance to work very closely with an awesome group of developers discussing specification, APIs and test needs.
I hope I don’t forget anybody and apologize in advance if so.
Mike Blumenkrantz, of course.
Valve contractor, Super Good Coder and current OpenGL Working Group chair who took the initial specification work from Patrick Doane and carried it across the finish line.
Be sure to read his blog post about DGC.
Also incredibly important for me: he developed, and kept up-to-date, an implementation of the extension for lavapipe, the software Vulkan driver from Mesa.
This was invaluable in allowing me to create tests for the extension much faster and making sure tests were in good shape when GPU driver authors started running them.
Spencer Fricke from LunarG.
Spencer did something fantastic here.
For the first time, the needed changes in the Vulkan Validation Layers for such a large extension were developed in parallel while tests and the spec were evolving.
His work will be incredibly useful for app developers using the extension in their games.
It also allowed me to detect test bugs and issues much earlier and fix them faster.
Samuel Pitoiset (Valve contractor), Connor Abbott (Valve contractor), Lionel Landwerlin (Intel) and Vikram Kushwaha (NVIDIA) providing early implementations of the extension, discussing APIs, reporting test bugs and needs, and making sure the extension works as well as possible for a variety of hardware vendors out there.
To a lesser degree, most others mentioned as spec contributors for the extension, such as Hans-Kristian Arntzen (Valve contractor), Baldur Karlsson (Valve contractor), Faith Ekstrand (Collabora), etc, making sure the spec works for them too and makes sense for Proton, RenderDoc, and drivers such as NVK and others.
If you’ve noticed, a significant part of the people driving this effort work for Valve and, from my side, the work has also been carried as part of Igalia’s collaboration with them.
So my explicit thanks to Valve for sponsoring all this work.
If you want to know a bit more about DGC, stay tuned for future talks about this topic.
In about a couple of weeks, I’ll present a lightning talk (5 mins) with an overview at XDC 2024 in Montreal.
Don’t miss it!
Day 4 of Wayland governance hacking
I wake at 5 AM. This is the perfect time to wake up in NYC TZ, as it affords me the ability to eat a whole
apple in the time it takes my little internet-browsing chromebook to load all the IRC and Discord backlogs from
the five hours that I snuck away for a nap when nobody was watching.
I slather the apple with a haphazard scoop of peanut butter; getting away from a keyboard for more than twenty six
seconds in a given stretch is difficult, and I need protein. While entering into a fraught negotiation over
the meaning of 30-day discussion period with my left hand, I carefully scoop protein powder into a shaker with my right.
There’s no time to waste. Not even a single second: another argument could break out, steal a 1973 Pontiac Firebird, and
go joyriding on the wrong side of the freeway.
I’m writing this blog post with my toes. They know their way around a keyboard, but they’re slow and prone to mistakes.
My cat is in charge of hitting an oversized backspace key when I dangle his favorite toy over it. It’ll be hours
before we get something together that can be read coherently.
This is my life now.
This is what it takes to do Open Source.
Final Day: Everything, Everywhere, All At Once
I’ve put up a couple sizable proposals to resolve longstanding issues and oversights in the governance model. Today is Friday,
however, which means it’s the final day. Once we hit the weekend, everyone will collectively fuck off and forget everything
that happened this week, which means I have to maintain peak velocity and finish strong.
Let’s fucking go.
Last proposal.
Problem 1: HOW IS THIS %#$@$#@#$%%$ PROTOCOL STILL STUCK AFTER 4 YEARS?!?!?!?!?
It’s a great question. I asked it myself. The answers are myriad and nebulous, but I’m the guy who explains things, so I’m gonna
break it down.
Imagine you’re wayland-protocols. You’ve got all these puppies. And you’re walking them–so you tell yourself, but really they’re walking themselves.
They’re walking you. And they’re going in whatever direction they want. And out of all these puppies you’ve got two,
one’s trying to go left to chase a car, and the other one’s trying to sniff a telephone pole on the right. The other
fifty seven puppies just want to keep moving because they love their walkies. But these two puppies are the biggest ones,
and they’re pulling the others along with them. So now your leashes are getting all tangled, and you’re being dragged around,
and everyone’s pointing at you because you look like you don’t know what you’re doing.
That’s where we’re at now. Everyone’s laughing at you.
Look at this idiot trying to walk fifty nine puppies at once. This absolute moron. Who would ever do that? Why not just walk
one or maybe two puppies at a time like everyone else? That’s the way you’re supposed to walk them. The way people have always walked them.
But you know what? Walking fifty nine puppies individually would take all day. Nobody has the time to walk fifty nine puppies individually no
matter how cute or eloquent they are. So you need some way to resolve this. Or something.
Look, you get where I’m going with this.
Wayland protocol discussions get bogged down by people throwing out hypotheticals that can’t truly be resolved, or by people talking past
each other, or by people disappearing, or the phase of the moon, or any number of reasons, and there’s no official way to get
past these blockages. That’s why I’m proposing tie-breaker votes as a simple way of moving past these problems when they arise.
Everyone understands tie-breakers: you vote, and the side with the most votes wins. It’s that simple.
In this context, the wayland-protocols member projects
vote (with one of them representing the author for non-members) and the majority wins. If there’s another tie, the author gets to break it.
Simple. Done.
Problem 2: Perfect Is the Enemy of Good
Sometimes a protocol in staging/ is “good enough”. The author has checked out, people are using the protocol,
and everyone is happy with it.
But it’s still not a stable/ protocol.
In this scenario, after an extended period of time without changes, any staging/ protocol can be nominated by a
member project for stable/ promotion.
Some discussion happens, and then it becomes stable.
Simple. Done.
Problem 3: Start Times
The governance model talks about discussion periods, but it doesn’t specify exactly when they begin. For example, on any of my governance MRs,
does the 30-day period start when I open the MR or when the MR is approved?
Obviously it starts when I open the MR. We gotta keep things moving.
Done.
Problem 4: Project Representation
The governance document specifies that a member project
may have up to two official representatives. This can be problematic, as it puts pressure on 1-2 people to be on top of
every active protocol discussion.
Instead, projects should be represented by as many individuals as they want (pending the usual process for adding points-of-contact). This
ensures that protocols don’t get blocked waiting for a given project to take a look when all representatives are busy. It also helps more diverse
projects (e.g., wlroots) ensure that opinions from more of its constituents are officially represented.
Each project still only gets one vote, but now that vote can be more readily deployed and voiced.
I think we’re done here?
From what I’ve seen, this should cover all the major issues that have been negatively impacting Wayland development. Sure, there are other,
more minor issues, but I’m not aware of anything that can’t be solved through good old person-to-person discussion.
Maybe all this works, and maybe it doesn’t. But at least now if we decide to throw away some puppies, nobody can question whether we
really tried everything.
Big.
While other development has been progressing, in the background I’ve been working on something big. Now, finally, I can talk about it.
VK_EXT_device_generated_commands is a new extension which, it’s no exaggeration to say, is the biggest thing Vulkan has shipped since ray-tracing. I had the privilege of working with people across the industry while driving it, from both desktop and mobile hardware vendors, and despite it being EXT, we’re going to see some truly broad adoption here.
Big shoutout to Patrick Doane, formerly of Activision-Blizzard and now (I think) at Deviation Games, for kickstarting this many years ago. Thanks for your work. I hope you’re satisfied with the final product.
What does this do?
DGC enables applications to record commands from shaders to then be executed directly. This means no more ping-ponging back and forth between CPU and GPU, which can help to eliminate performance bottlenecks. See also the NV extension and D3D12 ExecuteIndirect as prior art.
While this functionality is used in big games such as Starfield and Halo Infinite, those examples are ETOOBIG to really comprehend. Also the code is proprietary, so I can’t share it publicly. Also I don’t have the code.
Fortunately, I’ve hacked together a small demo program for people to look over to get a feel for the functionality.
dgcgears is a rough fork of vkgears from mesa-demos (thanks to zink’s own godfather, Erik Faye-Lund for the original work!) which utilizes DGC to execute draws rather than record them directly.
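For a feel of the API shape, here is a hedged sketch of the execution call based on the published spec; creating the indirect commands layout, the execution set and the GPU-written buffer is omitted, and the lowercase variable names are placeholders:
/* Hedged sketch: executing GPU-generated commands with
 * VK_EXT_device_generated_commands. Handle/buffer creation omitted. */
VkGeneratedCommandsInfoEXT info = {
    .sType = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
    .shaderStages = VK_SHADER_STAGE_VERTEX_BIT |
                    VK_SHADER_STAGE_FRAGMENT_BIT,
    .indirectExecutionSet = execution_set,   /* bundled, switchable shaders */
    .indirectCommandsLayout = cmds_layout,   /* describes the token stream */
    .indirectAddress = indirect_addr,        /* buffer the shaders wrote */
    .indirectAddressSize = indirect_size,
    .preprocessAddress = preprocess_addr,    /* driver scratch space */
    .preprocessSize = preprocess_size,
    .maxSequenceCount = max_sequences,
    .sequenceCountAddress = 0,               /* 0: use maxSequenceCount */
    .maxDrawCount = max_draws,
};

/* VK_FALSE: no separate vkCmdPreprocessGeneratedCommandsEXT() call;
 * the driver preprocesses implicitly as part of execution. */
vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE, &info);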
Now here’s where the crazy stuff starts.
Changing shaders from shaders
EXT DGC adds the ability to change shaders from shaders. By creating an Indirect Execution Set, multiple sets of shaders can be bundled together and indexed into from within shaders. dgcgears uses a different vertex shader to draw each gear.
While the NV extension had this functionality, EXT takes it further, enabling it to be supported on all hardware.
Shader Objects: fully supported
Another big feature of EXT DGC is that it is agnostic to pipelines vs shader objects vs whatever new stuff comes out in the future. If you prefer one over the other, you’re free to go ahead and use that.
VKD3D-proton: supported
I’ve already written the code, and it should land at some point.
Drivers: supported
ANV
Lavapipe
NVIDIA
NVK
RADV
Turnip
other drivers soon
3
Device. Generated. Commands.
Count ‘em.
Rejection
It’s hard. Nobody likes that feeling, especially after putting in a bunch of work, double-especially when that work is on a Wayland protocol.
That’s right, the target of today’s wayland-protocols governance update: NACKs.
A NACK is intended to mean something like:
this idea does not belong in wayland-protocols for [technical reason]
It’s supposed to be the last resort when all other alternatives and gentler nudges have been exhausted.
There’s been a lot of confusion over this concept over the years, specifically along the lines of:
Who can actually NACK?
When can NACKs be used?
What’s stopping my protocol from being NACKed?
I’m glad you asked.
Definition
I’ve put up a comprehensive proposal to reform and define the NACK. The short of it is:
Only people in this file can NACK a protocol
NACKs can only be used for extreme circumstances to block a protocol which does not belong in wayland-protocols
NACKs now carry consequences if they are used improperly, including the potential removal of anyone using them improperly
This should cover all the basic cases. It’s important to remember that a NACK can always be removed, which is to say that there’s always room for discussion in Open Source.
If you’re considering submitting a protocol proposal, don’t worry too much about this! A NACK won’t ever be the first thing you see, and you’ll have ample time and room to discuss your ideas before anyone even considers bringing it up.
Hey everyone!
The 2024 Linux Display Next
hackfest concluded
in May, and its outcomes continue to shape the Linux Display stack. Igalia
hosted this year’s event in A Coruña, Spain, bringing together leading experts
in the field. Samuel Iglesias and I
organized this year’s edition and this blog post summarizes the experience and
its fruits.
One of the highlights of this year’s hackfest was the wide range of backgrounds
represented by our 40 participants (both on-site and remotely). Developers and
experts from various companies and open-source projects came together to
advance the Linux Display ecosystem. You can find the list of participants
here.
The event covered a broad spectrum of topics affecting the development of Linux
projects, user experiences, and the future of display technologies on Linux.
From cutting-edge topics to long-term discussions, you can check the event
agenda
here.
Organization Highlights
The hackfest was marked by in-depth discussions and knowledge sharing among
Linux contributors, making everyone inspired, informed, and connected to the
community. Building on feedback from the previous year, we refined the
unconference format to enhance participant preparation and engagement.
Structured Agenda and Timeboxes: Each session had a defined scope, time
limit (1h20 or 2h10), and began with an introductory talk on the topic.
Participant-Led Discussions: We pre-selected in-person participants to
lead discussions, allowing them to prepare introductions, resources, and
scope.
Transparent Scheduling: The schedule was shared in advance as GitHub
issues, encouraging participants to review and prepare for sessions of
interest.
Engaging Sessions: The hackfest featured a variety of topics, including
presentations and discussions on how participants were addressing specific
subjects within their companies.
No Breakout Rooms, No Overlaps: All participants chose to attend all
sessions, eliminating the need for separate breakout rooms. We also adapted the
run-time schedule to keep everybody involved in the same topics.
Real-time Updates: We provided notifications and updates through
dedicated emails and the event matrix room.
Strengthening Community Connections: The hackfest offered ample
opportunities for networking among attendees.
Social Events: Igalia sponsored coffee breaks, lunches, and a dinner at
a local restaurant.
Museum Visit: Participants enjoyed a sponsored visit to the Museum of
Estrela Galicia Beer (MEGA).
Fruitful Discussions and Follow-up
The structured agenda and breaks allowed us to cover multiple topics during the
hackfest. These discussions have led to new display feature development and
improvements, as evidenced by patches, merge requests, and implementations in
project repositories and mailing lists.
With the KMS color management API taking shape, we discussed refinements and
best approaches to cover the variety of color pipelines from different
hardware vendors. We are also investigating techniques for performant
SDR<->HDR content reproduction and for reducing latency and power consumption
when using the color blocks of the hardware.
Color Management/HDR
Color Management and HDR continued to be the hottest topic of the hackfest. We
had three sessions dedicated to discussing Color and HDR across Linux Display
stack layers.
Color/HDR (Kernel-Level)
Harry Wentland (AMD) led this session.
Here, kernel developers shared the color management pipelines of AMD, Intel and
NVidia. We had diagrams and explanations from HW-vendor developers
who discussed differences, constraints and paths to fit them into the KMS
generic color management properties, such as advertising modeset needs,
IN_FORMATS, segmented LUTs,
interpolation types, etc. Developers from Qualcomm and ARM also added
information regarding their hardware.
Upstream work related to this session:
KMS color management properties (new version - v5);
IGT Tests;
drm_info draft support of v4 DRM/KMS plane color properties;
gamescope draft support of v4 DRM/KMS plane color properties;
Kwin WIP implementation of DRM/KMS plane color properties.
Color/HDR (Compositor-Level)
Sebastian Wick (RedHat) led this session.
It started with Sebastian’s presentation covering Wayland color protocols and
compositor implementation, followed by an explanation of the APIs provided by
Wayland and how they can be used to achieve better color management for
applications, plus discussions around ICC profiles and color representation metadata. There was
also an intensive Q&A about LittleCMS with Marti Maria.
Upstream work related to this session:
Wayland color management protocol;
Wayland color representation protocol;
HDR support merged on Mutter;
Color management protocol on Mutter;
Color management protocol on GTK.
Color/HDR (Use Cases and Testing)
Christopher Cameron (Google) and Melissa Wen (Igalia) led this session.
In contrast to the other sessions, here we focused less on implementation and
more on brainstorming and reflections on real-world SDR and HDR transformations
(use and validation) and gainmaps. Christopher gave a nice presentation
explaining HDR gainmap images and how we should think of HDR. This presentation
and Q&A were important to put participants on the same page about how to
transition between SDR and HDR and how to somehow “emulate” HDR.
We also discussed the usage of a kernel background color property.
Finally, we discussed a bit about
Chamelium
and the future of VKMS (future work and maintainership).
Power Savings vs Color/Latency
Mario Limonciello (AMD) led this session.
Mario gave an introductory presentation about AMD ABM (adaptive backlight
management) that is similar to Intel DPST. After some discussions, we agreed on
exposing a kernel property for power saving policy. This work was already
merged on kernel and the userspace support is under development.
Upstream work related to this session:
Kernel series: Add support for ‘power saving policy’ property (merged)
Mutter: issue: support for “power saving policy” property
Kwin: MR Draft: backends/drm: add support for the “power saving policy” property
Strategy for video and gaming use-cases
Leo Li (AMD) led this session.
Miguel Casas (Google) started this session with a presentation of Overlays in
Chrome/OS Video, explaining the main goal of power saving by switching off GPU
for accelerated compositing and the challenges of different colorspace/HDR for
video on Linux.
Then Leo Li presented different strategies for video and gaming and we
discussed the userspace need for more detailed feedback mechanisms to understand
failures when offloading. Also, creating a debugFS interface came up as a tool
for debugging and analysis.
Real-time scheduling and async KMS API
Xaver Hugl (KDE/BlueSystems) led this session.
Compositor developers have raised some issues with doing real-time scheduling
and async page flips. One is that the kernel limits the lifetime of realtime
threads and if a modeset takes too long, the thread will be killed and thus the
compositor as well. Also, simple page flips take longer than expected and
drivers should optimize them.
Another issue is the lack of feedback to compositors about hardware programming
time and commit deadlines (the latest possible time to commit). This is
difficult to predict from drivers, since it varies greatly with the type of
properties. For example, color management updates take much longer.
In this regard, we discussed implementing a hw_done callback to timestamp
when the hardware programming of the last atomic commit is complete, and an
API to pre-program the color pipeline in a kind of A/B scheme. It may not be
supported by all drivers, but might be useful in different ways.
VRR/Frame Limit, Display Mux, Display Control, and more… and beer
We also had sessions to discuss a new KMS API to mitigate headaches on VRR
and Frame Limit, such as different brightness levels at different refresh rates,
abrupt changes of refresh rate, low frame rate compensation (LFC) and precise
timing in VRR mode.
On Display Control we discussed features missing in the current KMS
interface for HDR mode, atomic backlight settings, source-based tone mapping,
etc. We also discussed the need for a place where compositor developers can post
TODOs to be developed by KMS people.
The Content-adaptive Scaling and Sharpening session focused on sharpening
and scaling filters. In the Display Mux session, we discussed proposals to
expose the capability of dynamic mux switching display signal between discrete
and integrated GPUs.
In the last session of the 2024 Display Next Hackfest, participants
representing different compositors summarized current and future work and built
a Linux Display “wish list”, which includes: improvements to VTTY and HDR
switching, better dmabuf API for multi-GPU support, definition of tone mapping,
blending and scaling semantics, and Wayland protocols for advertising to clients
which colorspaces are supported.
We closed this session with a status update on feature development by
compositors, including but not limited to: plane offloading (from libcamera to
output) / HDR video offloading (dma-heaps) / plane-based scrolling for web
pages, color management / HDR / ICC profiles support, addressing issues such as
flickering when color primaries don’t match, etc.
After three days of intensive discussions, all in-person participants went to a
guided tour at the Museum of Estrela Galicia Beer (MEGA), pouring and tasting
the most famous local beer.
Feedback and Future Directions
Participants provided valuable feedback on the hackfest, including suggestions
for future improvements.
Schedule and Break-time Setup: Having a pre-defined agenda and schedule
provided a better balance between long discussions and mental refreshments,
preventing the fatigue caused by endless discussions.
Action Points: Some participants recommended explicitly asking for action
points at the end of each session and assigning people to follow-up tasks.
Remote Participation: Remote attendees appreciated the inclusive setup
and opportunities to actively participate in discussions.
Technical Challenges: There were bandwidth and video streaming issues
during some sessions due to the large number of participants.
Thank you for joining the 2024 Display Next Hackfest
We can’t help but thank the 40 participants, who engaged in person or virtually
in relevant discussions, for a collaborative evolution of the Linux display
stack and for building an insightful agenda.
A big thank you to the leaders and presenters of the nine sessions: Christopher
Cameron (Google), Harry Wentland (AMD), Leo Li (AMD), Mario Limonciello (AMD),
Sebastian Wick (RedHat) and Xaver Hugl (KDE/BlueSystems) for the effort in
preparing the sessions, explaining the topic and guiding discussions. My
acknowledgments to the other in-person participants who made such an effort to
travel to A Coruña: Alex Goins (NVIDIA), David Turner (Raspberry Pi), Georges
Stavracas (Igalia), Joan Torres (SUSE), Liviu Dudau (Arm), Louis Chauvet
(Bootlin), Robert Mader (Collabora), Tian Mengge (GravityXR), Victor Jaquez
(Igalia) and Victoria Brekenfeld (System76). It was an awesome opportunity to
meet you and chat face-to-face.
Finally, thanks to the virtual participants who couldn’t make it in person but
organized their days to actively participate in each discussion, adding
different perspectives and valuable inputs even remotely: Abhinav Kumar
(Qualcomm), Chaitanya Borah (Intel), Christopher Braga (Qualcomm), Dor Askayo,
Jiri Koten (RedHat), Jonas Ådahl (Red Hat), Leandro Ribeiro
(Collabora), Marti Maria (Little CMS), Marijn Suijten, Mario Kleiner, Martin
Stransky (Red Hat), Michel Dänzer (Red Hat), Miguel Casas-Sanchez (Google),
Mitulkumar Golani (Intel), Naveen Kumar (Intel), Niels De Graef (Red Hat),
Pekka Paalanen (Collabora), Pichika Uday Kiran (AMD), Shashank Sharma (AMD),
Sriharsha PV (AMD), Simon Ser, Uma Shankar (Intel) and Vikas Korjani (AMD).
We look forward to another successful Display Next hackfest, continuing to
drive innovation and improvement in the Linux display ecosystem!
I <3 Open Source
That should be obvious by now, right? I’ve been out here blogging about Open Source stuff for over a decade, and occasionally I still have time to actually write code.
Haha. But seriously.
I believe in Open Source. I believe in the collaborative development model, the teamwork, the shared vision of contributing to projects that can stand up to and even surpass big-name proprietary products.
I believe in getting shit done. I believe in discussion, in deliberation, in review, but at the end of the day I also believe that people need to be realistic and accept compromises rather than dying on every fucking hill of a gitlab review comment. I believe in it enough that I’m sleeping five or fewer hours a night this week.
And believe it or not, I’m also a Wayland developer.
Conflict: Just Part of the Process
I work for Valve. I’ve blogged about what it’s like working for Valve, and nothing has changed since then. It’s still great, and Rhys Perry (not you, the other one) is still an unsung hero.
As I mentioned in that post, one of the great things about my job is the freedom to improve projects without managerial overhead. To self-determine. To see the goal in the distance and get there in a way that suits me.
I’m not the only person working for Valve, and I’m not the only one who enjoys this freedom. By now, everyone has seen the latest happenings with regards to the efforts to improve Wayland display mechanisms around refresh rates and swapchain-related stuttering. I sympathize with the frustration radiating out of this effort. I see myself in it; this is someone who saw a problem, saw a goal in the distance, and chose a path forward that suited them.
I don’t believe there was malice or ill-will involved here, just growing frustration at a long-standing problem that was defying efforts to resolve it. I’ve been there. We’ve all been there. Everyone has had at least some Open Source interaction where a review has stalled, or gotten heated, or failed to progress in finite time for some reason.
It happens to be the case that wayland-protocols, as a project, has these sorts of interactions more often than other projects. Again, I don’t believe there is malice or ill-will involved. Maybe I’m optimistic, or an idealist, or whatever, but I believe everyone participating in Wayland development wants it to succeed. Maybe they tunnel-vision too much, maybe they get sidetracked and disappear, maybe they get heated because WHY CAN’T YOU JUST UNDERSTAND THAT I KNOW WHTA I’M TALKNG ABOUT??!!1, maybe they post long walls of text that nobody really wants to read but inevitably you know there’s some kind of a point hidden somewhere in there so you gotta just reach in and tweeze it out with your fingers like it’s stuck between the cushions of your sofa so you can tell them what an idiot they are, etc. Again, we’ve all been there.
This is how Open Source works, though. This is how the sausage is made. It’s not always pretty. It’s not always easy. It’s not free as in freedom or beer, it’s free as in puppies: sometimes they just do what they want, and you gotta be okay with that.
Resolutions
Open Source has disagreements, and I’m okay with that. I still believe that at the end of the day, everyone is capable of stepping back, seeing the goal, and arriving at some sort of agreement that gets us all there together.
It’s for that reason I’ve proposed some updates to wayland-protocols governance to address some of the systemic issues that have persisted since before there was official ‘governance’. I agree that the system is a little broken, but I don’t think the solution is to throw out the whole system.
We can’t just throw away puppies. Well, we can, but it’s complicated and people might get the wrong idea. We gotta have a real good reason to throw those puppies out on the street. If we don’t at least say we tried to make it work, to get them to stop chewing on our shoes, and peeing on the sofa, and eating the sandwiches we forget about because we gotta go back and argue with those idiots in Mesa who can’t understand the brilliance of our latest idea to ship OpenGL ray-tracing–if we can’t at least show that we cannot fundamentally coexist with these puppies, then some people might think we aren’t capable of being good dog owners. They might wonder whether their puppies are safe around us, or worse: whether their GPUs are.
The problems facing wayland-protocols are many, and my blogging time is (somehow) not infinite. In short:
new stuff hard
stable/ stuff harder
The first problem is more tractable, and it’s the one causing the most frustration at present, so that’s where I started. An inability to land new protocols, even just for broader development purposes, hurts the ecosystem, both by stifling innovation and frustrating would-be contributors. Many solutions have been proposed over the years, though few have been official. There was one from @emersion last year that sort of almost gained traction but then also wasn’t quite what people wanted. There was also…
That’s it, actually. That’s the only time an official proposal has been made to change the governance in a substantial way. Which is to say that though everyone knew and acknowledged problems exist, nobody (else) took the action to try solving it, as plainly spelled out by section 4.1 of the governance model document.
It’s Open Source all the way down: Just create a merge request*.
* Yes, I know it says only members can propose changes, but surely someone could have wrangled one of the members and drafted something? Surely the governance members would react positively to a good-faith proposal made by a non-member? We don’t have to act like every other person on the planet is out to get us, do we?
To be clear, I’m not blaming anyone.
I get it.
Complaining about hard problems is way easier than fixing them. I complain a lot too. That’s why I have this blog, where I can complain about spaghetti, bad memes, and drivers that are faster than mine (er, that have yet to be surpassed)—you know what SGC is about by now.
At the end of the day, however, sometimes this is what peak Open Source performance looks like:
Hi!
Once again, this status update will be rather short due to limited time
bandwidth. I hope to be able to allocate a bit more time slots for my
open-source projects next month.
We’re getting closer to a new Sway release (fingers crossed), with lots of help
from Kenny and Alexander to iron out the remaining bugs. We’ve just shipped
wlroots 0.18.1 today (thanks to Simon Zeni for leading the backporting
efforts!). I’ve been expanding wlroots’ explicit synchronization support by
adapting our multi-GPU logic, the Vulkan renderer and the libliftoff backend.
I’ve released wayland 1.23.1 with some Clang and wayland-scanner fixes.
I’ve ported the cage kiosk compositor to wlroots 0.18. Last but not least,
I’ve rewritten makoctl in C because shell scripts only get you so far.
I’ve been giving feedback and contributing to KDE’s SVG cursor spec. The
cursor theme landscape isn’t in a great spot at the moment, because we’re stuck
with XCursor images. Now that the cursor-shape protocol is gaining adoption
there is an opportunity to more easily switch the underlying image format.
Thanks to KDE folks for pushing this forward! I’d really like to see the spec
standardized under the freedesktop.org umbrella.
delthas has been contributing some nifty new features to soju: admins can now
configure per-user network count limits, can now impersonate a user via SASL,
and the file upload endpoint now sends back an error early when the file is too
large. soju 0.8.2 has been released with a bunch of bug fixes.
The NPotM is varlinkgen
(better name TBD). It’s a Varlink C library and code generator. If you’ve
been following my projects for a while, you probably
know how much
I love code generators producing type-safe APIs from schemas.
I must admit, I appreciate Varlink’s simplicity and lack of central bus. I plan
to use varlinkgen in kanshi, and maybe other daemons in need of an IPC.
See you next month!
The New Reality
I’ve been traveling a bit lately, but Mesa has reached an important landmark that I wanted to broadcast to the three users out there who have been waiting years for us to reach this milestone:
Mesa now* supports all the GL OVR extensions.
You read that correctly.
Nearly a decade after the extensions were drafted and half that since everyone in VR-land moved to Vulkan, Mesa will now support all three of the people who still write VR applications in GL.
Big thanks to Marek for doing the GLSL heavy lifting and whoever eventually rubber stamps the followup which adds the last extension which is definitely being used.
* Only zink users are cool enough to withstand the awesome power of these extensions.
Recently there have been a number of reports (bug 2183743, bug 2276698, bug 2283839, bug 2312355) about the plymouth boot splash not showing properly on PCs using AMD GPUs.
The problem with plymouth and AMD GPUs is that the amdgpu driver is a really really big driver, which easily takes up to 10 seconds to load on older PCs. The delay caused by this may cause plymouth to time out while waiting for the GPU to be initialized, causing it to fall back to the 3-dot text-mode boot splash.
There are 2 workarounds for this depending on the PC's configuration:
1. With older AMD GPUs the radeon driver is actually used to drive the GPU, but even though it is unused the amdgpu driver still loads, slowing things down.
To check if this is the case for your PC, start a terminal in a graphical login session and run: "lsmod | grep -E '^radeon|^amdgpu'". This will output something like this:
amdgpu 17829888 0
radeon 2371584 37
The second number after each driver is the usage count. As you can see in this example the amdgpu driver is not used. In this case you can disable the loading of the amdgpu driver by adding "modprobe.blacklist=amdgpu" to your kernel commandline:
sudo grubby --update-kernel=ALL --args="modprobe.blacklist=amdgpu"
2. If the amdgpu driver is actually used on your PC, then plymouth not showing can be worked around by telling plymouth to use the simpledrm drm/kms device created from the EFI framebuffer early on boot, rather than waiting for the real GPU driver to load. Note this depends on your PC booting in EFI mode. To do this run:
sudo grubby --update-kernel=ALL --args="plymouth.use-simpledrm"
After using one of these workarounds plymouth should show normally again on boot (and booting should be a bit faster).
What Am I Even Doing
It was some time ago that I created my first MR touching WSI stuff.
That was also the first time I broke Mesa.
Did I learn anything?
The answer is no, but then again it would have to be given the topic of this sleep-deprived post.
Maybe I’m The Problem
WSI has a lot of issues, but most of them stem from its mere existence. If people stopped wanting to see the triangles, we would all have easier lives and performance would go through the fucking roof. That’s ignoring the raw sweat and verbiage dedicated to insane ideas like determining the precise time at which the triangles should be made visible on a display or literally how are colors.
I’m nowhere near as smart as the people arguing about these things: I’m the guy who plays jenga with the tower constructed from popsicle sticks, marshmallow fluff, and wishful thinking. That’s why, a while ago, I declared war on DRI interfaces and then also definitely won that war without any issues. In fact, it works perfectly.
But why did I embark upon this journey which required absolutely no fixups?
The answer lies in architecture. In the before-times, DRI (a massively overloaded acronym that no longer means anything) allowed Xorg to plug directly into Mesa to utilize hardware-accelerated rendering. It sidestepped the GL API in favor of a contract with Mesa that certain API would never change. And that was great for Xorg since it provided an optimal path to do xserver stuff. But it was (eventually) terrible for Mesa.
Renegotiation
When an API contract is made, it remains binding forever. A case when the contract is broken is called a Bug Report. Mesa has no bugs, however, except for the ones I didn’t cause, and so this DRI contract that enables Xorg to shortcut more sensible APIs like EGL remains identical to this day, decades later. What is not identical, however, is Mesa.
In those intervening years, Mesa has developed into an entire ecosystem for driver development and other, less sane ideas. Gallium was created and then became the only method for implementing GL drivers. EGL and GBM are things now. But still, that DRI contract remains binding. Xorg must work. Like that one reviewer who will suggest changes for every minuscule flaw in your stupid, idiotic, uneducated, cretinous abuse of whitespace, it is not going away.
DRIL was the method by which Mesa could finally unshackle itself. The only parts of DRI still used by Xorg are for determining rendertarget capabilities, effectively eglGetConfigs. So @ajax and I punted out the relevant API into a stub which is mostly just a wrapper around eglGetConfigs. This enabled change and cleanup in every part of the codebase that was previously immutable.
Bidirectional Hell
As anyone who has tried to debug Mesa’s DRI frontend knows, it sucks. It’s one of the worst pieces of code to debug. A significant reason for this is (was) how the DRI callback system perpetuated circular architecture.
At the time of DRIL’s merge, a user of GLX/EGL/GBM would engage with this sort of control flow:
GLX/EGL/GBM API call
direct API internals
function pointer into gallium/frontends/dri
DRI frontend
function pointer back to GLX/EGL/GBM
<loop back to 2 until operation completes>
return to user
In terms of functionality, it was functional. But debugging at a glance was impossible, and trying to eyeball any execution path required the type of PhD held by fewer than five people globally. The cyclical back-and-forth function pointering was a vertical cliff of a learning curve for anyone who didn’t already know how things worked, and even things as “simple” as eglInitialize went through several impenetrable cycles of idiot-looping to determine success or failure. The absolute state of it made adding new features a nightmarish and daunting prospect, and reviewing any changes had, at best, even odds of breaking things because of how difficult it is to test this stuff.
Better Now?
Maybe.
The juiciest refactoring is over, and now function pointering only occurs when the DRI frontend needs to access API-specific data for its drawables. It’s actually possible to follow execution just by reading the code. Not that it’s necessarily easy, but it’s possible.
There’s still a lot of work to be done here. There are still some corner-case bugs with DRIL, there are probably EGL issues that have yet to be discovered because much of that code is still fairly opaque, and half the codebase is still prefixed with dri2_.
At the least, I think it’s now possible to work on WSI in Mesa and have some idea what’s going on. Or maybe I’ve just been down in the abyss for so long that I’m the one staring back.
Onward
I’ve been cooking. I mean like really cooking. Expect big things related to the number 3 later this month.
* UPDATE: At the urging of my legal team, I’ve been advised to mention that no part of this post, blog, or site has any association with, bearing on, or endorsement from Half Life 3.
Introduction #
The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.
The general problem statement we are trying to address is the fact that many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients, and that there are use-cases where having external control over scheduling decisions would be beneficial.
But first, to clarify what we mean by “external control”. By that term we refer to the scheduling decisions being influenced from outside of the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself (“internal control”), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller (“external control”).
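As a minimal sketch of the distinction (my illustration, not part of the proposal), internal control is the process adjusting itself, which with the libc crate could look like this:

// Internal control: the calling process lowers its own CPU scheduling
// priority. Sketch only; error handling omitted.
fn main() {
    unsafe {
        // PRIO_PROCESS with who == 0 targets the calling process.
        libc::setpriority(libc::PRIO_PROCESS as _, 0, 10);
    }
    // ... carry on with lower-priority background work ...
}

External control would be another party running, say, renice(1) against the same process, with no cooperation from the code above.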
This has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background, isolated as much as possible from starving the foreground tasks of resources. Secondly, external control can be applied to any process in a unified manner, without the need for applications to individually expose the means to control their scheduling priority.
If we now return to the world of GPU scheduling, we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but external control is not. Improving on that poses some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are, but if we can come up with something reasonable, even if not perfect, it could bring improvements to the user experience in a variety of scenarios.
Past attempts - Priority based controllers #
The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling, and it is easy to understand its effects, the proposal did not gain much traction.
Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.
The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.
Current proposal - Weight based controller #
Anticipating the potential lack of sufficient support for this approach, the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight based controller based on GPU utilisation[3].
The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.
The user interface completely mimics the existing CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.
Visually one potential cgroup configuration could look like this:
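root
├── compositor (drm.weight=200)
└── desktop (drm.weight=100)
    ├── browser (drm.weight=100)
    └── game (drm.weight=10)

(A sketch reconstructed from the example discussed below; group names are illustrative.)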
The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.
If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:
Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.
The compositor is not using its full 200 / (200 + 100) allocation, so a portion is passed on to the desktop group, up to the full 80% it requires. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. It is therefore consuming more than its budget, so the corresponding DRM driver will be notified by the controller and will be able to do something about it.
After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.
First driver implementation - i915 #
Back when I started developing this idea Intel GPU’s were my main focus, which is why i915 was the first driver I wired up with the controller.
There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts, by an amount proportional to how much the client is over budget in relative terms.
The implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But let’s leave that for later.)
One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.
We can see that, when a client’s group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.
This is a suitable moment to mention that the DRM cgroup controller does not promise perfect control, that is, achieving the exact GPU sharing ratios expressed by the group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things - to do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed do not preclude individual drivers from doing as good a job as they can, or even the future possibility of replacing the controller’s inner workings with something smarter, with no need to change the user space control interface.
Going back to the initial i915 implementation, the second test I did was attempting to wire it up with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point, after both clients were running, I lowered the weight of the background game, and in the screenshot below we can see how the FPS metric in the browser jumped up.
This illustrates how having the controller can indeed improve the user experience. The user’s focus will be on the foreground window, and therefore it does make sense to prioritise GPU access for that client, for better interactivity and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps, meaning that throttling the background client allowed the foreground one to match its rendering to the display’s refresh rate.
Second implementation - amdgpu #
AMD’s kernel module was the next interesting driver which I wired up with the controller.
The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of “most offending” clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.
To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.
Let us first have a look at how the glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. The scheduling weight ratio is expressed as glxgears:glmark2, i.e. 10:1 means the glxgears scheduling weight was ten times that configured for glmark2.
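(For illustration, and not from the original test scripts: with the proposed drm.weight interface, setting up such a 10:1 split amounts to writing to the cgroup hierarchy, here sketched in Rust with hypothetical group paths and weights.)

use std::fs;

fn main() -> std::io::Result<()> {
    // Create two sibling cgroups and assign relative DRM weights (10:1).
    for (group, weight) in [("glxgears", "1000"), ("glmark2", "100")] {
        let dir = format!("/sys/fs/cgroup/{group}");
        fs::create_dir_all(&dir)?;
        fs::write(format!("{dir}/drm.weight"), weight)?;
    }
    // Each benchmark process would then be moved into its group by writing
    // its PID to the group's cgroup.procs file before the run.
    Ok(())
}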
We can observe that, as glmark2 progresses through its various sub-benchmarks, the glxgears frame rate changes too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result, showing that even a simple implementation seems to have the desired effect, at least to some extent.
For the second test we can look from the perspective of glmark2, checking how the benchmark scores change depending on the ratio of scheduling weights.
Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.
However, in neither case is the change in the result proportional to the actual ratios. This is because the primitive implementation is not able to precisely limit the “background” client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react, given that the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.
Conclusion #
Hopefully this write-up has managed to demonstrate two main points:
First, that a generic and driver-agnostic approach to a DRM scheduling cgroup controller can improve user experience and enable new use cases, while at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extendable;
Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.
It also probably needs to be re-iterated that neither the driver implementations nor the cgroup controller implementation itself are limited by the proposed user interface. Both could be independently improved under the hood in the future.
What is next? There is more work to be done, such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community is needed too.
References #
[1] https://lore.kernel.org/dri-devel/20180120015141.10118-1-matthew.d.roper@intel.com/
[2] https://lore.kernel.org/lkml/20221019173254.3361334-1-tvrtko.ursulin@linux.intel.com/
[3] https://lore.kernel.org/lkml/ZVE3shwiRbUQyAqs@mtj.duckdns.org/T/
There's been a couple of mentions of Rust4Linux in the past week or two, one from Linus on the speed of engagement and one about Wedson departing the project due to non-technical concerns. This got me thinking about project phases and developer types.
Archetypes:
I will regret making an analogy in an area I have no experience in, but let's give it a go with a road-building analogy. Let's sort developers into 3 rough categories. Let's preface by saying not all developers fit in a single category throughout their careers, and some developers can do different roles on different projects, or on the same project simultaneously.
1. Wayfinders/Mapmakers
I want to go build a hotel somewhere but there exists no map or path. I need to travel through a bunch of mountains, valleys, rivers, weather, animals, friendly humans, antagonistic humans and some unknowns. I don't care deeply about them, I want to make a path to where I want to go. I hit a roadblock, I don't focus on it, I get around it by any means necessary and move onto the next one. I document the route by leaving maps and signs. I build a hotel at the end.
2. Road builders
I see the hotel and path someone has marked out. I foresee that larger volumes will want to traverse this path and build more hotels. The roadblocks the initial finder worked around, I have to engage with. I engage with each roadblock differently. I build a bridge, dig a tunnel, blow up some stuff, work with/against humans, whatever is necessary to get a road built to the place the wayfinder built the hotel. I work on each roadblock until I can open the road to traffic. I can open it in stages, but it needs a completed road.
3. Road maintainers
I've got a road, I may have built the road initially. I may no longer build new roads. I've no real interest in hotels. I deal with intersections with other roads controlled by other people, I interact with builders who want to add new intersections for new roads and remove old intersections for old roads. I fill in the holes, improve safety standards, handle the odd wayfinder wandering across my 8 lanes.
Interactions:
Wayfinders and maintainers is the most difficult interaction. Wayfinders like to move freely and quickly, maintainers have other priorities that slow them down. I believe there needs to be road builders engaged between the wayfinders and maintainers.
Road builders have to be willing to expend the extra time to resolve roadblocks in the best way possible for all parties. The time it takes to resolve a single roadblock may be greater than the time expended on the whole wayfinding expedition, and this frustrates wayfinders. The builder has to understand what the maintainers' concerns are and where they come from, and why the wayfinder made certain decisions. They work via education and trust building to get them aligned to move past the block. They then move down the road and repeat this process until the road is open. How this is done might change depending on the type of maintainers.
Maintainer types:
Maintainers can fall into a few different groups on a per-new-road basis, and how road builders deal with existing road maintainers depends on where they are for this particular intersection:
1. Positive and engaged
Aligned with the goal of the road, want to help out, design intersections, help build more roads and more intersections. Will often have helped wayfinders out.
2. Positive with real concerns
Agrees with the road's direction, might not like some of the intersections, willing to be educated and give feedback on newer intersection designs. Moves to group 1 or trusts that others are willing to maintain intersections on their road.
3. Negative with real concerns
Don't agree fully with the road's direction or choice of building material. Might have some resistance to changing intersections, but may believe in a bigger picture so won't actively block. Hopefully can move to 1 or 2 with education and trust building.
4. Negative and unwilling
Don't agree with the goal, don't want the intersection built, won't trust anyone else to care about their road enough. Education and trust building is a lot more work here, and often it's best to leave these intersections until later, where they may be swayed by other maintainers having built their intersections. It might be possible to build a reduced intersection, but if they are a major enough roadblock in a very busy road, then a higher authority might need to be brought in.
5. Don't care/Disengaged
Doesn't care where your road goes and won't talk about intersections. This category often just needs to be told that someone else will care about it and they will step out of the way. If they are active blocks or refuse interaction, then again a higher authority needs to be brought in.
Where are we now?
I think the r4l project has had a lot of excellent wayfinding done, has a lot of wayfinding in progress and probably has a bunch of future wayfinding to do. There are some nice hotels built. However now we need to build the roads to them so others can build hotels.
To the higher authority, the road building process can look slow. They may expect cars to be driving on the road already, and they see roadblocks from a different perspective. A roadblock might look smaller to them but have a lot of fine details, or a large roadblock might be worked through quickly once it's engaged with.
For the wayfinders, the process of interacting with maintainers is frustrating and slow, and they don't enjoy it as much as wayfinding, and because they still only care about the hotel at the end, when a maintainer gets into the details of their particular intersection they don't want to do anything but go stay in their hotel.
The road will get built, it will get traffic on it. There will be tunnels where we should have intersections, there will be bridges that need to be built from both sides, but I do think it will get built.
I think my request from this is that contributors should try and identify the archetype they currently resonate with and find the next group over to interact with.
For wayfinders, it's fine to just keep wayfinding, just don't be surprised when the road building takes longer, or the road that gets built isn't what you envisaged.
For road builders, just keep building, find new techniques for bridging gaps and blowing stuff up when appropriate. Figure out when to use higher authorities. Take the high road, and focus on the big picture.
For maintainers, try and keep up with modern road building, don't say 20-year-old roads are the pinnacle of innovation. Be willing to install the rumble strips, widen the lanes, add crash guardrails and truck safety offramps. Understand that wayfinders show you opportunities for longer-term success and that road builders are going to keep building the road, and the result is better if you engage positively with them.
Hi!
After months of bikeshedding finishing touches we’ve finally merged
ext-image-capture-source-v1 and ext-image-copy-capture-v1 in
wayland-protocols! These two new protocols supersede the old wlr-screencopy-v1
protocol. They unlock some nice features such as toplevel and cursor capture,
as well as improved damage tracking. Thanks a lot to Andri Yngvason! He’s
written a blog post about the new protocols with more details. The
wlroots MR doesn’t have toplevel capture implemented yet, but that’s next on
the TODO list.
In other Wayland news, we’ve merged full support for
explicit synchronization in wlroots. This generally results in a
better system architecture than implicit synchronization, reduces
over-synchronization for complicated pipelines, and makes wlroots work
correctly with drivers lacking implicit synchronization support (e.g. NVIDIA).
Alexander has implemented automatic X11 surface restacking in
wlroots’ scene-graph. That way, all scene-graph compositors get proper X11
stack handling for free (Sway’s implementation was buggy). This should fix
issues where the X11 server and the compositor don’t have the same idea of the
relative ordering of surfaces, resulting in clicks going “through” windows or
reaching invisible windows.
Ricardo Steijn has contributed Sway support for tearing-control-v1.
This allows users to opt-in to immediate page-flips which don’t wait for the
vertical sync point (VSync) to program new frames into the hardware. For
tearing to be enabled, two conditions need to be fulfilled: tearing needs to
be enabled per-output via the output allow_tearing command, and tearing
needs to be enabled per-application either via the tearing-control-v1 Wayland
protocol or manually via the window allow_tearing command. I’ve also pushed
kernel patches from André Almeida and me to fix a few bugs around tearing
page-flips with the atomic KMS API, so once these land forcing the legacy KMS
API shouldn’t be necessary anymore.
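As a concrete illustration (my example, using a hypothetical output name and app_id), the configuration side of this could be as simple as output DP-1 allow_tearing yes in the Sway config, combined with a per-application rule like for_window [app_id="gamescope"] allow_tearing yes.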
drm_info v2.7.0 has been released with a few new features and cleanups.
Support for DRM_CLIENT_CAP_CURSOR_PLANE_HOTSPOT and
DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP has been added, and a new flag has been
introduced to display information from a JSON dump.
Last, I’ve released a new version of go-maildir with a brand new API. Instead
of referring to messages by their Maildir key and fishing back their full
filename on each operation, the API exposes a Message type. It should be much
nicer to use than the previous one.
That’s all for August, see you next month!
The Freedesktop.org Specifications directory contains a list of common specifications that have accumulated over the decades and define how common desktop environment functionality works. The specifications are designed to increase interoperability between desktops. Common specifications make the life of both desktop-environment developers and especially application developers (who will almost always want to maximize the number of Linux DEs their app can run on and behave as expected in, to increase their app’s target audience) a lot easier.
Unfortunately, building the HTML specifications and maintaining the directory of available specs has become a bit of a difficult chore, as the pipeline for building the site has become fairly old and unmaintained (parts of it still depended on Python 2). In order to make my life of maintaining this part of Freedesktop easier, I aimed to carefully modernize the website. I do have bigger plans to maybe eventually restructure the site to make it easier to navigate and not just a plain alphabetical list of specifications, and to integrate it with the Wiki, but in the interest of backwards compatibility and to get anything done in time (rather than taking on a mega-project that can’t be finished), I decided to just do the minimum modernization first to get a viable website, and do the rest later.
So, long story short: Most Freedesktop specs are written in DocBook XML. Some were plain HTML documents, some were DocBook SGML, a few were plaintext files. To make things easier to maintain, almost every specification is written in DocBook now. This also simplifies the review process and we may be able to switch to something else like AsciiDoc later if we want to. Of course, one could have switched to something else than DocBook, but that would have been a much bigger chore with a lot more broken links, and I did not want this to become an even bigger project than it already was and keep its scope somewhat narrow.
DocBook is a markup language for documentation which has been around for a very long time, and therefore has older tooling around it. But fortunately our friends at openSUSE created DAPS (DocBook Authoring and Publishing Suite) as a modern way to render DocBook documents to HTML and other file formats. DAPS is now used to generate all Freedesktop specifications on our website. The website index and the specification revisions are also now defined in structured TOML files, to make them easier to read and to extend. A bunch of specifications that had been missing from the original website are also added to the index and rendered on the website now.
Originally, I wanted to put the website live in a temporary location and solicit feedback, especially since some links have changed and not everything may have redirects. However, due to how GitLab Pages worked (and due to me not knowing GitLab CI well enough…) the changes went live before their MR was actually merged. Rather than reverting the change, I decided to keep it (as the old website did not build properly anymore) and to see if anything breaks. So far, no dead links or bad side effects have been observed, but:
If you notice any broken link to specifications.fd.o or anything else weird, please file a bug so that we can fix it!
Thank you, and I hope you enjoy reading the specifications with better rendering and a more coherent look!
I'm happy to announce that my first project regarding support for the NPU in NXP's i.MX 8M Plus SoC has reached the feature-complete stage.
(Photo: CC BY-NC 4.0 Henrik Boye)
For the last several weeks I have been working full-time on adding support for the NPU to the existing Etnaviv driver. Most of the existing code that supports the NPU in the Amlogic A311D was reused, but NXP used a much more recent version of the NPU IP, so some advancements required new code, and this in turn required reverse engineering.
This work has been kindly sponsored by the Open Source consultancy Ideas On Board, for which I am very grateful. I hope this will be useful to those companies that need full mainline support in their products, even if it is just the start. This company is unique in working on both NPU and camera drivers in Linux mainline, so they have the best experience for products that require long-term support and vision processing.
Since the last update I have fixed the last bugs in the compression of the weights tensor and implemented support for a new hardware-assisted way of executing depthwise convolutions. Some improvements on how the tensor addition operation is lowered to convolutions were needed as well.
Performance is pretty good already, allowing for detecting objects in video streams at 30 frames per second, so at a similar performance level to the NPU in the Amlogic A311D. Some performance features are left to be implemented, so I think there is still substantial room for improvement.
Currently the code is very much at a proof-of-concept state. The next step is cleaning it all up and submitting it for review to Mesa3D. In the meantime, you can find the draft code at https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-imx8mp.
A big thanks to Philipp Zabel, who reverse engineered the bitstream format of the weight encoding and added some patches to the kernel that were required for the NPU to work reliably.
In a previous post I gave the context for my pet project ieee1275-rs: a framework to build bootable ELF payloads on Open Firmware (IEEE 1275). OF is a standard developed by Sun for SPARC, aimed at providing a standardized firmware interface that was rich and nice to work with; it was later adopted by IBM and Apple for POWER, and even the OLPC XO.
The crate is intended to provide a similar set of facilities as uefi-rs, that is, an abstraction over the entry point and the interfaces. I started the ieee1275-rs crate specifically for IBM’s POWER platforms, although if people want to provide support for SPARC, G3/4/5s and the OLPC XO I would welcome contributions.
There are several ways the firmware takes a payload to boot. In Fedora we use a PReP partition type, which is a ~4MB partition labeled with the 41h type in MBR, or 9E1A2D38-C612-4316-AA26-8B49521E5A8B as the GUID in the GPT table. The ELF is written as raw data in the partition.
Another alternative is a so-called CHRP script in “ppc/bootinfo.txt”; this script can load an ELF located in the same filesystem, and this is what the bootable CD/DVD installer uses. I have yet to test whether this is something that can be used across Open Firmware implementations.
To avoid compatibility issues, the ELF payload has to be compiled as a 32-bit big-endian binary, as the firmware interface will often assume that endianness and address size.
The entry point
As I entered this problem I had some experience writing UEFI binaries, the entry point in UEFI looks like this:
#![no_main]
#![no_std]
use uefi::prelude::*;
#[entry]
fn main(_image_handle: Handle, mut system_table: SystemTable<Boot>) -> Status {
uefi::helpers::init(&mut system_table).unwrap();
system_table.boot_services().stall(10_000_000);
Status::SUCCESS
}
Basically you get a pointer to a table of functions, and that’s how you ask the firmware to perform system functions for you. I thought that maybe Open Firmware did something similar, so I had a look at how GRUB does this: it uses a PPC assembler snippet that jumps to grub_ieee1275_entry_fn(), and yaboot does a similar thing. I was already grumbling about having to embed an asm binary in my Rust project. But it turns out this snippet conforms to the PPC function calling convention, and since those snippets mostly take care of zeroing the BSS segment, and the ELF that Rust outputs does not generate one (although I am not sure this means there isn’t a runtime one; I need to investigate this further), I decided to just create a small ppc32be ELF binary with the start function at the top of the .text section at address 0x10000.
I have created a repository with the most basic setup that you can run. With some cargo configuration to get the right linking options, and a script to create the disk image with the ELF payload on the PReP partition and run qemu, we can get this source code being run by Open Firmware:
#![no_std]
#![no_main]
use core::{panic::PanicInfo, ffi::c_void};
#[panic_handler]
fn _handler (_info: &PanicInfo) -> ! {
loop {}
}
#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, _entry: extern "C" fn(*mut c_void) -> usize) -> isize {
loop {}
}
Provided we have already created the disk image (check the run_qemu.sh script for more details), we can run our code by executing the following commands:
$ cargo +nightly build --release --target powerpc-unknown-linux-gnu
$ dd if=target/powerpc-unknown-linux-gnu/release/openfirmware-basic-entry of=disk.img bs=512 seek=2048 conv=notrunc
$ qemu-system-ppc64 -M pseries -m 512 --drive file=disk.img
[...]
Welcome to Open Firmware
Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
This program and the accompanying materials are made available
under the terms of the BSD License available at
http://www.opensource.org/licenses/bsd-license.php
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
Ta da! The wonders of getting your firmware to run an infinite loop. Here’s where the fun begins.
Doing something actually useful
Now, to complete the hello world, we need to do something useful. Remember our _entry argument in the _start() function? That’s our gateway to the firmware functionality. Let’s look at how the IEEE 1275 spec tells us we can work with it.
This function is a universal entry point that takes a structure as an argument telling the firmware which service to run; depending on the service, it expects some extra arguments attached. Let’s look at how we can at least print “Hello World!” on the firmware console.
The basic structure looks like this:
#[repr(C)]
pub struct Args {
pub service: *const u8, // null terminated ascii string representing the name of the service call
pub nargs: usize, // number of arguments
pub nret: usize, // number of return values
}
This is just the header of every possible call; nargs and nret determine the size of the memory of the entire argument payload. Let’s look at an example to just exit the program:
#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, entry: extern "C" fn(*mut Args) -> usize) -> isize {
let mut args = Args {
service: "exit\0".as_ptr(),
nargs: 0,
nret: 0
};
entry (&mut args as *mut Args);
0 // The program will exit in the line before, we return 0 to satisfy the compiler
}
When we run it in qemu we get the following output:
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
W3411: Client application returned.
Aha! We successfully called firmware code!
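As a teaser of where this is heading, here is a sketch of my own (not from the example repo) of how printing could work. It assumes we have already obtained the console ihandle from the “stdout” property of /chosen (via the finddevice and getprop services), and uses the “write” service, which takes 3 arguments and returns 1 value:
#[repr(C)]
pub struct WriteArgs {
    pub service: *const u8, // "write\0"
    pub nargs: usize,       // 3 arguments: ihandle, address, length
    pub nret: usize,        // 1 return value: bytes actually written
    pub stdout: usize,      // ihandle of the console, from /chosen's "stdout"
    pub addr: *const u8,    // address of the buffer to print
    pub len: usize,         // length of the buffer
    pub actual: usize,      // filled in by the firmware
}
fn print(entry: extern "C" fn(*mut WriteArgs) -> usize, stdout: usize, msg: &str) {
    let mut args = WriteArgs {
        service: "write\0".as_ptr(),
        nargs: 3,
        nret: 1,
        stdout,
        addr: msg.as_ptr(),
        len: msg.len(),
        actual: 0,
    };
    entry(&mut args as *mut WriteArgs);
}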
To be continued…
To summarize, we’ve learned that we don’t really need assembly code to produce an entry point for our OF bootloader (though we do need to zero our BSS segment if we have one), we’ve learned how to build a valid OF ELF for the PPC architecture, and we’ve learned how to call a basic firmware service.
In a follow-up post I intend to show hello world text output and how the ieee1275 crate helps to abstract away most of the grunt work needed to access common firmware services. Stay tuned!
After Action Report
The DRIL merge is done, and things are mostly working again after a tumultuous week. To recap, here’s everything that went wrong leading up to 24.2-rc1, the reason why it went wrong, and the potential steps that could be taken (but almost certainly won’t) to avoid future issues.
Library Paths
One of the big changes that went in last-minute was a MR linking all the GL frontend libs to Gallium, which is a huge improvement to the old way of using dlopen to directly trigger version mismatch errors.
It had some problems, like how it broke Steam. As some readers may have inferred, this was Very Bad, as my employer has some interest in ensuring that Steam does not break.
The core problem in this case has to do with library paths, distro policies, and Steam’s own library handling:
Mesa’s libGLX/libEGL/libgbm all link directly to libgallium.so now, which means this library must be in the library path
Traditionally, libgallium.so has been installed to ${libdir}/dri
I initially suggested installing it to ${libdir} to avoid library pathing issues, but the criticism I received was that distros would not be friendly towards shipping an unstable library here
Thus, I came upon the decision to use rpath to ensure the dri directory was appended to the library path for libgallium.so
Unfortunately, there are lots of things that don’t fully handle all variations of rpath, chief among them Steam. Furthermore, some distros don’t use the most optimal implementation of rpath (i.e., they use DT_RPATH instead of DT_RUNPATH), which hits those unimplemented parts of Steam.
The reason(s) this managed to land without issues?
I was juggling the MR across multiple repos during final testing and CI mashing when I was trying to get it landed, and an intermediate version of the MR landed which updated all the CI LD_LIBRARY_PATH variables to include ${libdir}/dri - a change I had used for a test run but did not intend to land in the final version
My test machines all add a lot of extra directories to my LD_LIBRARY_PATH to avoid random issues when testing obscure apps
Combined, I wasn’t getting adequate testing, so it appeared everything was fine when really nothing was fine.
Lucky for me, Simon McVittie wrote a full textbook analysis of the issue and possible solutions, so this is now fixed.
Ideally in the future I’ll have better testing environments and won’t be trying to hammer in big MRs minutes before a RC goes out.
FB Configs Went Missing
DRI is (now) a simple interface that tells Xorg which rendering formats can be used for drawables. This is dependent on the device and driver, but fbconfigs aren’t typically something that should vary too much between driver versions. DRIL is meant to split this functionality out of the rest of Mesa so that all the internal interfaces don’t have to be a Gordian Knot.
Unfortunately, this means if DRIL has problems determining which formats are usable, the xserver also has problems. There were a lot of problems:
The original implementation used some pretty suboptimal looping to calculate valid configs (If you’re ever using eglChooseConfigs, you’re probably fucking up) which made it hard to adequately review
The hardcoded list of valid configs was very limited; basically you got 8/8/8/8 with a couple ZS variants, or you got 10/10/10/2
No double-buffered configs
No sRGB variants
This is why there was a sudden deluge of issues about broken colors
On my end, I didn’t check glxinfo output thoroughly enough, nor did I do an exceptionally thorough testing of desktop apps. All the unit tests passed along with CI, which seemed like it should have been enough. Too bad there are no piglit tests which check to see whether various fbconfigs are supported. Maybe I’ll write one to ensure there’s a CI baseline and catch any future regressions.
Drivers Stopped Loading
This is a pretty dumb issue, but it was an issue nonetheless: drivers simply stopped loading. This affected any number of embedded (etnaviv) devices and was fixed by a pretty trivial MR. Also I broke KMSRO, which broke even more devices.
Whoops.
The problem here is there’s no CI testing, and I have no such devices for testing. Hard to evaluate these types of things when they silently fail.
But Now We’re All Good.
I promise.
I have been doing random coding experiments in my spare time that I never got to publicize much outside of my inner circles. I thought I would undust my blog a bit to talk about what I did, in case it is useful for others.
For some background, I used to manage the bootloader team at Red Hat a few years ago alongside Peter Jones and Javier Martinez. I learned a great deal from them, I fell in love with this particular problem space, and I have come to enjoy tinkering with experiments in it.
There are many open challenges in this space that we could use to have a more robust boot path across Linux distros, from boot attestation for the initramfs and cmdline, A/B rollbacks, TPM LUKS decryption (à la BitLocker)… One that particularly interests me is unifying the firmware-kernel boot interface across implementations in the hypothetical absence of GRUB.
Context: the issue with GRUB
The priority of the team was to support RHEL boot path on all the architectures we supported. Namely x86_64 (legacy BIOS & UEFI), aarch64 (UEFI), s390x and ppc64le (Open Power and PowerVM).
These are extremely heterogeneous firmware interfaces; some are on their way to extinction (legacy PC BIOS) and some will remain weird for a while.
GRUB (GRand Unified Bootloader), as its name says, intends to be a unified bootloader for all platforms. GRUB has to support a superset of firmware interfaces, and some of those, like legacy BIOS, do not support much beyond rudimentary disk or network access and basic graphics handling.
To get to load a kernel and its initramfs, this means that GRUB has to implement basic drivers for storage, networking, TCP/IP, filesystems, volume management… every time there is a new storage device technology, we need to implement a driver twice: once in the kernel and once in GRUB itself. GRUB is, for all intents and purposes, an entire operating system that has to be maintained.
The maintenance burden is actually quite big, and recently it has been a target for the InfoSec community after the Boot Hole vulnerability. GRUB is implemented in C and it is an extremely complex code base that is not as well staffed as it should be. It implements its own scripting language (parser et al.) and it is clear there are quite a few CVEs lurking in there.
So, we are basically maintaining code we already have to write, test and maintain in the Linux kernel, in a different OS whose main job (in the context of RHEL, CentOS and Fedora) is to boot a Linux kernel.
This realization led to the initiative that these days is taking shape in the discussions around nmbl (no more boot loader). You can read more about that in that blog post; I am not actively participating in that effort, but I encourage you to read about it. I do want to focus on something else and very specific, which is what you do before you load the nmbl kernel.
Booting from disk
I want to focus on the code that goes from the firmware interface to loading the kernel (nmbl or otherwise) from disk. We want some sort of A/B boot protocol that is somewhat normalized across the platforms we support, we need to pick the kernel from the disk.
The systemd community has led some of the boot modernization initiatives, vocally supporting the adoption of UKI and signed pre-built initramfs images, developing the Boot Loader Spec, and other efforts.
At some point I heard Lennart making the point that we should standardize on using the EFI System Partition as /boot to place the kernel, as most firmware implementations know how to talk to a FAT partition.
This proposal caught my attention, and I have been pondering whether we could have a relatively small codebase, written in a safe language (you know which), that could support a well-defined protocol for A/B booting a kernel on legacy BIOS, s390 and Open Firmware (UEFI and Open Power already support BLS snippets, so we are covered there).
My modest inroad into testing this hypothesis so far has been the development of ieee1275-rs, a Rust module to write programs for the Open Firmware interface. So far I have not been able to load a kernel by myself, but I still think the lessons learned and some of the code could be useful to others. Please note this is a personal experiment and nothing Red Hat is officially working on.
I will be writing more about the technical details of this crate in a follow-up blog post, where I get into some of the details of writing Rust code for a firmware interface; this post is long enough already. Stay tuned.
I’m Cookin
Lot of stuff happening. I can’t talk about much of it yet, but trust me when I say the following:
It’s happening.
When it happens, you’ll know what I meant.
Today Is A Great Day
Remember way back when I put DRI interfaces on notice?
Now, only four months later, DRI interfaces are finally going away.
Begun by @ajax and then finished off by me and Pavel (ghostwritten by @daniels), the DRIL (DRI Legacy) interface is a tiny shim which matches Xorg’s ABI expectations to provide a list of sensible fbconfig formats during startup. Then it does nothing. And by doing nothing, it saves the rest of Mesa from being shackled to ancient ABI constraints.
Let the refactoring begin.
But Wait, There’s More!
Obviously I’m not going to stop here. SGC leaves no code half-krangled. That’s why, as soon as DRIL lands, I’ll also be hammering in this followup MR which finally makes all the GL frontends link directly to the Gallium backend driver.
Why is this so momentous, you ask? How many of you have gotten the error DRI driver not from this Mesa build when trying to use your custom Mesa build?
With this MR, that error is going away. Permanently. Now you can have as many Mesa builds on your system as you want. No longer do you need to set LIBGL_DRIVERS_PATH for any reason.
The future is here.
Hi!
This month wlroots 0.18.0 has been released! This new version includes a fair
share of niceties: ICC profiles, GPU reset recovery, less black screens when
plugging in a monitor on Intel, a whole bunch of new protocol implementations,
and much more. Thanks a lot to all contributors! Two recent merge requests made
it in the release: Kenny’s Vulkan renderer optimizations, and support for the
SIZE_HINTS KMS property to use a smaller cursor plane on Intel to save power.
For the next release we’ll be trying out release candidates to formally focus
on bugfixing and leave time for compositors and language bindings to update and
report issues.
I’ve continued working on various graphics-related topics, for instance the
wlroots implementation of the upcoming ext-screencopy-v1 protocol is now
complete and the protocol itself is almost ready (still figuring out the most
difficult part: how to name it). I also sent out a kernel patch to fix tearing
page-flips when cursor/overlay planes don’t change (and are included in the
atomic commit). I reviewed patches by Enrico Weigelt to improve libdrm’s
portability to OpenBSD and Solaris. Last, I’ve released libdisplay-info 0.2.0
with a new high-level API for colorimetry and support for more
EDID/CTA/DisplayID blocks.
To get the releases over with, let’s briefly mention Goguma 0.7.0. This one
unlocks file uploads, a new look based on Material You with an adaptive color
scheme, many improvements to the iOS port, and text/media can be shared to
Goguma from other apps. slingamn has played with a gamja/Ergo setup configured
with Forgejo as an OAuth server, and it worked nicely after fixing a gamja
SASL-related bug and implementing a missing feature in Forgejo’s OAuth token
introspection endpoint!
Last, I also added a new libscfg API to write files - this can be useful to
auto-generate some configuration files for instance. And I also performed some
more boring X.Org Foundation sysadmin stuff, such as dealing with
domain-related issues, recovering a server running out of disk space again, and
convincing Postfix to start up.
See you next month!
The Igalia Graphics team has been expanding and making significant contributions in the space of open source graphics. An earlier blog post by our team member Lucas provides an excellent insight into the team’s evolution over the past years.
The following series of posts will attempt to summarize the team’s recent engagements:
This post covers our updates on GPU color management, Turnip, V3DV, DRM/KMS, Etnaviv and community events we have been participating in.
The next post will cover news from our CTS, Vulkan Video, Mesa CI, GPU reset work and talks about some new initiatives that recently we got involved in.
Before delving into details, it is worth mentioning the recent highlights: Igalia hosted the 2024 Linux Display Next Hackfest in May this year and the X.Org Developers Conference 2023 in October last year, both in the beautiful city of A Coruña. These events were a huge success in creating a hub for graphics experts to foster open innovation. Continue reading for more details on these events.
A Vibrant Linux #
Last year brought great news for AMD GPU color management: the AMD driver-specific color management properties reached the upstream linux-next! My Igalia colleague Melissa Wen has been spearheading this effort for some time now and has journalled every detail in a series of blog posts.
AMD has been improving its display color management pipeline with each new hardware generation. The new color capabilities, before and after plane composition, can be used by compositors and userspace applications to provide a vibrant experience to the end-user. Exposing AMD driver-specific color properties is a step towards advanced color management on Linux, allowing gamut mapping, HDR rendering, HDR on SDR, and SDR on HDR.
At a very high level, there are two parts to this support:
Upgrading the DRM/KMS Linux interface to expose the new features to the user-space.
One major challenge was the limited DRM/KMS interface, which only exposed a small set of post-blending color properties. Latest AMD Display Core Next hardware has many more post-blending and pre-blending capabilities. Melissa’s work involved mapping these capabilities to the AMD driver’s display core interface and then to the DRM interface. Her blog post provides a brief overview of this extensive mapping effort.
Updating the AMD’s Linux display driver to expose the new hardware features.
AMD DCN 3.0 comes with cutting-edge color capabilities described by Melissa here, and this blog post also talks about AMD’s Linux display subsystem components and about the new properties.
I quote here some of Melissa’s write-ups that helped me get some understanding about this vast subject:
Navigating the Linux display subsystem
Melissa’s XDC2023 talk
Turnip Upgrades #
Turnip, the open-source Vulkan driver for Qualcomm Adreno GPUs, has been receiving major upgrades this year for Qualcomm’s Adreno 7XX GPUs.
From my colleague Danylo Piliaiev’s Turnip update at FOSDEM 2024, Turnip seems to be in a great state; major Vulkan extensions and better debug support, AAA desktop games can now run via FEX + Turnip on Linux, with some from the Termux community even running desktop games on Android with Box64/FEX + Turnip.
The highlight of Danylo’s talk is the A7XX support. The team started the year with A7XX bring up and now ramping on adding support for the new features introduced in A7XX:
Mark Collins, who also represents Igalia at the Khronos Vulkan WG, implemented GMEM rendering for A7XX, which can be considerably faster and more power-efficient than sysmem rendering depending on what’s being rendered. He followed that up with support for unidirectional LRZ, bringing A7XX to parity with A6XX’s GMEM rendering feature set and further boosting performance, with more performance improvements for A7XX on the horizon.
Our colleague Amber Harmonia added support for 64-bit atomic operations on signed and unsigned integers in shaders, and for rasterizing wide lines, while Fixed Stride Draw Table support is work in progress.
In addition to new feature support, we are committed to providing a robust and performant driver.
Recently, Job Noorman has joined our Turnip team to improve the IR3 compiler. He improved the handling of predicate registers and added support for predication. Adreno GPUs have special registers that store the result of a condition, called predicate registers; utilizing these registers can eliminate branches in the generated code, thereby improving performance. Similarly, more than 10% code-size reduction was observed in shader-db with his patch for using rptN instructions.
Turnip has come far and has been giving competition to the Adreno’s proprietary driver recently. Here is Assassin’s Creed running on Adreno + Turnip. Check the FPS on that screen!
Turnip Development Resources #
Danylo usually talks about analyzing some of the major Turnip issues in his series of blog posts “Turnips in the wild” with part 3 being the latest addition. This is exactly what you need to jump start Turnip development.
As always, the team also discovered many new techniques for debugging GPU issues. GPU driver developers want to modify the GPU command stream at run-time to see the outcome of editing it in different ways. Danylo implemented this highly sought-after feature as a tool for Adreno and describes how it can be used.
DRM/KMS Improvements #
The management of the display, graphics and composition in Linux lies in the kernel DRM/KMS framework.
Igalian Maíra Canal provides full disclosure on our notable contributions authoring, reviewing and testing kernel DRM patches, while I provide a few highlights here:
My Igalia colleague André Almeida and Simon Ser have been working on Asynchronous Page Flips, an optimization that allows applications to flip a plane for immediate presentation. The support for this feature is now available in the atomic API. Plus, with André’s patch, it is enabled for all planes including the primary plane if the hardware supports it.
Maíra has been working on features crucial to graphics development on the RPi. She supplied per-client GPU usage statistics as well as global GPU utilization.
In order to ensure continuous job submission to the GPU, stalls caused by CPU jobs submitted from userspace must be prevented. A series of patches from Maíra moved the CPU job mechanisms from the V3DV driver to the V3D kernel driver.
We want more Pi! #
After achieving Vulkan 1.2 conformance on V3DV, the Igalia team working on V3DV has been focusing on instrumental enhancements of the driver. V3DV is the Vulkan driver for Broadcom’s VideoCore GPU on the Raspberry Pi.
The RPi 5 was launched in October last year with a new BCM GPU. Alejandro provided an overview of the team’s journey through V3DV development since the RPi 4 and then talks about the challenges of RPi 5 support in V3DV:
More improvements and new Vulkan extensions were supported last year.
This year Iago landed support for the Vulkan dynamic rendering extension. VK_KHR_dynamic_rendering is a popular Vulkan extension that has added flexibility to the Vulkan API by allowing users to skip render pass and framebuffer objects and start immediate rendering. And now it’s available on the Pi.
As mentioned in the DRM/KMS improvements above, Maíra together with José María Casanova (Chema) and Melissa supported GPU utilization stats and CPU jobs optimization. Here is a snapshot of collection of GPU stats on Pi5:
The RPi 5 continues to use the OpenGL/Wayland-based Wayfire compositor. Christopher was therefore tasked with enabling Wayfire to run on the RPi 3 and 4 as well. He achieved this by implementing a software-rendering Pixman back-end. Check out the demo:
Iago also made some interesting observations while experimenting with SuperTuxKart on the Pi. You will be pleasantly surprised to know how Vulkan out-performed OpenGL.
The team has been working towards Vulkan 1.3 and we will hopefully be able to share more news on that front very soon.
Etnaviv #
Christian Gmeiner, one of the maintainers of Etnaviv (open-source graphics driver for Vivante GPUs), joined our team last year. We are very excited to have him on-board because it is a testament to Igalia’s dedication towards open source graphics software development.
Christian is also enjoying being at Igalia as he discusses in blog post and also reveals his plans for Etnaviv:
Improving Etnaviv’s Gallium driver.
Exposing GLES3.
Moving towards a new back-end compiler.
One of his latest updates is the user-space hardware database. He explains that a user-space driver HW database has been introduced to obtain GPU specific information like GPU features and limits, corresponding to the introduction of an in-kernel hardware database. I am sure this will be super helpful for the reverse engineers out there!
News & Community Events #
Igalians are always eager to share their knowledge and expertise with the open source community by participating in key organizations and events.
Good bye ‘Xorg’ and Hello ‘Linux Foundation’ #
There is quite a trend of Igalians serving on the X.Org Foundation’s Board of Directors. Samuel Iglesias took on this responsibility for a number of terms, but this year he is stepping down. He reminisces about his role in this blog post.
Ricardo, meanwhile, was elected to the board of directors in 2022 and stayed on until Q1 2024, leaving Christopher Michael as the only Igalian currently on the board. In his blog post, Ricardo introduces the X.Org Foundation and also tackles some questions about its future.
Samuel was invited to join the Linux Foundation (Europe) advisory board and he has accepted the invitation. This is a huge milestone for the whole graphics team. Congratulations Sam!
2024 Linux Display Hackfest #
This is a rather new event that has materialized in the Linux community to enhance the Linux display stack.
Melissa’s work on HDR and AMD color management, together with interesting discussions during the XDC 2023 Color Management workshop, paved the way for this year’s event, and so Igalia graciously offered to host it.
The event attracted key participants from the Linux community, AMD, NVIDIA, Google, Fedora, and GNOME, focusing on topics like HDR/color management, variable refresh rate, tearing, multiplane/hardware overlay for video and gaming, real-time scheduling, async KMS API, power saving vs. color/latency, content-adaptive scaling and sharpening, and display control. The success of this event has highlighted the need for future editions.
Embedded Open Source Summit 2024 #
At EOSS this year, we presented the following talks:
Alejandro Piñeiro, Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU
FOSDEM 2024 #
At FOSDEM this year, we presented the following talks:
Danylo Piliaiev, “turnip: Update on Open Source Vulkan Driver for Adreno GPUs”
José María Casanova Crespo, Juan A. Suarez, “Graphics stack updates for Raspberry Pi devices”
Vulkanised 2024 #
At Vulkanised this year, we presented the following talks:
Stéphane Cerveau & Hyunjun Ko, “Implementing a Vulkan Video Encoder From Mesa to GStreamer”
Iago Toral, Faith Ekstrand, “8 Years of Open Drivers, including the State of Vulkan in Mesa”
Igalians who attended the event found it quite informative on the subject.
XDC 2023 #
Igalia hosted XDC 2023 in A Coruña, the city of our headquarters. We also presented many talks and demos.
Melissa Wen, “The rainbow treasure map: advanced color management on Linux with AMD/Steam Deck”
Danylo Piliaiev, “Debugging GPU faults: QoL tools for your driver”
Eric Engestrom with Martin Roukala and David Heidelberg, “Hosting a CI system at home - Slaying the regression dragon to bring stability to driver kingdom”
Iago Toral, Juan A. Suarez, Maíra Canal, “On-going challenges in the Raspberry Pi driver stack: OpenGL 3, Vulkan and more”
Maíra Canal, Melissa Wen, “Status Update of the VKMS DRM driver”
André Almeida, “Having fun with GPU resets in Linux”
Lucas Fryzek, “Freedreno on Android”
Christian Gmeiner, “etnaviv: status update”
The lightning talks and demos had an equally active participation from Igalia:
Christopher Michael, “Wayfire - Making an OpenGL Wayland compositor render using Pixman”
Guilherme G. Piccoli, “To crash or not to crash: if you do, at least recover fast!”
Charles Turner, “Status of the Vulkan Video ecosystem”
Alejandro Piñeiro, “v3dv: experience using gfxreconstruct/apitrace traces for performance evaluation”
Eric Engestrom, “Being a Mesa release maintainer”
Workshops were organized to discuss larger subjects like advanced color management (discussion summary) and continuous integration (discussion summary).
The Future #
The Igalia graphics team has profound expertise in Mesa, Vulkan, OpenGL and the Linux kernel. We have also embraced new and really interesting graphics technologies, which I talk about in my next post.
Note: This blog post is part 1 of a series of blog posts about isaspec and its usage in the etnaviv GPU stack.
I will add here links to the other blog posts, once they are published.
The first time I heard about isaspec, I was blown away by the possibilities it opens. I am really thankful that Igalia made it possible to complete this crucial piece of core infrastructure for the etnaviv GPU stack.
If isaspec is new to you, here is what the Mesa docs have to tell about it:
isaspec provides a mechanism to describe an instruction set in XML, and generate a disassembler and assembler. The intention is to describe the instruction set more formally than hand-coded assembler and disassembler, and better decouple the shader compiler from the underlying instruction encoding to simplify dealing with instruction encoding differences between generations of GPU.
Benefits of a formal ISA description, compared to hand-coded assemblers and disassemblers, include easier detection of new bit combinations that were not seen before in previous generations due to more rigorous description of bits that are expected to be ‘0’ or ‘1’ or ‘x’ (dontcare) and verification that different encodings don’t have conflicting bits (i.e. that the specification cannot result in more than one valid interpretation of any bit pattern).
If you are interested in more details, I highly recommend Rob Clark’s introduction to isaspec presentation.
Target ISA
Vivante uses a fixed-size (128 bits), predictable instruction format with explicit inputs and outputs.
As of today, there are three different encodings seen in the wild:
Base Instruction Set
Extended Instruction Set
Enhanced Vision Instruction Set (EVIS)
Why do I want to switch to isaspec
There are several reasons…
The current state
The current ISA documentation is not very explicit and leaves a lot of room for interpretation and speculation. One thing it does provide are some nice explanations of what an instruction does. isaspec does not support <doc> tags yet, but there is a PoC MR that generates really nice-looking and informative ISA documentation based on the XML.
I think you might soon find all of etnaviv’s isaspec documentation at docs.mesa3d.org.
No unit tests
There are no unit tests based on instructions generated by the blob driver. This might not sound too bad, but it opens the door to generating badly encoded instructions that could trigger all sorts of weird and hard-to-debug problems. Such breakages could be caused by a compiler rework, etc.
In an ideal world, there would be a unit test that does the following:
Disassembles the binary representation of an instruction from the blob to a string representation.
Verifies that it matches our expectation.
Assembles the string representation back to 128 bits.
Verifies that it matches the binary representation from the blob driver.
This is our ultimate goal, and we really must reach it. etnaviv will not be the only driver doing such deep unit testing - freedreno, for example, does it too.
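A minimal sketch of such a round-trip test, following the four steps above (etnaviv_disasm_str and etnaviv_assemble are hypothetical stand-ins for whatever the generated entry points end up being):
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the isaspec-generated entry points. */
void etnaviv_disasm_str(const uint32_t bits[4], char *buf, size_t len);
void etnaviv_assemble(const char *asm_str, uint32_t bits_out[4]);

/* Round-trip one blob-generated instruction through the disassembler
 * and the assembler, checking both directions bit-exactly. */
static void
check_roundtrip(const uint32_t blob_bits[4], const char *expected_asm)
{
   char buf[256];
   uint32_t reassembled[4];

   etnaviv_disasm_str(blob_bits, buf, sizeof(buf));   /* steps 1+2 */
   assert(strcmp(buf, expected_asm) == 0);

   etnaviv_assemble(expected_asm, reassembled);       /* steps 3+4 */
   assert(memcmp(reassembled, blob_bits, sizeof(reassembled)) == 0);
}

/* e.g. an all-zero instruction should round-trip as a nop:
 * check_roundtrip((const uint32_t[4]){0, 0, 0, 0},
 *                 "nop void, void, void, void"); */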
Easier to understand code
Do you remember the rusticl OpenCL attempt for etnaviv? It contains lines like:
if (nir_src_is_const(intr->src[1])) {
inst.tex.swiz = 128;
}
if (rmode == nir_rounding_mode_rtz)
inst.tex.amode = 0x4 + INST_ROUND_MODE_RTZ;
else /*if (rmode == nir_rounding_mode_rtne)*/
inst.tex.amode = 0x4 + INST_ROUND_MODE_RTNE;
Do you clearly see what is going on? Why do we need to set tex.amode for an ALU instruction?
I always found it quite disappointing to see such code snippets. Sure, they mimic what the blob driver is doing, but you lose all the knowledge about why these bits are used that way mere days after you worked on it. There must be a cleaner, more understandable, and thus more maintainable way to document the ISA.
This situation might become even worse if we want to support the other encodings and could end up with more of these bad patterns, resulting in a maintenance nightmare.
Oh, and if you wonder what happened to OpenCL and etnaviv - I promise there will be an update later this year.
Python opens the door to generating a lot of code
As isaspec is written in Python, it is really easy to extend it and add support for new functionality.
At its core, we can generate a disassembler and an assembler based on isaspec. This alone saves us from writing a lot of code that needs to be kept in sync with all the ISA reverse engineering findings that happen over time.
As isaspec is just an ordinary XML file, you can use any programming language you like to work with it.
One source of truth
I really fell in love with the idea of having one source of truth that models our target ISA, contains written documentation, and extends each opcode with meta information that can be used in the upper layers of the compiler stack.
Missing Features
I think I have sold you the idea quite well, so surely it must be a matter of just a few days to switch to it.
Sadly no, as there are some missing features:
Only ISAs up to 64 bits wide are supported
Its home is in src/freedreno
Alignment support is missing
No <meta> tags are supported
Add support for 128 bit wide instructions
The first big MR I worked on extended the BITSET API with features needed for isaspec - bitwise AND, OR, and NOT, plus left shifts.
The next step was to switch isaspec to use the BITSET API to support wider ISAs. This resulted in a lot of commits, as some new APIs were needed to handle this new feature. After these 31 commits, we were able to start looking into isaspec support for etnaviv.
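To see why new API was needed: once an instruction is wider than 64 bits it no longer fits in a single integer, so even a left shift has to be carried out word by word. A simplified sketch of the idea (not the actual Mesa BITSET code):
#include <stdint.h>

#define BITSET128_WORDS 4 /* 4 x 32 bits = one 128-bit instruction */

/* Shift a 128-bit value left by n bits, stored least-significant word
 * first. Illustrates what a multi-word bitset shift has to do. */
static void
bitset128_shl(uint32_t x[BITSET128_WORDS], unsigned n)
{
   const unsigned word_shift = n / 32;
   const unsigned bit_shift = n % 32;

   for (int i = BITSET128_WORDS - 1; i >= 0; i--) {
      uint32_t hi = 0, lo = 0;
      if (i >= (int)word_shift)
         hi = x[i - word_shift] << bit_shift;      /* bits staying in word */
      if (bit_shift && i > (int)word_shift)
         lo = x[i - word_shift - 1] >> (32 - bit_shift); /* carried-in bits */
      x[i] = hi | lo;
   }
}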
Decode Support
Now it is time to start writing an isaspec XML for etnaviv, and the easiest opcode to start with is the nop. As the name suggests, it does nothing and has no src’s, no dst, or any other modifier.
As I do not have this initial version anymore, I tried to recreate it - it might have looked something like this:
<?xml version="1.0" encoding="UTF-8"?>
<isa>
<bitset name="#instruction">
<display>
{NAME} void, void, void, void
</display>
<pattern low="6" high="10">00000</pattern>
<pattern pos="11">0</pattern>
<pattern pos="12">0</pattern>
<pattern low="13" high="26">00000000000000</pattern>
<pattern low="27" high="31">00000</pattern>
<pattern pos="32">0</pattern>
<pattern pos="33">0</pattern>
<pattern pos="34">0</pattern>
<pattern low="35" high="38">0000</pattern>
<pattern pos="39">0</pattern>
<pattern low="40" high="42">000</pattern>
<!-- SRC0 -->
<pattern pos="43">0</pattern> <!-- SRC0_USE -->
<pattern low="44" high="52">000000000</pattern> <!-- SRC0_REG -->
<pattern pos="53">0</pattern>
<pattern low="54" high="61">00000000</pattern> <!-- SRC0_SWIZ -->
<pattern pos="62">0</pattern> <!-- SRC0_NEG -->
<pattern pos="63">0</pattern> <!-- SRC0_ABS -->
<pattern low="64" high="66">000</pattern> <!-- SRC0_AMODE -->
<pattern low="67" high="69">000</pattern> <!-- SRC0_RGROUP -->
<!-- SRC1 -->
<pattern pos="70">0</pattern> <!-- SRC1_USE -->
<pattern low="71" high="79">000000000</pattern> <!-- SRC1_REG -->
<pattern low="81" high="88">00000000</pattern> <!-- SRC1_SWIZ -->
<pattern pos="89">0</pattern> <!-- SRC1_NEG -->
<pattern pos="90">0</pattern> <!-- SRC1_ABS -->
<pattern low="91" high="93">000</pattern> <!-- SRC1_AMODE -->
<pattern pos="94">0</pattern>
<pattern pos="95">0</pattern>
<pattern low="96" high="98">000</pattern> <!-- SRC1_RGROUP -->
<!-- SRC2 -->
<pattern pos="99">0</pattern> <!-- SRC2_USE -->
<pattern low="100" high="108">000000000</pattern> <!-- SRC2_REG -->
<pattern low="110" high="117">00000000</pattern> <!-- SRC2_SWIZ -->
<pattern pos="118">0</pattern> <!-- SRC2_NEG -->
<pattern pos="119">0</pattern> <!-- SRC2_ABS -->
<pattern pos="120">0</pattern>
<pattern low="121" high="123">000</pattern> <!-- SRC2_AMODE -->
<pattern low="124" high="126">000</pattern> <!-- SRC2_RGROUP -->
<pattern pos="127">0</pattern>
</bitset>
<!-- opcodes sorted by opc number -->
<bitset name="nop" extends="#instruction">
<pattern low="0" high="5">000000</pattern> <!-- OPC -->
<pattern pos="80">0</pattern> <!-- OPCODE_BIT6 -->
</bitset>
</isa>
With the knowledge of the old ISA documentation, I went fishing for instructions. I only used instructions from the binary blob for this process. It is quite important for me to have as many unit tests as I can write, so as not to break any decoding with isaspec XML changes I make. And they were a huge lifesaver at the time.
After I reached almost feature parity with the old disassembler, I thought it was time to land etnaviv.xml and replace the current handwritten disassembler with a generated one - yeah, so I submitted an MR to make the switch.
As this is only a driver-internal disassembler used by maybe 2-3 human beings, it would not have been a problem if there were some regressions.
Today I would say the isaspec disassembler is superior to the handwritten one.
Encode Support
The next item on my list was to add encoding support. As you can imagine, some work was needed upfront to support ISAs that are wider than 64 bits. This time the MR only contains two commits 😄.
With everything ready, it was time to add isaspec-based encoding support to etnaviv.
The goal is to drop our custom (and too simple) assembler and switch to one that is powered by isaspec.
This opens the door to:
Modeling special cases for instructions like a branch with no src’s to a new jump instruction.
Doing the NIR src -> instruction src mapping in isaspec.
Supporting different instruction encodings.
Adding meta information to instructions.
Supporting special instructions that are used in compiler unit tests.
In the end, all the magic that is needed is shown in the following diff:
diff --git a/src/etnaviv/isa/etnaviv.xml b/src/etnaviv/isa/etnaviv.xml
index eca8241a2238a..c9a3ebe0a40c2 100644
--- a/src/etnaviv/isa/etnaviv.xml
+++ b/src/etnaviv/isa/etnaviv.xml
@@ -125,6 +125,13 @@ SPDX-License-Identifier: MIT
<field name="AMODE" low="0" high="2" type="#reg_addressing_mode"/>
<field name="REG" low="3" high="9" type="uint"/>
<field name="COMPS" low="10" high="13" type="#wrmask"/>
+
+ <encode type="struct etna_inst_dst *">
+ <map name="DST_USE">p->DST_USE</map>
+ <map name="AMODE">src->amode</map>
+ <map name="REG">src->reg</map>
+ <map name="COMPS">p->COMPS</map>
+ </encode>
</bitset>
<bitset name="#instruction" size="128">
@@ -137,6 +144,46 @@ SPDX-License-Identifier: MIT
<derived name="TYPE" type="#type">
<expr>{TYPE_BIT2} << 2 | {TYPE_BIT01}</expr>
</derived>
+
+ <encode type="struct etna_inst *" case-prefix="ISA_OPC_">
+ <map name="TYPE_BIT01">src->type & 0x3</map>
+ <map name="TYPE_BIT2">(src->type & 0x4) >> 2</map>
+ <map name="LOW_HALF">src->sel_bit0</map>
+ <map name="HIGH_HALF">src->sel_bit1</map>
+ <map name="COND">src->cond</map>
+ <map name="RMODE">src->rounding</map>
+ <map name="SAT">src->sat</map>
+ <map name="DST_USE">src->dst.use</map>
+ <map name="DST">&src->dst</map>
+ <map name="DST_FULL">src->dst_full</map>
+ <map name="COMPS">src->dst.write_mask</map>
+ <map name="SRC0">&src->src[0]</map>
+ <map name="SRC0_USE">src->src[0].use</map>
+ <map name="SRC0_REG">src->src[0].reg</map>
+ <map name="SRC0_RGROUP">src->src[0].rgroup</map>
+ <map name="SRC0_AMODE">src->src[0].amode</map>
+ <map name="SRC1">&src->src[1]</map>
+ <map name="SRC1_USE">src->src[1].use</map>
+ <map name="SRC1_REG">src->src[1].reg</map>
+ <map name="SRC1_RGROUP">src->src[1].rgroup</map>
+ <map name="SRC1_AMODE">src->src[1].amode</map>
+ <map name="SRC2">&src->src[2]</map>
+ <map name="SRC2_USE">src->src[2].use</map>
+ <map name="SRC2_REG">src->src[2].reg</map>
+ <map name="SRC2_RGROUP">src->src[2].rgroup</map>
+ <map name="SRC2_AMODE">src->src[2].amode</map>
+
+ <map name="TEX_ID">src->tex.id</map>
+ <map name="TEX_SWIZ">src->tex.swiz</map>
+ <map name="TARGET">src->imm</map>
+
+ <!-- sane defaults -->
+ <map name="PMODE">1</map>
+ <map name="SKPHP">0</map>
+ <map name="LOCAL">0</map>
+ <map name="DENORM">0</map>
+ <map name="LEFT_SHIFT">0</map>
+ </encode>
</bitset>
<bitset name="#src-swizzle" size="8">
@@ -148,6 +195,13 @@ SPDX-License-Identifier: MIT
<field name="SWIZ_Y" low="2" high="3" type="#swiz"/>
<field name="SWIZ_Z" low="4" high="5" type="#swiz"/>
<field name="SWIZ_W" low="6" high="7" type="#swiz"/>
+
+ <encode type="uint8_t">
+ <map name="SWIZ_X">(src & 0x03) >> 0</map>
+ <map name="SWIZ_Y">(src & 0x0c) >> 2</map>
+ <map name="SWIZ_Z">(src & 0x30) >> 4</map>
+ <map name="SWIZ_W">(src & 0xc0) >> 6</map>
+ </encode>
</bitset>
<enum name="#thread">
@@ -272,6 +326,13 @@ SPDX-License-Identifier: MIT
</expr>
</derived>
</override>
+
+ <encode type="struct etna_inst_src *">
+ <map name="SRC_SWIZ">src->swiz</map>
+ <map name="SRC_NEG">src->neg</map>
+ <map name="SRC_ABS">src->abs</map>
+ <map name="SRC_RGROUP">p->SRC_RGROUP</map>
+ </encode>
</bitset>
<bitset name="#instruction-alu-no-src" extends="#instruction-alu">
One nice side effect of this work is the removal of the isa.xml.h file that has been part of etnaviv since day one. We are able to generate all of its contents with isaspec and some custom python3 scripts. Moving instruction src swizzling from the driver into etnaviv.xml was super easy - less code to maintain!
Summary
I am really happy with the end result, even though it took quite some time from the initial idea to the point when everything was integrated into Mesa’s main git branch.
There is so much more to share - I can’t wait to publish parts II and III.
Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+). This work is sponsored by the open source consultancy Ideas On Board, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.
Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.
For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low level, the weights are encoded with Huffman, with zero run-length encoding on top. This low-level encoding has already been reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.
How weights are laid out on top of that encoding is also different, so I had to reverse engineer it and implement it in the Mesa driver. That, plus some changes to how tiling is computed, got basic convolutions working; then I moved on to strided convolutions. Pointwise convolutions were supported at the same time as basic convolutions, as they are no different on this particular hardware.
Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that don't hold on this hardware.
So I have been reverse engineering these differences, and now I have all kinds of strided convolutions supported up to 32 output channels. I feel that these will be done after addressing a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.
Afterwards I will look at depthwise convolutions, which may be supported natively by this hardware, whereas on the A311D they were lowered to basic convolutions. Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.
I'm very grateful to Ideas On Board for sponsoring this work, for their trust in me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.
I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at Toradex, thanks!
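As a footnote on the weight encoding mentioned above: zero run-length encoding replaces runs of zero weights with a count. A toy illustration of the general idea (not the NPU's actual bitstream format, which layers this under Huffman coding):
#include <stddef.h>
#include <stdint.h>

/* Toy zero run-length decoder: each zero byte in the stream is followed
 * by a run length. Illustrative only; the real VeriSilicon weight
 * encoding differs in the details. */
static size_t
zrle_decode(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap)
{
   size_t o = 0;
   for (size_t i = 0; i < in_len && o < out_cap; i++) {
      if (in[i] == 0 && i + 1 < in_len) {
         uint8_t run = in[++i];                    /* length of zero run */
         for (uint8_t r = 0; r < run && o < out_cap; r++)
            out[o++] = 0;
      } else {
         out[o++] = in[i];                         /* literal weight */
      }
   }
   return o; /* number of decoded weights */
}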
My current project at Igalia has had me working on Mesa’s software
renderers, llvmpipe and lavapipe. I’ve been working to get them running
on Android, and I wanted to document the progress I’ve made, the
challenges I’ve faced, and talk a little bit about the development
process for a project like this. My work is not totally merged into
upstream mesa yet, but you can see the MRs I made here:
llvmpipe:
Add android platform integration
u_gralloc/fallback:
Set fd from handle directly
llvmpipe
& lavapipe: Implement sync fd import/export extensions
lavapipe:
Implement VK_EXT_external_memory_dma_buf
Setting up an
Android development environment
Getting system-level software to build and run on Android is
unfortunately not straightforward. Since we are doing software rendering
we don’t need a physical device and can instead make use of the
Android emulator. If you didn’t know, Android has two emulators: the
common one most people use is “goldfish”, and the other, lesser-known
one is “cuttlefish”. For this project I did my work on the cuttlefish
emulator as it’s meant for testing the Android OS itself rather than
just Android apps, and is more reflective of real hardware. The
cuttlefish emulator takes a little more work to set up, and I’ve found
that it only works properly on Debian-based Linux distros. I run
Fedora, so I had to run the emulator in a Debian VM.
Thankfully Google has good instructions for building and running
cuttlefish, which you can find here.
The instructions show you how to setup the emulator using nightly build
images from Google. We’ll also need to setup our own Android OS images
so after we’ve confirmed we can run the emulator, we need to start
looking at building AOSP.
For building our own AOSP image, we can also follow the instructions
from Google here.
For the target we’ll want
aosp_cf_x86_64_phone-trunk_staging-eng. At this point it’s
a good idea to verify that you can build the image, which you can do by
following the rest of the instructions on the page. Building AOSP from
source does take a while though, so prepare to wait potentially an
entire day for the image to build. Also if you get errors complaining
that you’re out of memory, you can try to reduce the number of parallel
builds. Google officially recommends having 64GB of RAM; I only had
32GB, so some packages had to be built with parallelism set to 1
so I wouldn’t run out of RAM.
For running this custom-built image on Cuttlefish, you can just copy
all the *.img files from
out/target/product/vsoc_x86_64/ to the root cuttlefish
directory, and then launch cuttlefish. If everything worked successfully
you should be able to see your custom built AOSP image running in the
cuttlefish webui.
Building Mesa targeting
Android
Working from the changes in MR !29344
building llvmpipe or lavapipe targeting Android should just work™️. To
get to that stage required a few changes. First llvmpipe actually
already had some support on Android, as long as it was running on a
device that supports a DRM display driver. In that case it could use the
dri window system integration which already works on
Android. I wanted to get llvmpipe (and lavapipe) running without dri, so
I had to add support for Android in the drisw window system
integration.
To support Android in drisw, this mainly meant adding
support for importing dmabuf as framebuffers. The Android windowing
system will provide us with a “gralloc” buffer which inside has a dmabuf
fd that represents the framebuffer. Adding support for importing dmabufs
in drisw means we can import and begin drawing to these frame buffers.
Most of the changes to support that can be found in drisw_allocate_textures
and the underlying changes to llvmpipe to support importing dmabufs in
MR !27805.
The EGL Android platform code also needed some changes to use the
drisw window system code. Previously this code would only
work with true dri drivers, but with some small tweaks it was possible
to have it initialize the drisw window system and then use it
for rendering if no hardware devices are available.
For lavapipe the changes were a lot simpler. The Android Vulkan
loader requires your driver to have HAL_MODULE_INFO_SYM
symbol in the binary, so that got created and populated correctly,
following other Vulkan drivers in Mesa like turnip. Then the image
creation code had to be modified to support the
VK_ANDROID_native_buffer extension which allows the Android
Vulkan loader to create images using Android native buffer handles.
Under the hood this means getting the dmabuf fd from the native buffer
handle. Thankfully mesa already has some common code to handle this, so
I could just use that. Some other small changes were also necessary to
address crashes and other failures that came up during testing.
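For reference, the symbol the Android Vulkan loader looks for is a hwvulkan module descriptor. A rough sketch of such a definition, modeled on other Mesa drivers like turnip (the names and field values here are illustrative, not lavapipe's actual code):
#include <hardware/hwvulkan.h>

/* Rough sketch of the module descriptor the Android Vulkan loader looks
 * up after dlopen()ing the driver; lvp_hal_open is a hypothetical open
 * callback and the string fields are illustrative. */
static int
lvp_hal_open(const struct hw_module_t *mod, const char *id,
             struct hw_device_t **dev)
{
   /* ... create the hwvulkan device and fill *dev ... */
   return 0;
}

static hw_module_methods_t hal_module_methods = {
   .open = lvp_hal_open,
};

struct hwvulkan_module_t HAL_MODULE_INFO_SYM = {
   .common = {
      .tag = HARDWARE_MODULE_TAG,
      .module_api_version = HWVULKAN_MODULE_API_VERSION_0_1,
      .hal_api_version = HARDWARE_MAKE_API_VERSION(1, 0),
      .id = HWVULKAN_HARDWARE_MODULE_ID,
      .name = "Software Vulkan HAL",  /* illustrative */
      .author = "Mesa contributors",
      .methods = &hal_module_methods,
   },
};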
With the changes out of the way we can now start building Mesa on
Android. For this project I had to update the Android documentation for
Mesa to include steps for building LLVM for Android since the version
Google ships with the NDK is missing libraries that llvmpipe/lavapipe
need to function. You can see the updated documentation here
and here.
After sorting out LLVM, building llvmpipe/lavapipe is the same as
building any other Mesa driver for Android: we setup a cross file to
tell meson how to cross compile and then we run meson. At this point you
could manually modify the Android image and copy these files to the VM,
but I also wanted to support building a new AOSP image directly
including the driver. In order to do that you also have to rename the
driver binaries to match Android’s naming convention, and make sure
SO_NAME matches as well. If you check out this
section of the documentation I wrote, it covers how to do that.
If you followed all of that you should have built a version of
llvmpipe and lavapipe that you can run on Android’s cuttlefish
emulator.
Android running lavapipe
References
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29344
Main MR with Android changes
https://source.android.com/docs/devices/cuttlefish/get-started
Google’s official guide for getting started with the Cuttlefish
emulator
https://source.android.com/docs/setup/build/building
Google’s official guide for building AOSP images
https://gitlab.freedesktop.org/mesa/mesa/-/blob/9705df53408777d493eab19e5a58c432c1e75acb/docs/drivers/llvmpipe.rst
My updated documentation in MR for llvmpipe
https://gitlab.freedesktop.org/mesa/mesa/-/blob/9705df53408777d493eab19e5a58c432c1e75acb/docs/android.rst
My updated documentation in MR for Android integration in mesa
Over the last months I've started looking into a few of the papercuts that affect graphics tablet users in GNOME. So now that most of those have gone in, let's see what has happened:
Calibration fixes and improvements (GNOME 47)
The calibration code, a descendant of the old xinput_calibrator tool, was in pretty rough shape and didn't work particularly well. That's now fixed, and I've made the calibrator a little easier to use too. Previously the timeout was quite short, which made calibration quite stressful; that timeout is now per target rather than for the whole calibration process. Likewise, the calibration targets now accept larger variations - something probably not needed for real use-cases (you want the calibration to be exact) but it certainly makes testing easier, since clicking near the target is good enough.
The other feature added was to allow calibration even when the tablet is manually mapped to a monitor. Previously this only worked in the "auto" configuration, but some tablets don't correctly map to the right screen and thus lost the ability to calibrate. That's fixed now too.
A picture says a thousand words, except in this case where the screenshot provides no value whatsoever. But here you have it anyway.
Generic tablet fallback (GNOME 47)
Traditionally, GNOME would rely on libwacom for information about tablets so it could present users with the right configuration options. The drawback was that a tablet not recognised by libwacom didn't exist in GNOME Settings - and there was no immediately obvious way of fixing this: the panel either didn't show up or (with multiple tablets) the unrecognised one was missing. The tablet worked (because the kernel and libinput don't require libwacom) but it just couldn't be configured.
libwacom 2.11 changed the default fallback tablet to be a built-in one, since this is now the most common unsupported tablet we see. Together with the new fallback handling in GNOME Settings, this means that any unsupported tablet is treated as a generic built-in tablet and provides the basic configuration options (Map to Monitor, Calibrate, assigning stylus buttons). The tablet should still be added to libwacom, but at least it's no longer a requirement for configuration. Plus there's now a link to the GNOME Help to explain things. Below is a screenshot of how this looks (after modifying my libwacom to no longer recognise the tablet, poor Intuos).
Monitor mapping names (GNOME 47)
For historical reasons, the names of the display in the GNOME Settings Display configuration differed from the one used by the Wacom panel. Not ideal and that bit is now fixed with the Wacom panel listing the name of the monitor and the connector name if multiple monitors share the same name. You get the best value out of this if you have a monitor vendor with short names. (This is not a purchase recommendation).
Highlighted SVGs (GNOME 46)
If you're an avid tablet user, you may have multiple stylus tools - but it's also likely that you have multiple tools of the same type which makes differentiating them in the GUI hard. Which is why they're highlighted now - if you bring the tool into proximity, the matching image is highlighted to make it easier to know which stylus you're about to configure. Oh, and in the process we added a new SVG for AES styli too to make the picture look more like the actual physical tool. The <blink> tag may no longer be cool but at least we can disco our way through the stylus configuration now.
More Pressure Curves (GNOME 46)
GNOME Settings historically presents a slider from "Soft" to "Firm" to adjust the feel of the tablet tip (which influences the pressure values sent to the application). Behind the scenes this was converted into a set of 7 fixed curves, but thanks to an old mutter bug those curves only covered a small amount of the possible range. This is now fixed, so you can really go from pencil-hard to jelly-soft, and the slider now controls an almost-continuous range instead of just 7 curves. Behold, a picture of slidery goodness:
Miscellaneous fixes
And of course a bunch of miscellaneous fixes: support for Alt in the tablet pad keymappings, a fix for erroneous backwards movement when wrapping around on the ring, a long-standing stylus button mismatch, better stylus naming, and a rather odd fix for configuration issues when the eraser was the first tool ever brought into proximity.
There are a few more things in the pipe but I figured this is enough to write a blog post so I no longer have to remember to write a blog post about all this.
You Would Not Believe This Month
I don’t have a lot of time. There’s a gun to my head. Literally.
John Eldenring is here, and he has a gun pointed at my temple, and he’s telling me that if I don’t start playing his new downloadable content now, I won’t be around to make any more posts.
The Game
Today’s game is Blazblue Centralfiction. I don’t know what this game is, I don’t know what it’s about, I don’t even know what genre it is, but it’s got problems.
What kinds of problems?
This is a DX game that embeds video files. In proton, GStreamer is used to play the video files while DXVK goes brrrr in the background. GStreamer uses GL. What do you think this looks like under perf?
The Problem
GStreamer here does the classic memcpy PIXEL_PACK_BUFFER texture upload -> glReadPixels memcpy download in order to transcode the video files into something renderable. I say classic because this is a footgun on both ends:
each upload is full-frame (huge)
each download is full-frame (huge)
This results in blocking at both ends of the transcode pipeline. A better choice, for Mesa’s current state of optimization, would’ve been to do glTexImage -> glGetTexImage. This would leverage all the work that I did however many years ago in this post for PBO downloads using compute shaders.
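A sketch of the two paths side by side (w/h, the PBOs and the bound texture are assumed to be set up elsewhere; the glReadPixels path additionally assumes the texture is attached to the bound framebuffer, and a loader exposing GL 2.1+ entry points is assumed):
#include <GL/gl.h>

static void
transcode_paths(GLsizei w, GLsizei h, GLuint upload_pbo,
                GLuint download_pbo, void *pixels)
{
   /* What GStreamer does: staging through PBOs on both ends. */
   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, upload_pbo);
   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                   GL_RGBA, GL_UNSIGNED_BYTE, NULL);  /* full-frame upload */
   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

   glBindBuffer(GL_PIXEL_PACK_BUFFER, download_pbo);
   glReadPixels(0, 0, w, h, GL_RGBA,
                GL_UNSIGNED_BYTE, NULL);              /* full-frame download */
   glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

   /* The path that would have been faster with Mesa's then-current
    * optimizations: plain texture transfers, which hit the
    * compute-shader PBO download work mentioned above. */
   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                   GL_RGBA, GL_UNSIGNED_BYTE, pixels);
   glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}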
Still, this is the future, so Mesa must adapt. With a few copy/pasted lines and a sprinkle of magical SGC dust (massive compute-based PBO shaders), the flamegraph becomes:
Flamegraphs aren’t everything though. This has more obvious real world results just from trace replay times:
# before
$ time glretrace -b BlazBlue.Centralfiction.trace
/home/zmike/.local/share/Steam/steamapps/common/Proton - Experimental/files/bin/wine-preloader
Rendered 0 frames in 15.4088 secs, average of 0 fps
glretrace -b BlazBlue.Centralfiction.trace 13.88s user 0.35s system 91% cpu 15.514 total
# after
$ time glretrace -b BlazBlue.Centralfiction.trace
/home/zmike/.local/share/Steam/steamapps/common/Proton - Experimental/files/bin/wine-preloader
Rendered 0 frames in 10.6251 secs, average of 0 fps
glretrace -b BlazBlue.Centralfiction.trace 9.83s user 0.42s system 95% cpu 10.747 total
Considering this trace only captured the first 4-5 seconds of a 98 second movie, I’d say that’s damn good.
Check out the MR if you want to test.
Hi all!
This status update will be shorter than usual because I had a lot less free
time for my open-source projects than usual this month. Indeed, I recently
joined SNCF Réseau (the company responsible for the French railway
infrastructure) to work on OSRD, an open-source tool to design and operate
railway networks. The project’s immediate goal is to fit new freight trains in
an existing timetable a few days in advance, but the longer term scope is much
larger. Working partly on-site in a big team is quite the change of pace but I
like it so far!
I’ve released a lot of new versions this month! The big one is Wayland 1.23.0
which adds a mechanism to set the size of the internal connection buffer, an
enum-header mode for wayland-scanner to generate a header with only enums,
auto-generated enum validator functions for compositors, a new
deprecated-since attribute to mark parts of protocols as deprecated, and a
few other niceties. libliftoff 0.5.0 prioritizes layers that are frequently
updated, adds performance optimizations (a fast path when the intersection of
layers doesn’t change, a fast path for standard KMS properties, an early return
to avoid needlessly trying potential solutions) and a timeout to avoid stalling
the compositor for too long. soju 0.8.0 adds a new file upload IRC extension,
adds support for Unix domain sockets for HTTP and WebSocket listeners and
better spreads the load on multiple upstream servers on large deployments.
kanshi 1.7.0 adds output defaults and aliases. Phew, that was a mouthful!
In other Wayland news, the xdg-toplevel-icon protocol got merged after a long
and difficult process. I really hope we can improve the contribution experience
for future proposals. We realized that the governance document was missing the
review requirements, so I fixed that along the way. The wlroots
linux-drm-syncobj-v1 implementation has been merged (it’s been used by
gamescope for a few months - note that this does not include the wlroots
renderer, backend and scene-graph changes). Multiple wlroots versions can now
be installed side-by-side thanks to Violet Purcell. Sway has gained a new
color_profile output command to apply an ICC profile to an
output thanks to M. Stoeckl. A high-level API for colorimetry has
been added in libdisplay-info thanks to Pekka Paalanen, and support for HDMI
audio data blocks has been implemented by Sebastian Wick.
Let’s switch gears and talk about IRC updates. I’ve submitted an IRCv3 proposal
to fix a few ISUPPORT deficiencies - it will need a lot more
feedback and implementations before it can be accepted. I’ve continued
debugging Goguma’s duplicate message bug and I’m pleased to
announce that I’ve almost completely fixed it (I still experience it very
rarely somehow…). delthas has added support for adaptive color schemes (Goguma
now uses your preferred accent color if any). I’ve performed some more boring
maintenance tasks, for instance adding support for newer Android Gradle Plugin
version to webcrypto.dart, one of Goguma’s dependencies.
One last update to wrap up this post: Zhi Qu has added support for the ID
extension to go-imap, which is sadly required to connect to some
servers. That’s all for now, see you next month!
There are times when you feel you're making no progress, and there are other times when things land in quick succession. Luckily this is definitely the latter, with a lot of our long-term efforts finally coming over the finish line. As many of you probably know, our priorities tend to be driven by a combination of what our RHEL Workstation customers need, what our hardware partners are doing, and what is needed for Fedora Workstation to succeed. We also try to be good upstream partners, doing patch review and participating where we can in upstream standards work, especially of course those of concern to our RHEL Workstation and Server users. So when all those things align we are at our most productive, and that seems to be what is happening now. Everything below is a feature in flight that will land in Fedora Workstation 41 at the latest.
Artificial Intelligence
IBM Granite LLM models usher in a new era of open source AI.
One of the areas of great importance to Red Hat currently is enabling our customers and users to take advantage of the advances in artificial intelligence. We do this in a lot of interesting ways, like our recently announced work with IBM to release the high-quality Granite AI models under terms that make them the most open major-vendor AI models according to the Stanford Foundation Model Transparency Index. Not only are we releasing the full LLM source code, we are also creating a project to make modifying and teaching the LLM a lot easier, called InstructLab. InstructLab enables almost anyone to quickly download a Granite LLM model and start teaching it specific things relevant to you or your organization. This puts you in control of the AI and what it knows and can do, as opposed to being relegated to a pure consumer.
And it is not just Granite: we are ensuring other major AI projects work with Fedora too, like Meta’s popular Llama LLM. A big step for that is Tom Rix’s work on bringing AMD accelerated support (ROCm) for PyTorch to Fedora. PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. The long-term goal is that you should be able to just install PyTorch on Fedora and have it work hardware-accelerated with chipsets from any of the 3 major GPU vendors.
NVIDIA in Fedora
The clear market leader at the moment for powering AI workloads is NVIDIA, so I am also happy to tell you about two updates we are working on that will make your life on Fedora better when using NVIDIA GPUs, be that for graphics, compute or artificial intelligence. For the longest time we have had easy install of the NVIDIA driver through GNOME Software in Fedora Workstation; unfortunately this setup never dealt with what is now the default use case, which is a system with secure boot enabled. The driver install was therefore dropped from GNOME Software in our most recent release, as the only way to get it working was through mokutil on the command line, but the UI didn’t tell you that. We of course realize that sending people back to the command line to get this driver installed is highly unfortunate, so Milan Crha has been working together with Allan Day and Jakub Steiner to come up with a streamlined user experience in GNOME Software that lets you install the binary NVIDIA driver and provides an integrated graphical user interface to help you sign the kernel module for use with secure boot. This is a bit different from what we are doing in RHEL, where we work with NVIDIA to provide pre-signed kernel modules; that is a lot harder to do in Fedora due to the rapidly updating kernel versions, which most Fedora users appreciate as a big plus. So instead, what we are opting for in Fedora is, as I said, to make it simple for you to self-sign the kernel module for use with secure boot. We are currently looking at when we can make this feature available, but no later than Fedora Workstation 41 for sure.
Toolbx getting top notch NVIDIA integration
Container Toolbx enables developers quick and easy access to their favorite development platforms
Toolbx, our incredible developer-focused containers tool, is going from strength to strength these days, with the rewrite from the old shell scripts to Go starting to pay dividends. The next major feature we are closing in on is full NVIDIA driver support in Toolbx. As most of you know, Toolbx is our developer container solution which makes it super simple to set up development containers with specific versions of Fedora or RHEL, or many other distributions. Debarshi Ray has been working on implementing support for the official NVIDIA Container Device Interface module, which should enable us to provide full NVIDIA and CUDA support in Toolbx containers. This should provide reliable NVIDIA driver support going forward, and Debarshi is currently testing various AI-related container images to ensure they run smoothly on the new setup.
We are also hoping the packaging fixes to subscription manager will land soon as that will make using RHEL containers on Fedora a lot smoother. While this feature basically already works as outlined here we do hope to make it even more streamlined going forward.
Open Source NVIDIA support
Of course, being Red Hat, we haven’t forgotten about open source here. You probably heard about Nova, our new Rust-based upstream kernel driver for NVIDIA hardware, which will provide optimized support for the hardware supported by NVIDIA’s firmware (basically all newer GPUs), accelerate Vulkan through the NVK driver and provide OpenGL through Zink. That effort is still in its early days, but there are some really cool developments happening around Nova that I am not at liberty to share yet; I hope to be able to talk about those soon.
High Dynamic Range (HDR)
Jonas Ådahl, after completing the remote access work for GNOME under Wayland, has moved his focus to helping land the HDR support in mutter and GNOME Shell. He recently finished rebasing his HDR patches onto a WIP merge request from Georges Stavracas which ported gnome-shell to using paint nodes, so the HDR enablement in mutter and GNOME Shell is now a set of 3 patches:
First the paint nodes port
then the base HDR plumbing patch
then the patch to allow SDR and HDR content to run side by side
With this the work is mostly done; what is left is avoiding overexposure of the cursor and inhibiting direct scanout.
We also hope to help finalize the upstream Wayland specs soon so that everyone can implement this and know the protocols are stable and final.
DRM leasing – VR Headsets
The most common use case for DRM leasing is VR headsets, but it is also a useful feature for things like video walls. José Expósito is working on finalizing a patch for it using the Wayland protocol adopted by KDE and others. We were somewhat hesitant to go down this route, as we felt a portal would have been a better approach, especially as a lot of our experienced X.org developers worry that Wayland is in the process of replicating one of the core issues of X through an unmanageable plethora of Wayland protocols. That said, DRM leasing was not a hill worth dying on; getting this feature out to our users in a way they could quickly use was more critical, so DRM leasing will land soon through this merge request.
Explicit sync
Another effort that we have put a lot of work into, together with our colleagues at NVIDIA, is landing support for what is called explicit sync in the Linux kernel and the graphics drivers. The Linux graphics stack was up to this point using something called implicit sync, but the NVIDIA drivers did not work well with that, and thus people were experiencing ‘blinking’ applications under Wayland. So we worked with NVIDIA and have landed the basic support in the kernel and in GNOME, and thus once the 555 release of the NVIDIA driver is out we hope the ‘blinking’ issues are fully resolved for your display.
There has been some online discussion about potential performance gains from this change too, across all graphics drivers, but the reality is somewhat uncertain: it is still unclear whether there will be real-world measurable gains from adding explicit sync. I have heard knowledgeable people argue both sides, some saying there should be visible performance gains, while others say the potential gains will be so specific that unless you write a benchmark for them explicitly you will not be able to detect a difference. But what is beyond doubt is that this will make using the NVIDIA stack with Wayland a lot better, and that is a worthwhile goal in itself.
The one item we are still working on is integrating PipeWire support for explicit sync into our stack, because without it you might have the same flickering issues with PipeWire streams on top of the NVIDIA driver that you have seen on your screen up to now. For instance, if you are using PipeWire for screen capture, it might look fine on screen with the fixes already merged, but the captured video would have flickering. Wim Taymans landed some initial support in PipeWire already, so now Michel Dänzer is working on implementing the needed bits for PipeWire in mutter. At the same time, Wim is working on ensuring we have a testing client available to verify the compositor support. Once everything has landed in mutter and we have been able to verify that it works with the sample client, we will need to add support to client applications interacting with PipeWire, like Firefox, Chrome, OBS Studio and GNOME Remote Desktop.
In the previous post I gave an introduction to shader linking.
Mike has already blogged about this topic
a while ago, focusing mostly on Zink, and now it’s time for me to share some of my adventures about it too,
but of course focusing on how we improved it in RADV and the various rabbit holes that
this work has led me to.
Motivation for improving shader linking in RADV
In Mesa, we mainly represent shaders in NIR (the NIR intermediate representation) and
that is where link-time optimizations happen.
The big news is that Marek Olšák wrote a new pass called
nir_opt_varyings which is an all-in-one solution to all the optimizations above, and
now authors of various drivers are rushing to take advantage of this new code.
We can’t miss the opportunity to start using nir_opt_varyings in RADV too,
so that’s what I’ve been working on for the past several weeks.
It is intended to replace all of the previous linking passes and can do all of the following:
Remove unused outputs and undefined inputs (including “system values”)
Code motion between stages
Compaction of I/O space
So, I started by adding a call to nir_opt_varyings and went from there.
Enabling the new linking optimization in RADV
The naive reader might think using the new pass is as simple as
going to radv_link_shaders and calling nir_opt_varyings there.
But it can never be that easy, can it?
The issue is that one can’t deal with just shader linking.
We also need to get our hands dirty with all the details of how shader I/O works
at a very low level.
I/O variable dereference vs. lowered explicit I/O
The first problem is that RADV’s current linking solution radv_link_shaders
works with the shaders when their I/O variables are still intact,
meaning that they are still treated as dereferenced variables.
However nir_opt_varyings expects to work with explicit I/O.
In fact, all of RADV’s linking code works based on I/O variables and dereferences,
so much so that it’d be too much to refactor all of that
(and such a refactoring would probably have its own set of problems and rabbit holes).
So the solution here is to add a new linking step that runs after nir_lower_io and
call the new pass in there.
Here is the MR that enables nir_opt_varyings in RADV
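In rough terms, the new step walks adjacent stage pairs after nir_lower_io and calls the pass on each pair. A simplified sketch (the helper is hypothetical and the nir_opt_varyings parameter list is paraphrased from memory, so treat it as an assumption rather than the literal RADV code):
#include "nir.h"

/* Simplified sketch of a linking step that runs after nir_lower_io.
 * The helper name is hypothetical; the nir_opt_varyings parameters
 * shown are an assumption and may not match current Mesa exactly. */
static void
link_varyings_after_lowering(nir_shader **stages, unsigned stage_count)
{
   for (unsigned i = 0; i + 1 < stage_count; i++) {
      nir_shader *producer = stages[i];
      nir_shader *consumer = stages[i + 1];

      /* Dead I/O removal, inter-stage code motion and compaction
       * across this stage boundary. */
      nir_opt_varyings(producer, consumer, true /* spirv */,
                       0 /* max_uniform_components */,
                       0 /* max_ubos_per_stage */);
   }
}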
After writing the above, I quickly discovered that
some tests crash, others fail, and most applications render incorrectly.
So I’ve set out on a journey to solve all that.
The rabbit holes
Shader I/O information collection
Like every driver, RADV needs to collect certain information about every shader in
order to program the GPU’s registers correctly before a draw.
This information includes the number of inputs / outputs (and which slots are used),
in order to determine how much LDS needs to be allocated (in case of tessellation shaders)
or how many FS inputs are needed, etc.
This is done by radv_nir_shader_info_pass which also operated on I/O variables,
rather than information from I/O instructions.
However, after nir_lower_io the explicit I/O instructions may no longer be in
sync with the original I/O variables. This wasn’t a problem before because we
haven’t done any optimizations on explicit I/O so we could rely on I/O variable
information being accurate.
However, in order to use an optimization based on explicit I/O,
the RADV shader info pass had to be refactored
to collect its information from explicit I/O intrinsics, otherwise we wouldn’t be
able to have up-to-date information after running nir_opt_varyings, resulting
in wrong register programming.
Part 1 of refactoring RADV to use IO semantics, covering everything but FS
Part 2 of refactoring RADV to use IO semantics, covering FS and deleting IO variables afterwards
Dealing with driver locations
Driver location assignment is how the driver decides which input and output
goes to which “slot” or “address”. RADV did this after linking,
but still based on I/O variables, so the mechanism needed to be re-thought.
It also came to my attention that
there are some plans to deprecate the concept of driver locations in NIR
in favour of so-called I/O semantics.
So I had to do my refactor with this in mind; I spent some effort on removing
our uses of driver locations in order to make the new code somewhat future-proof.
For most stage combinations, nir_recompute_io_bases can be used as a stopgap to
simply reassign the driver locations based on the assumption that a shader will
only write the outputs that the next stage reads.
However, this is somewhat difficult to achieve for tessellation shaders because of their
unique “brain puzzle” (the TCS can read its own outputs, so the compiler can’t simply
remove TCS outputs when the TES doesn’t read them).
Removed driver locations for trivial cases
Refactored tessellation I/O to be independent of driver locations (based on all the other work listed below on tessellation shader I/O)
Tessellation control shader epilogs
Due to the unique brain puzzle that is I/O in tessellation shaders,
they require extra brain power…
shader linking between TCS and TES was implemented
all over the place; even our backend compiler ACO had some dependence on TCS linking information,
which made any kind of refactor difficult.
At the time, VK_EXT_shader_object was new, and our implementation
used so-called shader epilogs to deal with the dynamic states of the TCS
(including in OpenGL for RadeonSI),
which is what ACO needed the linking information for.
After a discussion with the team, we decided that TCS epilogs had to go;
not only because of my shader linking effort, but also to make the
code base saner and more maintainable.
I changed the code to pass tess factors in registers
Then I entirely deleted TCS epilogs from RADV
Finally I also deleted them from RadeonSI, allowing it to be ultimately removed from ACO as well
This effort made our code lighter by about 1200 LOC.
Tessellation shader I/O
On AMD hardware, TCS outputs are implemented using LDS (when the TCS reads them)
and VRAM (when the TES reads them), which means that the driver has two different
ways to store these variables depending on their use. However, since the code
was based on driver_location and there can only be one driver location,
we used effectively the same location for both LDS and VRAM, which was suboptimal.
With the TCS epilog out of the way, now ac_nir_lower_tess_io_to_mem is free to
choose the LDS layout because the drivers no longer need to generate a TCS epilog
that would need to make assumptions about their memory layout.
Changed the LDS location to be independent of the VRAM location, making it more efficient
Also removed an extra dword of unused LDS when VS outputs don’t need it
Refactored the SGPR arg that contains the dynamic state
With all of that, I found an inefficiency in an ACO optimization, so
that had to be fixed too
Fix of an unintended regression
In the meantime, Samuel has found an opportunity to share some code between RADV and RadeonSI
After all the above was merged and tessellation I/O was independent from driver locations,
I also made another MR to undo a small stats regression by using a more optimal VRAM layout as well.
Fixing another regression
Packed 16-bit I/O
One of the innovations of nir_opt_varyings is that it can pack two 16-bit inputs and
outputs together into a single 32-bit slot in order to save I/O space.
However, unfortunately RADV didn’t really work at all with packed 16-bit I/O.
Practically, this meant that every test case using 16-bit I/O failed.
I considered disabling 16-bit packing in nir_opt_varyings, but eventually
I decided to just implement it properly in RADV instead.
Packed 16-bit mesh shader outputs
Packed 16-bit FS inputs
Packed 16-bit pre-rasterization outputs
Packed 16-bit tessellation and GS I/O
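To make “packing” concrete: two 16-bit values simply share the low and high halves of one 32-bit slot, along these lines (an illustrative sketch, not RADV code):
#include <stdint.h>

/* Two 16-bit varyings sharing one 32-bit slot: one in the low half,
 * one in the high half. Illustrative only. */
static inline uint32_t
pack_2x16(uint16_t lo, uint16_t hi)
{
   return (uint32_t)lo | ((uint32_t)hi << 16);
}

static inline void
unpack_2x16(uint32_t slot, uint16_t *lo, uint16_t *hi)
{
   *lo = slot & 0xffff;
   *hi = slot >> 16;
}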
Repetitiveness in AMD common code
While writing patches to handle packed 16-bit I/O, we’ve taken note of
how repetitive the code was; basically the same thing was implemented several
times with subtle differences. Of course, this had to be dealt with.
Refactored pre-rasterization output info to avoid repetition
Mesh shading per-primitive I/O
Mesh shading pipelines have so-called per-primitive I/O, which need special handling.
For example, it is wrong to pack per-primitive and per-vertex inputs or outputs
into the same slot. Because OpenGL doesn’t have per-primitive I/O, this was
left unsolved and needed to be fixed in nir_opt_varyings before RADV could use it.
nir_recompute_io_bases needed to learn about per-primitive I/O
nir_opt_varyings itself needed to learn per-primitive I/O too
Updating load/store alignments
Using nir_opt_varyings caused a slight regression in shader instruction counts
due to inter-stage code motion.
Essentially, it moved some instructions into the previous stage, which prevented
nir_opt_load_store_vectorize from deducing the alignment of some memory instructions,
resulting in worse vectorization.
The solution was to add a new pass, nir_opt_load_store_update_alignments, based on
the code already written for nir_opt_load_store_vectorize, which just updates the
alignments of each memory access. It runs before nir_opt_varyings, thereby
preserving the alignment info before it is lost.
Added nir_opt_load_store_update_alignments
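Conceptually, NIR tracks the alignment of a memory access as an (align_mul, align_offset) pair, meaning the address is congruent to align_offset modulo align_mul; the new pass records this before the information is lost. A small sketch of how such info composes with a constant offset (illustrative, not the pass itself):
#include <stdint.h>

/* NIR-style alignment info: the access address is known to equal
 * align_offset modulo align_mul (align_mul a power of two). */
struct align_info {
   uint32_t align_mul;
   uint32_t align_offset; /* < align_mul */
};

/* Derive the alignment of an access at base + const_offset. With this
 * preserved, the vectorizer can prove neighboring accesses contiguous. */
static struct align_info
offset_align(struct align_info base, uint32_t const_offset)
{
   struct align_info out;
   out.align_mul = base.align_mul;
   out.align_offset = (base.align_offset + const_offset) % base.align_mul;
   return out;
}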
Promoting FS inputs to FLAT
FLAT fragment shader inputs are generally better because they require no interpolation
and allow packing 32-bit and 16-bit inputs together, so
nir_opt_varyings takes special care trying to promote interpolated inputs to
FLAT when possible.
However, there was a regression caused by this mechanism unintentionally
creating more inputs than there were before.
Here is the fix so that it only promotes to FLAT when appropriate
Dealing with mediump I/O
The mediump qualifier effectively means that the application allows the driver to
use either 16-bit or 32-bit precision for a variable, whichever it
deems more optimal for a specific shader.
It turned out that RADV didn’t deal with mediump I/O at all.
This was fine, because it just meant mediump I/O got treated as normal 32-bit I/O,
but it became a problem when it turned out that nir_opt_varyings is unaware of
mediump and mixed it up with other inputs, which confused the vectorizer.
This MR deals with mediump by lowering it to normal 32-bit I/O
Side note: I am not quite done yet with mediump.
In the future I plan to lower it to 16-bit precision,
but only when I can make sure it doesn’t result in more inputs.
Explicitly interpolated (per-vertex) FS inputs
Vulkan has some functionality that allows applications to do custom FS input interpolation
in shader code, which from the driver’s perspective means that each FS invocation needs to
access the output of each vertex. (This requires special register programming from RADV
on the FS inputs.) This is something that also needed to be added to nir_opt_varyings
before RADV could use it.
Marek wrote some patches to deal with explicit FS inputs
Scalarization and re-vectorization of I/O components
The nir_opt_varyings pass only works on so-called scalarized I/O, meaning that it can
only deal with instructions that write a single output component or read a single
input component. Fortunately, there is already a handy nir_lower_io_to_scalar
pass which we can use.
The downside is that scalarized shader I/O is sub-optimal (on all stages on AMD HW except VS -> PS),
because the I/O instructions are really memory accesses, which are simply
more efficient when more components are accessed by the same instruction.
This is solved in two ways:
Rhys has enhanced the excellent nir_opt_load_store_vectorize pass to better deal with
lowered shader I/O, meaning that it can now better vectorize the memory access instructions
that are generated by scalarized I/O.
Marek has developed a new nir_opt_vectorize_io pass which can re-vectorize the
I/O intrinsics (before they are lowered to memory access).
Shader I/O stats
One of the main questions with any kind of optimization is how to measure the
effects of that optimization in an objective way. This is a solved problem:
we have shader stats for this, which contain information about various aspects
of a shader, such as the number of instructions, register use, etc. Except there were
no stats about I/O, so these needed to be added.
These stats are useful to prove that all of this work actually improved things.
Furthermore, they turned out to be useful for finding bugs in existing code as well.
Added shader stats for inputs and outputs
Fix for a tessellation I/O bug discovered using I/O stats
Removing the old linking passes
Considering that nir_opt_varyings is supposed to be an all-in-one linking solution,
a naive person like me would assume that once the driver had been refactored to use
nir_opt_varyings we could simply stop using all of the old linking passes. But…
It turns out that getting rid of any of the other passes seems to cause regressions in shader stats
such as instruction count (which we don’t want).
Why is this?
Due to the order in which we call various NIR optimizations, it seems that we can’t effectively
take advantage of new optimization opportunities after nir_opt_varyings. This means that
we either have to re-run all expensive optimizations once more after the new linking step,
or we will have to reorder our optimizations to be able to remove the old linking
passes.
Conclusion; future work
While we haven’t yet found the exit from all the rabbit holes we fell into, we made
really good progress and I feel that all of our I/O code ended up better after this effort.
Some work on RADV shader I/O (as of June 2024) remains:
Completely remove the old linking passes
Stop using driver locations (in the few places where they are still used)
Lower mediump to 16-bit when beneficial
Thanks
I owe a big thank you to Marek for developing nir_opt_varyings in the first place
and helping me adopt it every step of the way.
Also thanks to Samuel, Rhys and Georg for the good conversations we had and for reviewing my patches.
In the past few weeks I have been working on, among other things, a kernel driver for the NPU in the Rockchip RK3588 SoC, new from the ground up. It is now fully working, and after a good amount of polishing I sent it yesterday to the kernel mailing lists for review. Those interested can see the code and follow the review process at this link.
The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second. The userspace driver is in a less polished state, but fully featured at this stage. I will be working on it in the next few days so it can be properly submitted for review.
This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.
Yesterday evening we released systemd v256 into the wild. While other projects,
such as Firefox,
are just about to leave the 7bit world and enter 8bit territory, we already
entered 9bit version territory! For details about the release, see our
announcement
mail.
In the weeks leading up to this release I have posted a series of serieses of
posts to Mastodon about key new features in this release. Mastodon has its
goods and its bads. Among the latter is probably that it isn't that great for
posting listings of serieses of posts. Hence let me provide you with a list of
the relevant first post in the series of posts here:
Post #1: .v/ Directories
Post #2: User-Scoped Encrypted Service Credentials
Post #3: X_SYSTEMD_UNIT_ACTIVE= sd_notify() Messages
Post #4: System-wide ProtectSystem=
Post #5: run0 As sudo Replacement
Post #6: System Credentials
Post #7: Unprivileged DDI Mounts + Unprivileged systemd-nspawn
Post #8: ssh into systemd-homed Accounts
Post #9: systemd-vmspawn
Post #10: Mutable systemd-sysext
Post #11: Network Device Ownership
Post #12: systemctl sleep
Post #13: systemd-ssh-generator
Post #14: systemd-cryptenroll without device argument
Post #15: dlopen() ELF Metadata
Post #16: Capsules
I intend to do a similar series of serieses of posts for the next systemd
release (v257), hence if you haven't left tech Twitter for Mastodon yet, now is
the opportunity.
And while I have you: note that the All Systems Go 2024 Conference
(Berlin) Call for Papers ends 😲 THIS WEEK 🤯!
Hence, HURRY, and get your submissions in
now, for the best
low-level Linux userspace conference around!
Shader linking is one of the more complicated topics in graphics driver development.
It is both a never-ending effort in the pursuit of performance and a black hole in which driver developers disappear.
In this post, I intend to give an introduction to what shader linking is
and why it’s worth spending our time working on it in general.
Let’s start with the basics.
Shaders are smallish programs that run on your GPU and are necessary
for any graphical application to draw things on your screen using the GPU.
A typical game can have thousands, or even hundreds of thousands of shaders.
Because every GPU has its own instruction set with major differences, it is generally the
responsibility of the graphics driver to compile shaders in the way that is most optimal
for your GPU in order to make games run fast.
One of the ways of making them faster is linking.
Many times, the driver knows exactly which shaders are going to be used together and
that gives it the opportunity to perform optimizations based on the assumption that
two shaders are only ever used together (and never with other shaders).
In Vulkan, there are now three ways for
an application to create graphics shaders
and all of them can take advantage of linking:
Graphics pipelines, which contain a set of shaders and a bunch of graphics state baked together.
We can always link shaders of a graphics pipeline, because it is impossible to change the shaders
in a pipeline after it was created.
Graphics pipeline libraries, which essentially split a pipeline into some parts.
The API allows applications to create full pipelines by linking pipeline libraries together.
Shader objects, a newer extension that deals with shaders individually,
but still allows the application to ask the driver to link them together.
In Mesa, we mainly represent shaders in NIR (Mesa’s intermediate representation for shaders), and
that is where link-time optimizations happen.
Why is shader linking beneficial to performance?
Shader linking allows the compiler stack to make assumptions about a shader by looking at another
shader that it is used together with.
Let’s take a look at what optimizations are possible.
Deletion of I/O operations
The compiler may look at the outputs of a shader and the inputs of the next stage,
and delete unnecessary I/O.
For example, when you have a pipeline with VS (vertex shader) and FS (fragment shader):
The compiler can safely remove outputs from the VS that are not read by the FS.
As a result, any shader code which was used to calculate the value
of those outputs will now become dead code and may be removed as well.
The compiler can replace FS inputs that aren’t written by the VS
with an “undefined” value (or zero) which algebraic optimizations can take advantage of.
The compiler may detect that some VS outputs are duplicates and may remove the superfluous outputs
from the VS, and also merge their corresponding inputs in the FS.
When a VS output is a constant value, the compiler can delete the VS output and replace
the corresponding FS input load instructions with the constant value. This enables
further algebraic optimizations in the FS.
As a result, both the VS and FS will have fewer IO instructions and more optimal algebraic instructions.
The same ideas are basically applicable to any two shader stages.
This first group of optimizations is the easiest to implement and has been supported
by NIR for a long time: nir_remove_dead_variables, nir_remove_unused_varyings and nir_link_opt_varyings
have all existed for years.
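The core of the dead-output part can be pictured as a simple mask intersection. This is a hypothetical sketch of the idea, not the actual implementation of those passes:
#include <stdint.h>

/* Hypothetical sketch: outputs written by the VS but never read by the
 * FS are dead and can be removed from both stages. */
static uint64_t dead_varyings(uint64_t vs_written_mask, uint64_t fs_read_mask)
{
    return vs_written_mask & ~fs_read_mask;
}
Once the stores to those slots are deleted, regular dead-code elimination removes the computations that fed them.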
Compaction of I/O space
Shader linking also lets the compiler “compact” the
output space by reordering I/O variables in both shaders
so that they use the least amount of space possible.
For example, it may be possible that they have “gaps” between the I/O slots that they use,
and the compiler can then be smart and rearrange the I/O variables so that there are as
few gaps as possible.
As a result, less I/O space will be used. The exact benefit of this optimization depends highly
on the hardware architecture and which stages are involved. But generally speaking, using less
space can mean less memory use (which can translate into better occupancy or higher throughput),
that the next stage is launched faster, that fewer registers are used, or
that fewer fixed-function HW resources are needed, etc.
NIR has also supported I/O compaction in nir_compact_varyings, but its implementation
was far from perfect: its main challenge was handling indirect indexing,
and it lacked support for packing 16-bit varyings into 32-bit slots.
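Conceptually, compaction is just renumbering the used slots so they become contiguous. A hypothetical sketch of the idea (the real pass also has to cope with indirect indexing, which is exactly what made it hard):
#include <stdint.h>

/* Hypothetical sketch: assign new, gap-free locations to used I/O slots.
 * remap[i] is the new location of old slot i, or -1 if the slot is unused. */
static int compact_slots(uint64_t used_mask, int remap[64])
{
    int next = 0;
    for (int i = 0; i < 64; i++)
        remap[i] = (used_mask & (1ull << i)) ? next++ : -1;
    return next; /* number of slots actually needed */
}
Both stages then rewrite their I/O instructions through remap[], so only the returned number of slots is consumed.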
Code motion between shader stages
Also known as inter-stage code motion, this is a complex optimization that has two main goals:
Create more optimization opportunities for doing more of the aforementioned optimizations.
For example, when you have two outputs such that the value of one output can be trivially
calculated from the other, it may be beneficial to just calculate that value in the next stage,
which then enables us to remove the extra output.
Move code to earlier stages, with the assumption that earlier stages have fewer invocations
than later ones, which means the same instructions will need to be executed fewer times,
making the pipeline overall faster. Most of the time, we can safely assume that there
are fewer VS invocations than FS, so it’s overall a good thing to move instructions
from FS to VS. The same is unquestionably beneficial for geometry amplification, such as
moving code from TES to TCS.
This concept is all-new in Mesa and didn’t exist until Marek wrote nir_opt_varyings recently.
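As a toy example of the first goal (my illustration, written as plain C rather than shader code): if a VS outputs both a value and a trivially derived copy of it, the compiler can drop the second output and recompute it in the consumer, trading one varying for one cheap ALU instruction.
/* Before: the VS outputs a value and a trivially derived copy of it. */
static void vs_before(float x, float *out_a, float *out_b)
{
    *out_a = x;
    *out_b = 2.0f * x;
}

/* After inter-stage code motion: only one varying remains, and the
 * derived value is recomputed by the consuming stage. */
static void vs_after(float x, float *out_a)
{
    *out_a = x;
}

static float fs_after(float in_a)
{
    return in_a + 2.0f * in_a; /* formerly in_a + in_b */
}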
Why is all of that necessary?
At this point you might ask yourself the question, why is all of this necessary?
In other words, why do shaders actually need these optimizations?
Why don’t app developers write shaders that are already optimal?
The answer might surprise you.
Many times, the same shader is reused between different pipelines,
in which case the application developer needs to write them in a way in which they
are interchangeable. This is simply a good practice from the perspective of
the application developer, reducing the number of shaders they need to maintain.
Sometimes, applications effectively generate different shaders from the same
source using ifdefs, specialization constants etc.
Even though the same source shader was written to be usable with multiple other shaders,
in each pipeline the driver can deal with it as if it were a different shader,
and in each pipeline the shader will be linked to the other shaders of that specific pipeline.
What’s new in Mesa shader linking?
The big news is that Marek Olšák wrote a new pass called
nir_opt_varyings which is an all-in-one solution to all the optimizations above, and
now authors of various drivers are rushing to take advantage of this new code.
I’ll share my experience of using that in RADV in the next blog post.
This is a long-awaited update to the previous mesh shading related posts.
RDNA3 brings many interesting improvements to the hardware which simplify how
mesh shaders work.
Reminder: main limitation of mesh shading on RDNA2
RDNA2 already supported mesh and task shaders, but mesh shaders had a big caveat
regarding how outputs work: each shader invocation could only really write
up to 1 vertex and 1 primitive, which meant that the shader compiler had
to work around that to implement the programming model of the mesh shading API.
On RDNA2 the shader compiler had to:
Make sure to launch (potentially) more invocations than the API shader needed,
to accommodate a larger number of output vertices / primitives.
Save all outputs to memory (shared memory or VRAM) and reload them at the end,
unless they were only indexed by the local invocation index.
Shader output changes on RDNA3
RDNA3 changes how shader outputs work on pre-rasterization stages (including VS, TES, GS, MS).
Attribute ring
Previous architectures had a special buffer called parameter cache where
the pre-rasterization stage stored positions and
generic output attributes for fragment shaders (pixel shaders) to read.
The parameter cache was removed from RDNA3 in favour of the attribute ring
which is basically a buffer in VRAM. Shaders must now store their outputs to this
buffer and after rasterization, the HW reads the attributes from the attribute ring
and stores them to the LDS space of fragment shaders.
When I first heard about the attribute ring I didn’t understand how this is an improvement
over the previous design (VRAM bandwidth is considered a bottleneck in many cases),
but then I realized that this is meant to work together
with the Infinity Cache that these new chips have. In the ideal access pattern,
each attribute store would overwrite a full cache line so the shader won’t actually touch VRAM.
For mesh shaders, this has two consequences:
Any invocation can now truly write generic attributes of any other invocation
without restrictions, because these are just a memory write.
The shader compiler now has to worry about memory access patterns.
RADV already supports the attribute ring for VS, TES and GS so we have some experience with
how it works and only needed to apply that to mesh shaders.
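To illustrate the addressing (the layout here is made up for the example; the real attribute ring encoding is more involved), each vertex’s generic attributes live at a computed offset in a VRAM ring buffer:
#include <stdint.h>

struct vec4 { float x, y, z, w; };

/* Illustrative only: storing one generic attribute to the attribute ring. */
static void store_attribute(struct vec4 *ring_base, uint32_t vertex_index,
                            uint32_t attrs_per_vertex, uint32_t attr_index,
                            struct vec4 value)
{
    /* Any invocation can compute this address for any vertex, which is
     * why cross-invocation attribute writes are no longer a problem.
     * Ideally, consecutive invocations write consecutive addresses, so
     * stores fill whole cache lines and stay in the Infinity Cache. */
    ring_base[vertex_index * attrs_per_vertex + attr_index] = value;
}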
Row exports
For non-generic output attributes (such as position, clip/cull distances, etc.) we
still need to use exp instructions just like the old hardware. However, these now
have a new mode called row export which allows each lane to write not only its own
outputs but also others in the same row.
Basic RDNA3 mesh shading: legacy fast launch mode
The legacy fast launch mode is essentially the same thing as RDNA2 had, so in this mode
mesh shaders can be compiled with the same structure and the compiler only needs to be
adjusted to use the attribute ring.
The drawback of this mode is that it still has the same issue with workgroup size
as RDNA2 had. So this is just useful for helping driver developers port their code
to the new architecture but it doesn’t allow us to fully utilize the new capabilities
of the hardware.
The initial MS implementation in RADV used this mode.
New fast launch mode
In this mode, the number of HW shader invocations is determined similarly to how
a compute shader would work, and there is no need to match the number of vertices
and primitives in this mode.
Thanks to Rhys for working on this and enabling the new mode on RDNA3.
Based on the information we can glean from the open source progress
happening thus far (in particular, the published register files),
we think RDNA4 will only support this new mode.
What took you so long?
I’ve wanted to write about this for some time, but somehow forgot that I have a blog… Sorry!
References
As always, what I discuss here is based on open source driver code including
mesa (RadeonSI and RADV) and AMD’s reference driver code.
RadeonSI and RADV already had code and comments that explain the attribute ring.
RDNA3 shader ISA and LLPC mesh/task code
do a good job at explaining row exports and hint at the new fast launch mode.
PAL’s reference implementation explains how the draw packets work.
Back in the day when presumably at least someone was young, the venerable xsetwacom tool was commonly used to configure wacom tablet devices on Xorg [1]. This tool is going dodo in Wayland because, well, a tool that is specific to an X input driver kinda stops working when said X input driver is no longer being used. Such is technology, let's go back to sheep farming.
There's nothing hugely special about xsetwacom, it's effectively identical to the xinput commandline tool except for the CLI that guides you towards the various wacom driver-specific properties and knows the right magic values to set. Like xinput, xsetwacom has one big peculiarity: it is a fire-and-forget tool and nothing is persistent - unplugging the device or logging out would make the current values vanish without so much as a "poof" noise [2].
It also somewhat clashes with GNOME (or any DE, really). GNOME configuration works so that GNOME Settings (gnome-control-center) and GNOME Tweaks write the various values to the gsettings. mutter [3] picks up changes to those values and in response toggles the X driver properties (or in Wayland the libinput context). xsetwacom short-cuts that process by writing directly to the driver, but properties are "last one wins", so there were plenty of use-cases over the years where changes made by xsetwacom were overwritten.
Anyway, there are plenty of use-cases where xsetwacom is actually quite useful, in particular where tablet behaviour needs to be scripted, e.g. switching between pressure curves at the press of a button or key. But xsetwacom cannot work under Wayland because a) the xf86-input-wacom driver is no longer in use, b) only the compositor (i.e. mutter) has access to the libinput context (and some behaviours are now implemented in the compositor anyway) and c) we're constantly trying to think of new ways to make life worse for angry commenters on the internets. So if xsetwacom cannot work, what can we do?
Well, most configurations possible with xsetwacom are actually available in GNOME. So let's make those available to a commandline utility! And voila, I present to you gsetwacom, a commandline utility to toggle the various tablet settings under GNOME:
$ gsetwacom list-devices
devices:
- name: "HUION Huion Tablet_H641P Pen"
usbid: "256C:0066"
- name: "Wacom Intuos Pro M Pen"
usbid: "056A:0357"
$ gsetwacom tablet "056A:0357" set-left-handed true
$ gsetwacom tablet "056A:0357" set-button-action A keybinding "<Control><Alt>t"
$ gsetwacom tablet "056A:0357" map-to-monitor --connector DP-1
Just like xsetwacom was effectively identical to xinput but with a domain-specific CLI, gsetwacom is effectively identical to the gsettings tool but with a domain-specific CLI. gsetwacom is not intended to be a drop-in replacement for xsetwacom, the CLI is very different. That's mostly on purpose because I don't want to have to chase bug-for-bug compatibility for something that is very different after all.
I almost spent more time writing this blog post than on the implementation so it's still a bit rough. Also, (partially) due to how relocatable schemas work, error checking is virtually nonexistent - if you want to configure Button 16 on your 2-button tablet device you can do that. Just don't expect 14 new buttons to magically sprout from your tablet. This could all be worked around with e.g. libwacom integration but right now I'm too lazy for that [4]
Oh, and because gsetwacom writes the gsettings configuration it is persistent, GNOME Settings will pick up those values and they'll be re-applied by mutter after unplug. And because mutter-on-Xorg still works, gsetwacom will work the same under Xorg. It'll also work under the GNOME derivatives as long as they use the same gsettings schemas and keys.
L'utilitaire est mort, vive l'utilitaire!
[1] The git log claims libwacom was originally written in 2009. By me. That was a surprise...
[2] Though if you have the same speakers as I do you at least get a loud "pop" sound whenever you log in/out and the speaker gets woken up
[3] It used to be gnome-settings-daemon but with mutter now controlling the libinput context this all moved to mutter
[4] Especially because I don't want to write Python bindings for libwacom right now
Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is
the first conformant
Vulkan® for Apple hardware on any operating system, implementing the
full 1.3 spec without “portability” waivers.
Honeykrisp is not yet released for end users. We’re
continuing to add features, improve performance, and port to more
hardware. Source
code is available for developers.
HoloCure running on
Honeykrisp ft. DXVK, FEX, and Proton.
Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s
open source NVK
driver for NVIDIA GPUs. In her words:
All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan
driver and started by copying+pasting from it. My hope is that NVK will
eventually become the driver that everyone copies and pastes from. To
that end, I’m building NVK with all the best practices we’ve developed
for Vulkan drivers over the last 7.5 years and trying to keep the
code-base clean and well-organized.
Why spend years implementing features from scratch when we can reuse
NVK? There will be friction starting out, given NVIDIA’s desktop
architecture differs from the M1’s mobile roots. In exchange, we get a
modern driver designed for desktop games.
We’ll need to pass a half-million tests ensuring correctness, submit the
results, and then we’ll become conformant after 30 days of industry
review. Starting from NVK and our OpenGL 4.6 driver… can we write a
driver passing the Vulkan 1.3 conformance test suite faster
than the 30 day review period?
It’s unprecedented…
Challenge accepted.
April 2
It begins with a text.
Faith… I think I want to write a Vulkan driver.
Her advice?
Just start typing.
There’s no copy-pasting yet – we just add M1 code to NVK and remove
NVIDIA as we go. Since the kernel mediates our access to the hardware,
we begin connecting “NVK” to Asahi
Lina’s kernel driver using code shared with OpenGL. Then we plug in
our shader compiler and hit the hay.
April 3
To access resources, GPUs use “descriptors” containing the address,
format, and size of a resource. Vulkan bundles descriptors into “sets”
per the application’s “descriptor set layout”. When compiling shaders,
the driver lowers descriptor accesses to marry the set layout with the
hardware’s data structures. As our descriptors differ from NVIDIA’s, our
next task is adapting NVK’s descriptor set lowering. We start with a
simple but correct approach, deleting far more code than we add.
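For a sense of what such a descriptor might contain, here is a hypothetical layout for illustration only; the real M1 and NVIDIA formats both differ from this, and from each other, which is exactly why the lowering had to be adapted.
#include <stdint.h>

/* Hypothetical texture descriptor: enough for the hardware to locate
 * and interpret the resource. */
struct texture_descriptor {
    uint64_t address;        /* GPU virtual address of the image data */
    uint32_t format;         /* pixel format enum */
    uint16_t width, height;  /* dimensions */
};
A descriptor set is then little more than an array of such records, and the compiler's job is to turn each descriptor access in the shader into the right load from that array.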
April 4
With working descriptors, we can compile compute shaders. Now we
program the fixed-function hardware to dispatch compute. We first add
bookkeeping to map Vulkan command buffers to lists of M1 “control
streams”, then we generate a compute control stream. We copy that code
from our OpenGL driver, translate the GL into Vulkan, and compute
works.
That’s enough to move on to “copies” of buffers and images. We
implement Vulkan’s copies with compute shaders, internally dispatched
with Vulkan commands as if we were the application. The first copy test
passes.
April 5
Fleshing out yesterday’s code, all copy tests pass.
April 6
We’re ready to tackle graphics. The novelty is handling graphics
state like depth/stencil. That’s straightforward, but there’s a
lot of state to handle. Faith’s code collects all “dynamic
state” into a single structure, which we translate into hardware control
words. As usual, we grab that translation from our OpenGL driver, blend
with NVK, and move on.
April 7
What makes state “dynamic”? Dynamic state can change without
recompiling shaders. By contrast, static state is baked into shader
binaries called “pipelines”. If games create all their pipelines during
a loading screen, there is no compiler “stutter” during gameplay. The
idea hasn’t quite panned out: many game developers don’t know their
state ahead-of-time so cannot create pipelines early. In response,
Vulkan has made
ever
more
state
dynamic,
punctuated with the EXT_shader_object
extension that makes pipelines optional.
We want full dynamic state and shader objects. Unfortunately, the M1
bakes random state into shaders: vertex attributes, fragment outputs,
blending, even linked interpolation qualifiers. Like most of the
industry in the 2010s, the M1’s designers bet on pipelines.
Faced with this hardware, a reasonable driver developer would
double-down on pipelines. DXVK would stutter, but we’d pass
conformance.
I am not reasonable.
To eliminate stuttering in OpenGL, we make state dynamic with four
strategies:
Conditional code.
Precompiled variants.
Indirection.
Prologs and epilogs.
Wait, what-a-logs?
AMD also bakes state into shaders… with a twist. They divide the
hardware binary into three parts: a prolog, the shader, and an
epilog. Confining dynamic state to the periphery eliminates
shader variants. They compile prologs and epilogs on the fly, but that’s
fast and doesn’t stutter. Linking shader parts is a quick concatenation,
or long jumps avoid linking altogether. This strategy works for the M1,
too.
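A sketch of linking by concatenation (illustrative; a real driver also has to patch registers and honour alignment):
#include <stdint.h>
#include <string.h>

/* The main shader binary is compiled once; small prologs and epilogs
 * are compiled per dynamic state and glued onto it. */
static void link_shader(uint8_t *dst,
                        const uint8_t *prolog, size_t prolog_size,
                        const uint8_t *body, size_t body_size,
                        const uint8_t *epilog, size_t epilog_size)
{
    memcpy(dst, prolog, prolog_size);
    memcpy(dst + prolog_size, body, body_size);
    memcpy(dst + prolog_size + body_size, epilog, epilog_size);
}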
For Honeykrisp, let’s follow NVK’s lead and treat all state
as dynamic. No other Vulkan driver has implemented full dynamic state
and shader objects this early on, but it avoids refactoring later. Today
we add the code to build, compile, and cache prologs and epilogs.
Putting it together, we get a (dynamic) triangle:
April 8
Guided by the list of failing tests, we wire up the little bits
missed along the way, like translating border colours.
/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
switch (color) {
case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
...
}
}
Test results are getting there.
Pass: 149770, Fail: 7741,
Crash: 2396
That’s good enough for vkQuake.
April 9
Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1.
Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to
the finish line.
Pass: 255209, Fail: 3818,
Crash: 599
98.3% pass rate for 1.3 on our 1 week anniversary.
Not bad.
April 10
SuperTuxKart has a Vulkan renderer.
April 11
Zink works
too.
April 12
I tracked down some fails to a test bug, where an arbitrary
verification threshold was too strict to pass on some devices. I filed a
bug report, and it’s resolved
within a few weeks.
April 16
The tests for “descriptor indexing” revealed a compiler bug affecting
subgroup shuffles in non-uniform control flow. The M1’s shuffle
instruction is quirky, but it’s easy to work around. Fixing that fixes
the descriptor indexing tests.
April 17
A few tests crash inside our register allocator. Their shaders
contain a peculiar construction:
if (condition) {
while (true) { }
}
condition is always false, but the compiler doesn’t know
that.
Infinite loops are nominally invalid since shaders must terminate in
finite time, but this shader is syntactically valid. “All loops contain
a break” seems obvious for a shader, but it’s false. It’s
straightforward to fix register allocation, but what a doozy.
April 18
Remember copies? They’re slow, and every frame currently requires a
copy to get on screen.
For “zero copy” rendering, we need enough Linux window system
integration to negotiate an efficient surface layout across process
boundaries. Linux uses “modifiers” for this purpose, so we implement the
EXT_image_drm_format_modifier
extension. And by implement, I mean copy.
Copies to avoid copies.
April 20
“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux
Vulkan Mac.”
…
“Ma’am, this is a Wendy’s.”
April 22
As bug fixing slows down, we step back and check our driver
architecture. Since we treat all state as dynamic, we don’t pre-pack
control words during pipeline creation. That adds theoretical CPU
overhead.
Is that a problem? After some optimization, vkoverhead says we’re
pushing 100 million draws per second.
I think we’re okay.
April 24
Time to light up YCbCr. If we don’t use special YCbCr hardware, this
feature is “software-only”. However, it touches a lot of
code.
It touches so much code that Mohamed
Ahmed spent an entire summer adding it to NVK.
Which means he spent a summer adding it to Honeykrisp.
Thanks, Mohamed ;-)
April 25
Query copies are next. In Vulkan, the application can query the
number of samples rendered, writing the result into an opaque “query
pool”. The result can be copied from the query pool on the CPU or
GPU.
For the CPU, the driver maps the pool’s internal data structure and
copies the result. This may require nontrivial repacking.
For the GPU, we need to repack in a compute shader. That’s harder,
because we can’t just run C code on the GPU, right?
…Actually, we can.
A little witchcraft makes GPU query copies as easy as C.
void copy_query(struct params *p, int i) {
uintptr_t dst = p->dest + i * p->stride;
int query = p->first + i;
if (p->available[query] || p->partial) {
int q = p->index[query];
write_result(dst, p->_64, p->results[q]);
}
...
}
April 26
The final boss: border colours, hard mode.
Direct3D lets the application choose an arbitrary border colour when
creating a sampler. By contrast, Vulkan only requires three border
colours:
(0, 0, 0, 0) – transparent black
(0, 0, 0, 1) – opaque black
(1, 1, 1, 1) – opaque white
We handled these on April 8. Unfortunately, there are two
problems.
First, we need custom border colours for Direct3D compatibility. Both
DXVK and vkd3d-proton
require the EXT_custom_border_color
extension.
Second, there’s a subtle problem with our hardware, causing dozens of
fails even without custom border colours. To understand the issue, let’s
revisit texture descriptors, which contain a pixel format and a
component reordering swizzle.
Some formats are implicitly reordered. Common “BGRA” formats swap red
and blue for historical
reasons. The M1 does not directly support these formats. Instead,
the driver composes the swizzle with the format’s reordering. If the
application uses a BARB swizzle with a BGRA
format, the driver uses an RABR swizzle with an
RGBA format.
There’s a catch: swizzles apply to the border colour, but formats do
not. We need to undo the format reordering when programming the
border colour for correct results after the hardware applies the
composed swizzle. Our OpenGL driver implements border colours this way,
because it knows the texture format when creating the sampler.
Unfortunately, Vulkan doesn’t give us that information.
Without custom border colour support, we “should” be okay. Swapping
red and blue doesn’t change anything if the colour is white or
black.
There’s an even subtler catch. Vulkan mandates support for a
packed 16-bit format with 4-bit components. The M1 supports a similar
format… but with reversed “endianness”, swapping red and
alpha.
That still seems okay. For transparent black (all zero) and opaque
white (all one), swapping components doesn’t change the result.
The problem is opaque black: (0, 0,
0, 1). Swapping red and alpha gives
(1, 0, 0, 0). Transparent red?
Uh-oh.
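To make the failure concrete, here is a small sketch (illustrative, not driver code) of a swizzle that swaps red and alpha, applied to the three required border colours:
/* Component indices into an RGBA colour. */
enum { R = 0, G = 1, B = 2, A = 3 };

/* Apply a component swizzle to a colour. */
static void apply_swizzle(const int swz[4], const float in[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = in[swz[i]];
}

/* With swz = {A, G, B, R} (red and alpha swapped):
 *   transparent black (0,0,0,0) -> (0,0,0,0)  fine
 *   opaque white      (1,1,1,1) -> (1,1,1,1)  fine
 *   opaque black      (0,0,0,1) -> (1,0,0,0)  transparent red! */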
We’re stuck. No known hardware configuration implements correct
Vulkan semantics.
Is hope lost?
Do we give up?
A reasonable person would.
I am not reasonable.
Let’s jump into the deep end. If we implement custom border colours,
opaque black becomes a special case. But how? The M1’s custom border
colours entangle the texture format with the sampler. A reasonable
person would skip Direct3D support.
As you know, I am not reasonable.
Although the hardware is unsuitable, we control software. Whenever a
shader samples a texture, we’ll inject code to fix up the border colour.
This emulation is simple, correct, and slow. We’ll use dirty driver
tricks to speed it up later. For now, we eat the cost, advertise full
custom border colours, and pass the opaque black tests.
April 27
All that’s left is some last minute bug fixing, and…
Pass: 686930, Fail: 0
Success.
The future
The next task is implementing everything that DXVK
and vkd3d-proton
require to layer Direct3D. That includes esoteric extensions like transform
feedback. Then Wine and an open source x86 emulator will
run Windows games on Asahi
Linux.
That’s getting ahead of ourselves. In the mean time, enjoy Linux
games with our conformant
OpenGL 4.6 drivers… and stay tuned.
Baby
Storm running on Honeykrisp ft. DXVK, FEX, and Proton.
Introduction
Recently, the Linux Mint Blog published Monthly News – April 2024, which goes into detail about wanting to fork and maintain older GNOME apps in collaboration with other GTK-based desktop environments. Despite the good intentions of the author, Clem, many readers interpreted this as an attack against GNOME. Specifically: GTK, libadwaita, the relationship between them, and their relevance to any desktop environment or desktop operating system. Unfortunately, many of these readers seem to have a lot of difficulty understanding what GTK is trying to be, and how libadwaita helps.
In this article, we’ll look at the history of why and how libadwaita was born, the differences between GTK 4 and libadwaita in terms of scope of support, their relevance to each desktop environment and desktop operating system, and the state of GTK 4 today.
What Is GTK?
First of all, what is GTK? GTK is a cross-platform widget toolkit from the GNOME Project, which means it provides interactive elements that developers can use to build their apps. The latest major release of GTK is 4, which brings performance improvements over GTK 3. GTK 4 also removes several widgets that were part of the GNOME design language, which became a controversy.
In the context of application design, a design language is the visual characteristics that are communicated to the user. Fonts, colors, shapes, forms, layouts, writing styles, spacing, etc. are all elements of the design language. (Source)
Unnecessary Unification of the Toolkit and Design Language
In general, cross-platform toolkits tend to provide general-purpose/standard widgets, typically with non-opinionated styling, i.e. widgets and design patterns that are used consistently across different operating systems (OSes) and desktop environments. However, GTK had the unique case of bundling GNOME’s design language into GTK, which made it far from generic, leading to problems on two fronts: philosophical and technical.
Clash of Philosophies
When we look at apps made for the GNOME desktop (referred to below as “GNOME apps”) as opposed to non-GNOME apps, we notice that they’re distinctive: GNOME apps tend to have hamburger buttons, header bars, larger buttons, larger padding and margins, etc., while most non-GNOME apps tend to be more compact, use menu bars, standard title bars, and many other design metaphors that may not be used in GNOME apps. This is because, from a design philosophy standpoint, GNOME’s design patterns tend to go in a different direction than most apps.
As a brand and product, GNOME has a design language it adheres to, which is accompanied by the GNOME Human Interface Guidelines (HIG). As a result, GTK and GNOME’s design language clashed together. Instead of being as general-purpose as possible, GTK as a cross-platform toolkit contained an entire design language intended to be used only by a specific desktop, thus defeating the purpose of a cross-platform toolkit. For more information on GNOME’s design philosophy, see “What is GNOME’s Philosophy?”.
Inefficient Diversion of Resources
The unnecessary unification of the toolkit and design language also divided a significant amount of effort and maintenance: instead of focusing solely on the general-purpose widgets that could be used across all desktop OSes and environments, much of the focus was on the widgets that were intended to conform to the GNOME HIG.
Many of the general-purpose widgets also included features and functionality that were only relevant to the GNOME desktop, making them less general-purpose. Thus, the general-purpose widgets were being implemented and improved slowly, and the large codebase also made the GNOME widgets and design language difficult to maintain, change, and adapt. In other words, almost everything was hindered by the lack of independence on both sides.
Libhandy: the Predecessor
Because of the technical bottlenecks caused by the philosophical decisions, libhandy was created in 2017, with the first experimental version released in 2018. As described on the website, libhandy is a collection of “[b]uilding blocks for modern adaptive GNOME applications.” In other words, libhandy provides additional widgets that can be used by GNOME apps, especially those that use GTK 3. For example, Boxes uses libhandy, and many GNOME apps that used to use GTK 3 also used libhandy.
However, some of the problems remained: since libhandy was relatively new at the time, most GNOME widgets were still part of GTK 3, which continued to suffer from the consequences of merging the toolkit and design language. Furthermore, GTK 4 was released at the end of December 2020 — after libhandy. Since libhandy was created before the initial release of GTK 4, it made little sense to fully address these issues in GTK 3, especially when doing so would have caused major breakages and inconveniences for GTK, libhandy, and app developers. As such, it wasn’t worth the effort.
With these issues in mind, the best course of action was to introduce all these major changes and breakages in GTK 4, use libhandy as an experiment and to gain experience, and properly address these issues in a successor.
Libadwaita: the Successor
Because of all the above problems, libadwaita was created: libhandy’s successor that accompanies GTK 4. GTK 4 was initially released in December 2020, and libadwaita was released one year later, in December 2021. With the experience gained from libhandy, libadwaita managed to become extensible and easy to maintain.
Libadwaita is a platform library accompanying GTK 4. A platform library is a library used to complement a specific platform. In the case of libadwaita, the platform it targets is the GNOME desktop.
Porting Widgets to Libadwaita
Some GNOME widgets from GTK 3 (or earlier versions of GTK 4) were removed or deprecated in GTK 4 and were reimplemented in / transferred to libadwaita, for example:
GtkDialog → AdwDialog [1]
GtkInfoBar → AdwBanner
These widgets only benefited GNOME apps, as they were strictly designed to provide widgets that conformed to the GNOME HIG. Non-GNOME apps usually didn’t use these widgets, so they were practically irrelevant to everyone else.
In addition, libadwaita introduced several widgets as counterparts to GTK 4 to comply with the HIG:
GtkHeaderBar → AdwHeaderBar
GtkAlertDialog → AdwAlertDialog
GtkAboutDialog → AdwAboutDialog
Similarly, these GTK 4 widgets (the ones starting with Gtk) are not designed to comply with the GNOME HIG. Since GTK 4 widgets are supposed to be general-purpose, they should not be platform-specific; the HIG no longer has any influence on GTK, only on the development of libadwaita.
Scope of Support
The main difference between GTK 4 and libadwaita is the scope of support, specifically the priorities in terms of the GNOME desktop, and desktop environment and OS support.
While most resources are dedicated to GNOME desktop integration, GTK 4 is not nearly as focused on the GNOME desktop as libadwaita. GTK 4, while opinionated, still tries to get closer to the traditional desktop metaphor by providing these general-purpose widgets, while libadwaita provides custom widgets to conform to the GNOME HIG.
Since libadwaita is only made for the GNOME desktop, and the GNOME desktop is primarily officially supported on Linux, libadwaita thus primarily supports Linux. In contrast, GTK is officially supported on all major operating systems (Windows, macOS, Linux). However, since GTK 4 is mostly developed by GNOME developers, it works best on Linux and GNOME — hence “opinionated”.
State of GTK 4 Today
Thanks to the removal of GNOME widgets from GTK 4, GTK developers can continue to work on general-purpose widgets, without being influenced or restricted in any way by the GNOME HIG. Developers of cross-platform GTK 3 apps that rely exclusively on general-purpose widgets can be more confident that GTK 4 won’t remove these widgets, and hopefully enjoy the benefits that GTK 4 offers.
At the time of writing, there are several cross-platform apps that have either successfully ported to GTK 4, or are currently in the process of doing so. To name a few: Freeciv gtk4 client, HandBrake, Inkscape, Transmission, and PulseAudio Volume Control. The LibreOffice developers are working on the GTK 4 port, with the gtk4 VCL plugin option enabled. For example, the libreoffice-fresh package from Arch Linux has it enabled.
Here are screenshots of the aforementioned apps:
Freeciv gtk4 client in the game view, displaying a title bar, a custom background, a menu bar, a tab view with the Chat tab selected, an entry, and a few buttons.
HandBrake in the main view, displaying a title bar, a menu bar, a horizontal toolbar below it with custom buttons, entries, popover buttons, a tab view with the Summary tab selected, containing a popover button and several checkboxes.
Development version of Inkscape in the main view, displaying a title bar, a menu bar, a horizontal toolbar below, vertical toolbars on the left and right, a canvas grid on the center left, a tab view on the center right with the Display Properties tab selected, and a toolbar at the bottom.
LibreOffice Writer with the experimental gtk4 VCL plugin in workspace view, displaying a title bar, a menu bar, two horizontal toolbars below, a vertical toolbar on the right, a workspace grid in the middle with selected text, and a status bar at the bottom.
Transmission in the main view, displaying a title bar, a menu bar, a horizontal toolbar, a filter bar, an empty field in the center of the view, and a status bar at the bottom.
PulseAudio Volume Control in the Output Devices view, displaying a title bar, a tab section, a list of output devices, and a bottom bar with a combo box.
A GNOME App Remains a GNOME App, Unless Otherwise Stated
This is a counter-response to Thom Holwerda’s response to this article.
An app targeting a specific platform will typically run best on that platform and will naturally struggle to integrate with other platforms. Whether the libraries change over time or stay the same forever, if the developers are invested in the platform they are targeting, the app will follow the direction of the platform and continue to struggle to integrate with other platforms. At best, it will integrate with other platforms by accident.
In this case, developers who have and will continue to target the GNOME desktop will actively adapt their apps to follow the GNOME philosophy, for better or worse. Hamburger buttons, header bars, typography, and distinct design patterns were already present a decade ago (2014). (Source) Since other platforms were (and still are) adhering to different design languages, with or without libhandy/libadwaita, the GTK 3 apps targeting GNOME were already distinguishable a decade ago. Custom solutions such as theming were (and still are) inadequate, as there was (and still is) no 🪄 magical 🪄 solution that converts GNOME’s design patterns into their platform-agnostic counterparts.
Whether the design language is part of the toolkit or a separate library has no effect on integration, because GNOME apps already looked really different long before libhandy was created, and non-GNOME apps already looked “out of place” in GNOME as well. Apps targeting a specific platform that unintentionally integrate with other platforms will eventually stop integrating with other platforms as the target platform progresses and apps adapt. In rare cases, developers may decide to no longer adhere to the GNOME HIG.
Alternate Platforms
While libadwaita is the most popular and widely used platform library that accompanies GTK 4, there are several alternatives to libadwaita:
Granite is developed and maintained by elementary, Inc., and focuses on elementary OS and the Pantheon desktop. Apps that use Granite can be found in the elementary AppCenter.
Libhelium is developed and maintained by Fyra Labs, and focuses on tauOS. Apps using libhelium can be found in the “libhelium” topics on GitHub.
There are also several alternatives to libhandy:
libxapp is developed and maintained by Linux Mint, and focuses on multiple GTK desktop environments, such as Cinnamon, MATE, and Xfce.
libxfce4ui is developed and maintained by Xfce, and focuses on Xfce.
Just like libadwaita and libhandy, these platform libraries offer custom widgets and styling that differ from GTK and are built for their respective platforms, so it’s important to realize that GTK is meant to be built with a complementary platform library that extends its functionality when targeting a specific platform. Similarly, Kirigami from KDE accompanies Qt to build Plasma apps. MauiKit from the Maui Project (another KDE project) also accompanies Qt, but targets Nitrux. Libcosmic by System76 accompanies iced to build COSMIC apps.
Conclusion
A cross-platform toolkit should primarily provide general-purpose widgets. Third parties should be able to extend the toolkit as they see fit through a platform library if they want to target a specific platform.
As we’ve seen throughout the philosophical and technical issues with GTK, a lot of effort has gone into moving GNOME widgets from GTK 4 to libadwaita. GTK 4 will continue to provide these general-purpose widgets for apps intended to run on any desktop or OS, while platform libraries such as libadwaita, Granite and libhelium provide styling and custom widgets that respect their respective platforms. Libadwaita is targeted exclusively at the GNOME ecosystem, courtesy of the GNOME HIG. Apps built with libadwaita are intended to run best on GNOME, while GTK 4 apps that don’t come with a platform library are intended to run everywhere.
[1] The core functionality of GtkDialog, i.e. creating dialogs, has been moved to GtkWindow.
Hi!
Sadly, I need to start this status update with bad news: SourceHut has
decided to terminate my contract. At this time, I’m still in the process of
figuring out what I’ll do next. I’ve marked some SourceHut-specific projects as
unmaintained, such as sr.ht-container-compose (feel free to fork of course).
I’ve handed over hut maintenance to xenrox, and I’ve started migrating a few
projects to other forges (more to follow). I will continue to maintain
projects that I still use such as soju to the extent that my free time allows.
On a more positive note, this month Igalia’s display next hackfest took
place. Although I couldn’t attend in person, it was great to discuss focused
topics in real time with other engineers in the community. We discussed
color management, HDR, adaptive sync, testing, real-time scheduling,
power usage implications of the color pipeline, improved uAPI to handle KMS
atomic commit failures, hardware plane offloading, display muxes, backlight,
scaling and sharpening filters… And I probably missed a few other things.
We’ve released wlroots 0.17.3 with a bunch of new bug fixes (thanks to Simon
Zeni). The patches to add support for ICC profiles
from M. Stoeckl have been merged. I’ve continued working on the new
ext-screencopy-v1 protocol but there are a few remaining issues to address
before this is ready.
The display hackfest has motivated me to work on libliftoff. Apart from a few
bug fixes, a new API to set a timeout for the libliftoff algorithm has been
added, and some optimizations are about to get merged (one thanks to Leo Li).
The Wayland release cycle has started; we’ve merged patches to generate
validators for enum values and added a new deprecated-since XML attribute to
mark a request, event or enum as deprecated. Thanks to Ferdinand Bachmann,
kanshi has gained output defaults and aliases (useful for sharing output
configurations across multiple profiles). mako 1.9 has been released with
a new flag to toggle modes, another new flag to bypass history when dismissing
a notification, and support for compositor-side cursor images.
In IRC news, goguma now uses Material 3 (please report any regression), has
gained support for messages only visible to channel operators (STATUSMSG),
and I’ve spent a fair bit of time investigating the infamous duplicate message
bug. I have a better understanding of the issue now, but still need a bit more
time to come up with a proper fix.
Thanks to old patches sent by sitting33 that I took way too long to review,
gamja now only marks messages as read when it’s focused, shows the number of
unread highlights in the tab title, and hides the internal WHO reply chatter
from the user.
Last, I’ve released go-imap 2.0.0 beta 3 with a whole bunch of bug fixes.
Ksenia Roshchina has contributed a client implementation of the ACL IMAP
extension.
That’s all for now, see you next month!
Some weeks ago I attended the Embedded Open Source Summit for the first time. Igalia had a booth that allowed us to showcase the work that we have been doing during the past years. Several Igalians also gave talks there.
I gave a talk titled “Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU”, where I provided an introduction to Igalia’s contributions to maintaining the OpenGL/Vulkan stack for the Raspberry Pi, focusing on the challenges of implementing Mesa support for the Raspberry Pi 5, the latest device in that series, which was released in October 2023.
If you are interested, the video and slides of my presentation are now available:
https://static.sched.com/hosted_files/eoss24/78/2024-04-eoss-apinheiro-rpi5.pdf
And as a bonus, you can see here a video showing the RPI5 running some Unreal Engine 4 Demos, and other applications:
TLDR: Thanks to José Exposito, libwacom 2.12 will support all [1] Huion and Gaomon devices when running on a 6.10 kernel.
libwacom, now almost 13 years old, is a C library that provides a bunch of static information about graphics tablets that is not otherwise available by looking at the kernel device. Basically, it's a set of APIs in the form of libwacom_get_num_buttons and so on. This is used by various components to be more precise about initializing devices, even though libwacom itself has no effect on whether the device works. It's only a library for historical reasons [2], if I were to rewrite it today, I'd probably ship libwacom as a set of static json or XML files with a specific schema.
Here are a few examples of how this information is used: libinput uses libwacom to query information about tablet tools. The kernel event node always supports tilt but the individual tool that is currently in proximity may not; libinput can get the tool ID from the kernel, query libwacom and then initialize the tool struct correctly so the compositor and Wayland clients will get the right information. GNOME Settings uses libwacom's information to e.g. detect if a tablet is built-in or an external display (to show you the "Map to Monitor" button or not, if builtin), and GNOME's mutter uses the SVGs provided by libwacom to show you an OSD where you can assign keystrokes to the buttons. All these features require that the tablet is supported by libwacom.
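For a flavour of the API, a minimal consumer might look like this (a sketch: error handling is omitted and the event node path is hypothetical):
#include <stdio.h>
#include <libwacom/libwacom.h>

int main(void)
{
    WacomDeviceDatabase *db = libwacom_database_new();
    /* Look up the static tablet description for a kernel event node. */
    WacomDevice *dev = libwacom_new_from_path(db, "/dev/input/event5",
                                              WFALLBACK_NONE, NULL);
    if (dev)
        printf("buttons: %d\n", libwacom_get_num_buttons(dev));
    return 0;
}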
Huion and Gaomon devices [3] were not well supported by libwacom because they re-use USB ids, i.e. different tablets from seemingly different manufacturers have the same vendor and product ID. This is understandable: the 16-bit product id only allows for 65535 different devices and if you're a company that thinks about more than just the current quarterly earnings you realise that if you release a few devices every year (let's say 5-7), you may run out of product IDs in about 10000 years. Need to think ahead! So between the 140 Huion and Gaomon devices we now have in libwacom I only counted 4 different USB ids.
Nine years ago we added name matching too to work around this (i.e. the vid/pid/name combo must match) but, lo and behold, we may run out of unique strings before the heat death of the universe so device names are re-used too! [4] Since we had no other information available to userspace, this meant that if you plugged in e.g. a Gaomon M106, it may have been detected as an S620 and given wrong button numbers, a wrong SVG, etc.
A while ago José got himself a tablet and started contributing to DIGIMEND (and upstreaming a bunch of things). At some point we realised that the kernel actually had the information we needed: the firmware version string from the tablet which conveniently gave us the tablet model too. With this kernel patch scheduled for 6.10 this is now exported as the uniq property (HID_UNIQ in the uevent) and that means it's available to userspace. After a bit of rework in libwacom we can now match on the trifecta of vid/pid/uniq or the quadrella of vid/pid/name/uniq. So hooray, for the first time we can actually detect Huion and Gaomon devices correctly.
The second thing José did was to extract all model names from the .deb packages Huion and Gaomon provide and auto-generate libwacom descriptions for all supported devices. Which meant that, in one pull request, we added around 130 devices. Nice!
As said above, this requires the future kernel 6.10 but you can apply the patches to your current kernel if you want. If you do have one of the newly added devices, please verify the .tablet file for your device and let us know so we can remove the "this is autogenerated" warnings and fix any issues with the file. Some of the new files may now take precedence over the old hand-added ones so over time we'll likely have to merge them. But meanwhile, for a brief moment in time, things may actually work.
[1] fsvo of all but should be all current and past ones provided they were supported by Huions driver
[2] anecdote: in 2011 Jason Gerecke from Wacom and I sat down to and decided on a generic tablet handling library independent of the xf86-input-wacom driver. libwacom was supposed to be that library but it never turned into more than a static description library, libinput is now what our original libwacom idea was.
[3] and XP Pen and UCLogic but we don't yet have a fix for those at the time of writing
[4] names like "HUION PenTablet Pen"...
We’re excited to announce the details of our upcoming 2024 Linux Display Next
Hackfest in the beautiful city of A Coruña, Spain!
This year’s hackfest will be hosted by Igalia and will take place from May 14th
to 16th. It will be a gathering of minds from a diverse range of companies and
open source projects, all coming together to share, learn, and collaborate
outside the traditional conference format.
Who’s Joining the Fun?
We’re excited to welcome participants from various backgrounds, including:
GPU hardware vendors;
Linux distributions;
Linux desktop environments and compositors;
Color experts, researchers and enthusiasts;
This diverse mix of backgrounds is represented by developers from several
companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin,
Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi,
RedHat, SUSE, and System76. It’ll ensure a dynamic exchange of perspectives and
foster collaboration across the Linux Display community.
Please take a look at the
list of participants
for more info.
What’s on the Agenda?
The beauty of the hackfest is that the agenda is driven by participants! As
this is a hybrid event, we decided to improve the experience for remote
participants by creating a dedicated space for them to propose topics and some
introductory talks in advance. From those inputs, we defined a schedule that
reflects the collective interests of the group, but is still open for
amendments and new proposals. Find the schedule details on the official event
webpage.
Expect discussions on:
KMS Color/HDR
Proposal with new DRM object type:
Brief presentation of GPU-vendor features;
Status update of plane color management pipeline per vendor on Linux;
HDR/Color Use-cases:
HDR gainmap images and how should we think about HDR;
Google/ChromeOS GFX view about HDR/per-plane color management, VKMS and lessons learned;
Post-blending Color Pipeline.
Color/HDR testing/CI
VKMS status update;
Chamelium boards, video capture.
Wayland protocols
color-management protocol status update;
color-representation and video playback.
Display control
HDR signalling status update;
backlight status update;
EDID and DDC/CI.
Strategy for video and gaming use-cases
Multi-plane support in compositors
Underlay, overlay, or mixed strategy for video and gaming use-cases;
KMS Plane UAPI to simplify the plane arrangement problem;
Shared plane arrangement algorithm desired.
HDR video and hardware overlay
Frame timing and VRR
Frame timing:
Limitations of uAPI;
Current user space solutions;
Brainstorm better uAPI;
Cursor/overlay plane updates with VRR;
KMS commit and buffer-readiness deadlines;
Power Saving vs Color/Latency
ABM (adaptive backlight management);
PSR1 latencies;
Power optimization vs color accuracy/latency requirements.
Content-Adaptive Scaling & Sharpening
Content-Adaptive Scalers on display hardware;
New drm_colorop for content adaptive scaling;
Proprietary algorithms.
Display Mux
Laptop muxes for switching of the embedded panel between the integrated GPU and the discrete GPU;
Seamless/atomic hand-off between drivers on Linux desktops.
Real time scheduling & async KMS API
Potential benefits: lower latency input feedback, better VRR handling, buffer synchronization, etc.
Issues around “async” uAPI usage and async-call handling.
In-person, but also geographically-distributed event
This year Linux Display Next hackfest is a hybrid event, hosted onsite at the
Igalia offices and available for remote attendance. In-person participants will
find an environment for networking and brainstorming in our inspiring and
collaborative office space. Additionally, A Coruña itself is a gem waiting to
be explored, with stunning beaches, good food, and historical sites.
Semi-structured format: how the 2024 Linux Display Next Hackfest will work
Agenda: Participants proposed the topics and talks to be discussed in the
sessions.
Interactive Sessions: Discussions, workshops, introductory talks and
brainstorming sessions lasting around 1h30. There is always a starting point
for discussions and new ideas will emerge in real time.
Immersive experience: We will have coffee breaks between sessions and
lunch at the office for all in-person participants. Lunches and
coffee breaks are sponsored by Igalia. This will keep us sharing knowledge and
continuously interacting.
Spaces for all group sizes: In-person participants will find rooms of
different sizes at Igalia HQ to match various group sizes. Besides that, there
will be some devices for showcasing and real-time demonstrations.
Social Activities: building connections beyond the sessions
To make the most of your time in A Coruña, we’ll be organizing some social activities:
First-day Dinner: In-person participants will enjoy a Galician dinner on
Tuesday, after a first day of intensive discussions in the hackfest.
Getting to know a little of A Coruña: discovering a bit of the city
and current local habits.
On Thursday afternoon, we will close the 2024 Linux Display Next hackfest
with a guided tour of the Museum of Galicia’s favorite beer brand, Estrella
Galicia. The guided tour covers the eight sectors of the museum and ends with
beer pouring and tasting. After this experience, a transfer bus will take us to
the Maria Pita square.
At Maria Pita square we will see the charm of some historical landmarks of
A Coruña, explore the casual and vibrant style of the city center and taste
local foods while chatting with friends.
Sponsorship
Igalia sponsors lunches and coffee-breaks on hackfest days, Tuesday’s dinner,
and the social event on Thursday afternoon for in-person participants.
We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further
details and outcomes of this unconventional and unique experience.
With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.

With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware.

Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.

I started by probing what supporting the NPU in the S905D3 SoC from Amlogic would entail, and I found it not that different from what is currently supported, besides also using a new format for the weights tensor.

[Image: weights, the other kind of them.]

Looking a bit further, I found that this format is very similar to the one Philipp had been reverse engineering and implementing support for. After a couple of weeks staring at memory dumps and writing a Python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored.

With a few changes to Philipp's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board.

Next I will look at supporting more weights tensor dimensions and at fixing bugs in how the weights and other values are encoded. The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.

Once everything is working at the same level as with the A311D, I will move on to determining the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (NPUs are so fast at arithmetic that they tend to be memory starved).

Big thanks to Pengutronix for supporting Philipp's work, and to Libre Computer for having supported the development of the driver so far.
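To give a rough feel for the kind of decoding involved, below is a minimal sketch of a zero run-length decoder in Python. The format is a toy invented for this illustration (a zero marker followed by a run length, literal values otherwise); the real formats additionally involve Huffman coding and vendor-specific bias placement, as described above.

    def decode_zero_rle(symbols, zero_marker=0):
        """Expand a toy zero run-length encoded stream of weight values.

        In this invented format, zero_marker is followed by a run length N,
        meaning "N zero weights"; any other value is a literal weight.
        Literal zeros must therefore always be encoded as runs.
        """
        out = []
        i = 0
        while i < len(symbols):
            if symbols[i] == zero_marker:
                out.extend([0] * symbols[i + 1])  # a run of zeros
                i += 2
            else:
                out.append(symbols[i])  # a literal weight value
                i += 1
        return out

    # Two literals, a run of four zeros, then one more literal:
    print(decode_zero_rle([7, 3, 0, 4, 9]))  # -> [7, 3, 0, 0, 0, 0, 9]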
Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well.

There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used.

The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important.

The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change.

Some people seem to really care about this kind of history, though. They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy.

One final observation: even "my" camp, which generally prefers to discard the development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's commit graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message.

In other words, there is a kind of history for whose preservation there seems to be a broad consensus, at least in some cases. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata.

The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents.
The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set. The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as a diff-parent.

This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: here's an idea.)

Diff-parents are also a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously.

So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.
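As a purely hypothetical sketch of what this could look like at the object level, a commit might carry diff-parent headers next to the existing parent headers. The header name, the "reverted" flag syntax, and all hashes below are invented for illustration (nothing like this exists in git today); the arrow annotations are mine and not part of the format.

    tree 9c3e0f7d2a41b06c
    parent 4d1a8c55e0b97f21
    diff-parent 71f2b3aa90cd5e88            <- first squashed commit
    diff-parent 8e4d219cc3f07a54            <- second squashed commit
    diff-parent reverted 02ab77e1d6c49b3d   <- records a revert
    author Jane Hacker <jane@example.org> 1714060800 +0200
    committer Jane Hacker <jane@example.org> 1714060800 +0200

    Squash feature X and revert the broken workaround

Compare this with what git already does informally: git revert writes "This reverts commit <hash>." into the commit message, and git cherry-pick -x appends "(cherry picked from commit <hash>)"; the sketch above merely moves that breadcrumb into machine-readable metadata.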
The firmware which drm/kms drivers need is becoming bigger and bigger, and there is a push to move to generating a generic initramfs on distros' builders and signing the initramfs with the distro's keys for security reasons. When targeting desktops/laptops (as opposed to VMs) this means including firmware for all possible GPUs, which leads to a very big initramfs.

This has made me think about dropping the GPU drivers from the initramfs and instead making plymouth work well/better with simpledrm (on top of efifb). A while ago I discussed making this change for Fedora with the Red Hat graphics team. Spoiler: for now nothing is going to change. Let me repeat that: for now there are no plans to implement this idea, so if you believe you would be impacted by such a change: nothing is going to change.

Still, this is something worthwhile to explore further.

Advantages:

1. Smaller initramfs size:
* E.g. a host-specific initramfs with amdgpu goes down from 40MB to 20MB
* No longer need to worry about Nvidia GSP firmware size in the initrd
* This should also significantly shrink the initrd used in live images

2. Faster boot times:
* Loading + unpacking the initrd can take a surprising amount of time. E.g. on my old AMD64 embedded PC (with Bobcat cores) the reduction from 40MB to 20MB in initrd size shaves approx. 3 seconds off the initrd load time + 0.6 seconds from the time it takes to unpack the initrd
* Probing drm connectors can be slow and plymouth blocks the initrd -> rootfs transition while it is busy probing

3. Earlier showing of the splash. By using simpledrm, the splash can be shown earlier, avoiding the impression that the machine is hanging during boot. An extreme example of this is my old AMD64 embedded PC, where the time to show the first frame of the splash goes down from 47 to 9 seconds.

4. One less thing to worry about when trying to create a uniform desktop pre-generated and signed initramfs (these would still need support for nvme + ahci and commonly used rootfs + lvm + luks).

Disadvantages:

Doing this will lead to user-visible changes in the boot process:

1. Secondary monitors not lit up by the efifb will stay black during full-disk encryption password entry, since the GPU drivers will now only load after switching to the encrypted root. This includes any monitors connected to the non-boot GPU in dual-GPU setups.

Generally speaking this is not really an issue; the secondary monitors will light up pretty quickly after the switch to the real rootfs. However, when booting a docked laptop with the lid closed, where the only visible monitor(s) are connected to the non-boot GPU, the full-disk encryption password dialog will simply not be visible at all.

This is the main deal-breaker that keeps us from implementing this change. Note that because of the strict version lock between kernel driver and userspace with the nvidia binary drivers, those drivers are usually already not part of the initramfs, so this problem already exists there, and moving the GPU drivers out of the initramfs does not really make it worse.

2. With simpledrm plymouth does not get the physical size of the monitor, so plymouth will need to switch to using heuristics based on the resolution instead of DPI info to decide whether or not to use hidpi (e.g. 2x size) rendering. And even when switching to the real GPU driver, plymouth needs to stay with its initial heuristics-based decision, to avoid the scaling changing mid-boot, which would lead to a big visual glitch halfway through the boot.

This may result in a different scaling factor for some setups, but I do not expect this to really be an issue.

3. On some (older) systems the efifb will not come up in native mode, but rather in 800x600 or 1024x768. This will lead to a pretty significant discontinuity in the boot experience when switching from, say, 800x600 to 1920x1080 while plymouth was already showing the spinner at 800x600.

One possible workaround here is to add 'video=efifb:auto' to the kernel commandline, which will make the efistub switch to the highest available resolution before starting the kernel. But it seems that the native modes are simply not there on systems which come up at 800x600 / 1024x768, so this does not really help.

This does not actually break anything, but it does look a bit ugly. So we will just need to document this as an unfortunate side effect of the change, and then we (and our users) will have to live with it (on affected hardware).

4. On systems where a full modeset is done, the brief moment where the monitor goes black because of the modeset moves from just before plymouth starts to the switch from simpledrm to the real driver. So that is slightly worse. IMHO the answer here is to try and get fast modesets working on more systems.

5. On systems where the efifb comes up in the panel's native mode and a fast modeset can be done, the spinner will freeze for a (noticeable) fraction of a second as the switch to the real driver happens.

Preview:

To get an impression of what this will look / feel like on your own systems, you can implement this right now on Fedora 40 with some manual configuration changes:

1. Create /etc/dracut.conf.d/omit-gpu-drivers.conf with:

omit_drivers+=" amdgpu radeon nouveau i915 "

and then run "sudo dracut -f" to regenerate your current initrd.

2. Add "plymouth.use-simpledrm" to the kernel commandline.

3. Edit /etc/selinux/config and set SELINUX=permissive. This is necessary because at the moment plymouth has issues with accessing drm devices after the chroot from the initrd to the rootfs.

Note this all assumes EFI booting with efifb used to show the plymouth boot splash. For classic BIOS booting it is probably best to stick with having the GPU drivers inside the initramfs.
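If you try the preview steps above, one way to confirm the GPU drivers were actually omitted is to list the contents of the regenerated initrd with dracut's lsinitrd tool (run as root; with no arguments it inspects the initrd of the running kernel). A quick sketch:

    # Look for leftover GPU kernel modules in the current initrd.
    lsinitrd | grep -E 'amdgpu|radeon|nouveau|i915'
    # No matches means the omit_drivers setting took effect.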
It’s been around 6 months since the GNOME Foundation was joined by our new Executive Director, Holly Million, and the board and I wanted to update members on the Foundation’s current status and some exciting upcoming changes.
Finances
As you may be aware, the GNOME Foundation has operated at a deficit (nonprofit speak for a loss – i.e. spending more than we’ve been raising each year) for over three years, essentially running the Foundation on reserves from some substantial donations received 4-5 years ago. The Foundation has a reserves policy which specifies a minimum amount of money we have to keep in our accounts. This is so that if there is a significant interruption to our usual income, we can preserve our core operations while we work on new funding sources. We’ve now “hit the buffers” of this reserves policy, meaning the Board can’t approve any more deficit budgets – to keep spending at the same level we must increase our income.
One of the board’s top priorities in hiring Holly was therefore her experience in communications and fundraising, and building broader and more diverse support for our mission and work. Her goals since joining – as well as building her familiarity with the community and project – have been to set up better financial controls and reporting, develop a strategic plan, and start fundraising. You may have noticed the Foundation being more cautious with spending this year, because Holly prepared a break-even budget for the Board to approve in October, so that we can steady the ship while we prepare and launch our new fundraising initiatives.
Strategy & Fundraising
The biggest prerequisite for fundraising is a clear strategy – we need to explain what we’re doing and why it’s important, and use that to convince people to support our plans. I’m very pleased to report that Holly has been working hard on this, meeting with many stakeholders across the community, and has prepared a detailed and insightful five-year strategic plan. The plan defines the areas where the Foundation will prioritise, develop and fund initiatives to support and grow the GNOME project and community. The board has approved a draft version of this plan, and over the coming weeks Holly and the Foundation team will be sharing this plan and running a consultation process to gather input from GNOME Foundation and community members.
In parallel, Holly has been working on a fundraising plan to stabilise the Foundation, growing our revenue and ability to deliver on these plans. We will be launching a variety of fundraising activities over the coming months, including a development fund for people to directly support GNOME development, working with professional grant writers and managers to apply for government and private foundation funding opportunities, and building better communications to explain the importance of our work to corporate and individual donors.
Board Development
Another observation that Holly had since joining was that we had, by general nonprofit standards, a very small board of just 7 directors. While we do have some committees which have (very much appreciated!) volunteers from outside the board, our officers are usually appointed from within the board, and many board members end up serving on multiple committees and wearing several hats. It also means the number of perspectives on the board is limited and less representative of the diverse contributors and users that make up the GNOME community.
Holly has been working with the board and the governance committee to reduce how much we ask from individual board members, and improve representation from the community within the Foundation’s governance. Firstly, the board has decided to increase its size from 7 to 9 members, effective from the upcoming elections this May & June, allowing more voices to be heard within the board discussions. After that, we’re going to be working on opening up the board to more participants, creating non-voting officer seats to represent certain regions or interests from across the community, and take part in committees and board meetings. These new non-voting roles are likely to be appointed with some kind of application process, and we’ll share details about these roles and how to be considered for them as we refine our plans over the coming year.
Elections
We’re really excited to develop and share these plans and increase the ways that people can get involved in shaping the Foundation’s strategy and how we raise and spend money to support and grow the GNOME community. This brings me to my final point: the annual board elections, which take place in the run up to GUADEC, are approaching. Because of the expansion of the board, and four directors coming to the end of their terms, we’ll be electing 6 seats this election. It’s really important to Holly and the board that we use this opportunity to bring some new voices to the table, leading by example in growing and better representing our community.
Allan wrote in the past about what the board does and what’s expected from directors. As you can see, we’re working hard to reduce what we ask from each individual board member by increasing the number of directors and bringing additional members into committees and non-voting roles. If you’re interested in seeing more diverse backgrounds and perspectives represented on the board, I would strongly encourage you to consider standing for election and to reach out to a board member to discuss their experience.
Thanks for reading! Until next time.
Best Wishes,
Rob
President, GNOME Foundation
Update 2024-04-27: It was suggested in the Discourse thread that I clarify the interaction between the break-even budget and the 1M EUR committed by the STF project. This money is received in the form of a contract for services rather than a grant to the Foundation, and must be spent on the development areas agreed during the planning and application process. It’s included within this year’s budget (October 23 – September 24) and is all expected to be spent during this fiscal year, so it doesn’t have an impact on the Foundation’s reserves position. The Foundation retains a small % fee to support its costs in connection with the project, including the new requirement to have our accounts externally audited at the end of the financial year. We are putting this money towards recruitment of an administrative assistant to improve financial and other operational support for the Foundation and community, including the STF project and future development initiatives.
(also posted to GNOME Discourse, please head there if you have any questions or comments)