Final Work Product Submission Report (Google Summer of Code 2020)

Project Details

Project Link: https://summerofcode.withgoogle.com/projects/#5333259434590208

About SIMDe

“The SIMDe header-only library provides fast, portable implementations of SIMD intrinsics on hardware which doesn’t natively support them, such as calling SSE functions on ARM. There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).” (From SIMDe readme)

Summary of Work

I started by writing portable implementations for AVX-512F/BW intrinsics, implementing over 100 intrinsics and generating test cases for them. AVX-512 introduces many new intrinsics that make up for drawbacks of previous releases, as well as mask{,z} versions of many intrinsics. It also provides intrinsics that operate on 512-bit registers, doubling the register width compared to its predecessors. This can accelerate performance, especially in use cases like scientific simulations, AI/deep learning, and audio/video processing. (issue#104, issue#101)
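To illustrate what the mask{,z} variants mean, here is a minimal scalar sketch of a masked and a zero-masked 32-bit add. The type and function names (`vec512i_sketch`, `mask_add_epi32_sketch`, and so on) are hypothetical and are not SIMDe's actual API; the sketch only shows the lane-selection logic, assuming one mask bit per 32-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 512-bit register: 16 lanes of 32 bits. */
typedef struct { int32_t i32[16]; } vec512i_sketch;
typedef uint16_t mask16_sketch; /* one mask bit per 32-bit lane */

/* Plain lane-wise add; unsigned arithmetic wraps without signed overflow. */
vec512i_sketch add_epi32_sketch(vec512i_sketch a, vec512i_sketch b) {
  vec512i_sketch r;
  for (int i = 0; i < 16; i++) {
    r.i32[i] = (int32_t)((uint32_t)a.i32[i] + (uint32_t)b.i32[i]);
  }
  return r;
}

/* mask variant: lanes whose mask bit is clear are copied from src. */
vec512i_sketch mask_add_epi32_sketch(vec512i_sketch src, mask16_sketch k,
                                     vec512i_sketch a, vec512i_sketch b) {
  vec512i_sketch sum = add_epi32_sketch(a, b);
  vec512i_sketch r;
  for (int i = 0; i < 16; i++) {
    r.i32[i] = ((k >> i) & 1) ? sum.i32[i] : src.i32[i];
  }
  return r;
}

/* maskz variant: lanes whose mask bit is clear are set to zero. */
vec512i_sketch maskz_add_epi32_sketch(mask16_sketch k,
                                      vec512i_sketch a, vec512i_sketch b) {
  vec512i_sketch sum = add_epi32_sketch(a, b);
  vec512i_sketch r;
  for (int i = 0; i < 16; i++) {
    r.i32[i] = ((k >> i) & 1) ? sum.i32[i] : 0;
  }
  return r;
}

/* Small usage example: only lanes 0 and 2 are selected by the mask. */
void mask_add_self_check(void) {
  vec512i_sketch a, b, src;
  for (int i = 0; i < 16; i++) { a.i32[i] = i; b.i32[i] = 100; src.i32[i] = -1; }
  vec512i_sketch m = mask_add_epi32_sketch(src, 0x0005, a, b);
  vec512i_sketch z = maskz_add_epi32_sketch(0x0005, a, b);
  assert(m.i32[0] == 100 && m.i32[1] == -1 && m.i32[2] == 102);
  assert(z.i32[0] == 100 && z.i32[1] == 0 && z.i32[2] == 102);
}
```

A real implementation would use native vector operations where available; the scalar loops here only pin down the expected result for each lane.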

After that I worked on implementations for the SSE4.2 intrinsics. SSE4.2 contains instructions for string and text operations, which can be used to accelerate string library functions, XML processing, etc., along with CRC intrinsics. The complete details of my SSE4.2 work can be found in this blog. (issue#7)
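As a rough illustration of what a portable fallback for the CRC intrinsics has to compute, here is a bit-by-bit sketch of a one-byte CRC-32C update. The SSE4.2 CRC instructions use the CRC-32C (Castagnoli) polynomial, and the update step applies no initial value or final inversion itself. The function names below are hypothetical, and this is not necessarily how SIMDe implements it (a table-driven version would be faster).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One-byte CRC-32C update using the reflected Castagnoli
 * polynomial 0x82F63B78, processed one bit at a time. */
uint32_t crc32c_u8_sketch(uint32_t crc, uint8_t v) {
  crc ^= v;
  for (int bit = 0; bit < 8; bit++) {
    /* When the low bit is set, shift right and fold in the polynomial. */
    crc = (crc >> 1) ^ (UINT32_C(0x82F63B78) & (0u - (crc & 1u)));
  }
  return crc;
}

/* Conventional CRC-32C over a buffer: the caller supplies the
 * initial value and the final inversion around the update step. */
uint32_t crc32c_buffer_sketch(const uint8_t *data, size_t len) {
  uint32_t crc = UINT32_C(0xFFFFFFFF);
  for (size_t i = 0; i < len; i++) {
    crc = crc32c_u8_sketch(crc, data[i]);
  }
  return crc ^ UINT32_C(0xFFFFFFFF);
}

/* Usage example: the standard CRC-32C check value for "123456789". */
void crc32c_self_check(void) {
  static const uint8_t msg[] = "123456789";
  assert(crc32c_buffer_sketch(msg, 9) == UINT32_C(0xE3069283));
}
```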

Following that, I worked on portable implementations for AVX2 intrinsics and completed the implementation of all the remaining intrinsics. (issue#9)

The last 5 weeks of my GSoC work period were spent writing NEON fallbacks for the SSE, SSE2, SSE4.1 and SSE4.2 x86 SIMD intrinsics. After I finished most of the NEON fallbacks, I moved on to WebAssembly (WASM) implementations for x86 intrinsics. To test the NEON implementations I used QEMU, which is available on Debian, to emulate ARM and run NEON intrinsics on an Intel machine. The complete details about testing NEON code can be found in this blog. For WASM I tested using Emscripten. (issue#73, issue#86)
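The general shape of a NEON fallback can be sketched as a native path guarded by a feature macro, with a portable loop as the last resort. The sketch below uses the standard `__ARM_NEON` compiler macro and a hypothetical `vec128f_sketch` type; SIMDe has its own feature-detection macros and vector types, which are not reproduced here.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical stand-in for a 128-bit register of four floats. */
typedef struct { float f32[4]; } vec128f_sketch;

/* Lane-wise float add: NEON path when available, scalar loop otherwise. */
vec128f_sketch add_ps_sketch(vec128f_sketch a, vec128f_sketch b) {
  vec128f_sketch r;
#if defined(__ARM_NEON)
  float32x4_t va = vld1q_f32(a.f32);
  float32x4_t vb = vld1q_f32(b.f32);
  vst1q_f32(r.f32, vaddq_f32(va, vb));
#else
  for (int i = 0; i < 4; i++) {
    r.f32[i] = a.f32[i] + b.f32[i];
  }
#endif
  return r;
}

/* Usage example: both paths must produce the same lanes. */
void add_ps_self_check(void) {
  vec128f_sketch a = {{1.0f, 2.0f, 3.0f, 4.0f}};
  vec128f_sketch b = {{10.0f, 20.0f, 30.0f, 40.0f}};
  vec128f_sketch r = add_ps_sketch(a, b);
  assert(r.f32[0] == 11.0f && r.f32[1] == 22.0f);
  assert(r.f32[2] == 33.0f && r.f32[3] == 44.0f);
}
```

Compiling the same file for an ARM target and running it under an emulator exercises the NEON branch, while a native x86 build exercises the portable branch.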

Blogs:

Guide to Intel SSE4.2 CRC intrinsics (+ implementation for SIMDe)

Optimizing horizontal operation (h{add,sub}{,s}) intrinsics for SIMDe

Introduction to ARM NEON SIMD Intrinsics (+ guide for SIMDe NEON impls.)

Code Links

Pull Requests created by me in SIMDe.

Commits pushed by me which are successfully merged in master branch of SIMDe.

Sample Codes

Code for mm512_sad_epu8 (AVX-512BW)
Adding mm_i32gather_epi32 (AVX2)
Implementation for mm_crc32_u8 (SSE4.2)
mm_cmpistrz intrinsic used for string manipulations (SSE4.2)
Deinterleave operations for optimization of horizontal operations
NEON implementation for mm_dp_ps
WASM implementation for mm_mul_epi32
Generating test cases for mm_broadcastq_epi64
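
The deinterleave idea behind the horizontal-operation optimization can be sketched in scalar form: split the concatenation of the two inputs into even-indexed and odd-indexed lanes, then do one ordinary vertical add. The names below are hypothetical and the code is only an illustration of the technique, not the code from the linked sample. On AArch64 NEON, the two deinterleave steps correspond to unzip operations such as `vuzp1q_f32` and `vuzp2q_f32`.

```c
#include <assert.h>

/* Hypothetical stand-in for a 128-bit register of four floats. */
typedef struct { float f32[4]; } hvec4_sketch;

/* Even-indexed lanes of the concatenation a|b: a0, a2, b0, b2. */
hvec4_sketch even_lanes_sketch(hvec4_sketch a, hvec4_sketch b) {
  hvec4_sketch r = {{a.f32[0], a.f32[2], b.f32[0], b.f32[2]}};
  return r;
}

/* Odd-indexed lanes of the concatenation a|b: a1, a3, b1, b3. */
hvec4_sketch odd_lanes_sketch(hvec4_sketch a, hvec4_sketch b) {
  hvec4_sketch r = {{a.f32[1], a.f32[3], b.f32[1], b.f32[3]}};
  return r;
}

/* Horizontal add: result is a0+a1, a2+a3, b0+b1, b2+b3. */
hvec4_sketch hadd_ps_sketch(hvec4_sketch a, hvec4_sketch b) {
  hvec4_sketch e = even_lanes_sketch(a, b);
  hvec4_sketch o = odd_lanes_sketch(a, b);
  hvec4_sketch r;
  for (int i = 0; i < 4; i++) {
    r.f32[i] = e.f32[i] + o.f32[i]; /* one vertical add after deinterleave */
  }
  return r;
}

/* Usage example with values that are exact in float. */
void hadd_ps_self_check(void) {
  hvec4_sketch a = {{1.0f, 2.0f, 3.0f, 4.0f}};
  hvec4_sketch b = {{10.0f, 20.0f, 30.0f, 40.0f}};
  hvec4_sketch r = hadd_ps_sketch(a, b);
  assert(r.f32[0] == 3.0f && r.f32[1] == 7.0f);
  assert(r.f32[2] == 30.0f && r.f32[3] == 70.0f);
}
```

Horizontal subtract follows the same pattern with the vertical add replaced by a subtract (even lanes minus odd lanes).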

What is left to be done?

Currently 9 SSE4.2 intrinsics have complete implementations, but the intrinsics in the same class as mm_cmpestra are yet to be done. I brainstormed on these extensively with my mentor, but I was not able to complete them, both because I did not fully understand Intel’s documentation and because these intrinsics are very hard in general. If I get a hint on how to approach them in the future, I will definitely come back to this.

While working on NEON implementations, I was able to complete NEON implementations for all the remaining SSE3 and SSSE3 intrinsics, but for some of the SSE, SSE2, SSE4.1 and SSE4.2 intrinsics the NEON implementation is yet to be done. This is either because there is simply no efficient NEON intrinsic available (e.g., mm_cvttpd_epi32 lacks an equivalent NEON vcvt operation from f64 to s32), or because the intrinsic is complex enough to make it very hard (for example, mm_shuffle_ps has many cases to implement for the different values of imm8). Complete list of x86 intrinsics which are yet to have NEON implementations. If I get some ideas regarding NEON implementations in the future, I will try to complete the remaining list.
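To make the imm8 problem concrete, here is a scalar sketch of what mm_shuffle_ps computes. Each 2-bit field of imm8 selects one source lane: the two low result lanes come from a, the two high ones from b. That gives 256 possible selector values, which is one reason a single efficient NEON mapping is hard to find. The names `shuf_vec4_sketch` and `shuffle_ps_sketch` are hypothetical, and the real intrinsic requires imm8 to be a compile-time constant, whereas this sketch accepts it at run time.

```c
#include <assert.h>

/* Hypothetical stand-in for a 128-bit register of four floats. */
typedef struct { float f32[4]; } shuf_vec4_sketch;

/* Scalar model of the shuffle: each 2-bit field of imm8 picks a lane. */
shuf_vec4_sketch shuffle_ps_sketch(shuf_vec4_sketch a, shuf_vec4_sketch b,
                                   int imm8) {
  shuf_vec4_sketch r;
  assert(imm8 >= 0 && imm8 <= 255);
  r.f32[0] = a.f32[(imm8 >> 0) & 3]; /* bits 1:0 select from a */
  r.f32[1] = a.f32[(imm8 >> 2) & 3]; /* bits 3:2 select from a */
  r.f32[2] = b.f32[(imm8 >> 4) & 3]; /* bits 5:4 select from b */
  r.f32[3] = b.f32[(imm8 >> 6) & 3]; /* bits 7:6 select from b */
  return r;
}

/* Usage example: imm8 = 0x1B has the fields 3, 2, 1, 0 from low to high. */
void shuffle_ps_self_check(void) {
  shuf_vec4_sketch a = {{0.0f, 1.0f, 2.0f, 3.0f}};
  shuf_vec4_sketch b = {{4.0f, 5.0f, 6.0f, 7.0f}};
  shuf_vec4_sketch r = shuffle_ps_sketch(a, b, 0x1B);
  assert(r.f32[0] == 3.0f && r.f32[1] == 2.0f);
  assert(r.f32[2] == 5.0f && r.f32[3] == 4.0f);
}
```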

Final Thoughts

GSoC has made this summer the best summer of my undergraduate journey, and I have learnt the most in these months. Apart from the increased knowledge about SIMD vector operations and different architectures, this program has enhanced my other skills as well, whether it be diving into complex documentation, reading someone else’s code, working with other developers in a team, or my soft skills. I first started contributing to the project towards the end of February/start of March, and I am glad to have had this 6-month journey with the project so far. I am very thankful to all my mentors, and especially Evan Nemerson, for clearing all my doubts in weekly video calls on Jitsi :).

Currently I am working on WASM implementations for x86 intrinsics. I hope to continue working on them and keep contributing to the project in the future, as well as help new contributors get started with SIMDe.

Thank You

Published by masterchef2209

Hi there, I am Hidayat, a 3rd-year undergraduate at IIIT Lucknow and a Google Summer of Code 2020 student with the Open Bioinformatics Foundation (OBF).
