Port of waifu2x to pure Kotlin + OpenCL. Anime-style upscaler and noise reducer based on convolutional neural networks, using Caffe-trained models

Overview

Waifu2x implementation in pure Kotlin

Waifu2x is an upscaler and noise reducer for anime-style images based on convolutional neural networks. The original implementation was written in Lua, and there is a very simple Python-based implementation. It uses Caffe-based deep learning models.

The Kotlin implementation uses Korlib's Korim for image processing, and it includes code to process convolutional 3x3 kernels on a float matrix.

Version 0.4 uses OpenCL.

How to use the CLI?

You can grab a precompiled jar from the GitHub Releases page.

Or build from source:

git clone https://github.com/soywiz/kaifu2x.git
cd kaifu2x
./gradlew fatJar
cd build/libs
java -jar kaifu2x-all.jar -n0 -s2 input.png output.png

Install the kaifu2x binary in /usr/local/bin:

./gradlew installCli

How to use the CLI with kscript?

Create a file named kaifu2x with these contents:

#!/usr/bin/env kscript
//DEPS com.soywiz:kaifu2x:0.3.0
com.soywiz.kaifu2x.Kaifu2xCli.main(args)

Run chmod +x kaifu2x to make it executable.

You will need kscript:

  • Using brew: brew install holgerbrandl/tap/kscript
  • Using sdkman: sdk install kscript

Note that the first time you call the script it will take some time, but further executions will be faster.

If you want to try it out without installing anything other than kscript and without manually downloading any image:

brew install holgerbrandl/tap/kscript
kscript https://raw.githubusercontent.com/soywiz/kaifu2x/d72ee3dbd6f735f72da46d628390a5298e07f558/kaifu2x.kscript -s2 https://raw.githubusercontent.com/soywiz/kaifu2x/d72ee3dbd6f735f72da46d628390a5298e07f558/docs/goku_small_bg.png goku_small_bg.2x.png

How to use as a library?

It is published to Maven Central. In your build.gradle (or Maven equivalent):

compile "com.soywiz:kaifu2x:0.4.0"

Exposed API:

package com.soywiz.kaifu2x

object Kaifu2x {
	suspend fun noiseReductionRgba(image: Bitmap32, noise: Int, channels: List<BitmapChannel> = listOf(BitmapChannel.Y, BitmapChannel.A), parallel: Boolean = true, chunkSize: Int = 128, output: PrintStream? = System.err): Bitmap32
	suspend fun scaleRgba(image: Bitmap32, scale: Int, channels: List<BitmapChannel> = listOf(BitmapChannel.Y, BitmapChannel.A), parallel: Boolean = true, chunkSize: Int = 128, output: PrintStream? = System.err): Bitmap32
}

Help

kaifu2x - 0.4.0 - 2017

Usage: kaifu2x [switches] <input.png> <output.png>

Available switches:
  -h        - Displays this help
  -v        - Displays version
  -n[0-3]   - Noise reduction [default to 0 (no noise reduction)]
  -s[1-2]   - Scale level 1=1x, 2=2x [default to 1 (no scale)]
  -cs<X>    - Chunk size [default to 128]
  -q[0-100] - The quality of the output (JPG, PNG) [default=100]
  -mt       - Multi Threaded [default]
  -st       - Single Threaded
  -cl       - Process Luminance
  -cla      - Process Luminance & Alpha [default]
  -clca     - Process Luminance & Chroma & Alpha

Some numbers (v0.2.0)

Note: As a performance example, on a [email protected] it takes 4 minutes to process a single component of a 470x750 image for an output of 940x1500.

Memory used: 1.6GB, with a max heap of 2GB.

Keep in mind that in the last step it has to keep 256 planes in memory (128 for the input and 128 for the output), each the size of your uncompressed 2x image.

So a 940x1500 plane of float components requires about 5.5MB, and 256 of them about 1408MB, plus some extra for temp buffers and the like.
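The estimate above can be reproduced with a few lines of Kotlin. This is only a sketch of the arithmetic, assuming 4-byte floats; the function name planesMemoryBytes is made up for illustration and is not part of kaifu2x. The exact figures come out slightly higher than the rounded ones quoted above (5,640,000 bytes per plane, 1,443,840,000 bytes for 256 planes).

```kotlin
// Hypothetical helper (not part of kaifu2x): bytes needed to hold `planes`
// float planes of the given size, assuming 4 bytes per float component.
fun planesMemoryBytes(width: Int, height: Int, planes: Int = 256, bytesPerFloat: Int = 4): Long =
    width.toLong() * height * planes * bytesPerFloat

// Same quantity expressed in decimal megabytes (1 MB = 1,000,000 bytes).
fun planesMemoryMegabytes(width: Int, height: Int, planes: Int = 256): Double =
    planesMemoryBytes(width, height, planes) / 1_000_000.0
```

Calling planesMemoryBytes(940, 1500) gives the full-image requirement, while planesMemoryBytes(128, 128) gives 16,777,216 bytes, which matches the 16MB per-chunk figure quoted for v0.3.0.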

NOTE: Future versions will use less memory: https://github.com/soywiz/kaifu2x/issues/1 but that will require fixing an issue on edges (probably padding-related).

Some numbers (v0.3.0)

Note: As a performance example, on a [email protected] it takes 2 minutes to scale a single component 2x, from a 470x750 image to an output of 940x1500.

Version 0.3.0 splits the image into chunks of 128x128 by default (you can adjust the chunk size), so the memory requirements are now much lower: 128*128*4*256=16MB. The CLI typically uses around 50MB for any image size, though processing is still slow until hardware acceleration is implemented. Processor caches are also more likely to hit, so this is better for bigger images.
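As an illustration of the chunking idea, here is a minimal sketch of how an image could be tiled into chunks of at most 128x128. The Chunk class and splitIntoChunks function are hypothetical names, not the kaifu2x API, and the sketch leaves out the padding or overlap that a convolutional pipeline needs at chunk borders.

```kotlin
// Hypothetical type (not from kaifu2x): one rectangular region of the image.
data class Chunk(val x: Int, val y: Int, val width: Int, val height: Int)

// Splits a width x height image into chunks of at most chunkSize x chunkSize,
// scanning left to right, then top to bottom. Chunks on the right and bottom
// edges are clipped to the image bounds. Border padding is NOT handled here.
fun splitIntoChunks(width: Int, height: Int, chunkSize: Int = 128): List<Chunk> {
    require(width > 0 && height > 0 && chunkSize > 0)
    val chunks = ArrayList<Chunk>()
    for (y in 0 until height step chunkSize) {
        for (x in 0 until width step chunkSize) {
            chunks += Chunk(x, y, minOf(chunkSize, width - x), minOf(chunkSize, height - y))
        }
    }
    return chunks
}
```

For the 940x1500 output used in the numbers above, this yields 8 columns by 12 rows of chunks, and only one chunk's worth of intermediate planes has to be alive at a time.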

Some numbers (v0.4.0)

This version uses OpenCL, so it can achieve much better numbers with a good GPU.

How does this work?

Nearest neighbour scaling

First of all, we scale the image using a nearest neighbour approach, so at 2x each source pixel becomes a 2x2 block in the output.
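A minimal sketch of nearest neighbour scaling on a single channel follows. It assumes the channel is stored row-major in a FloatArray; the function name scaleNearest is illustrative and not taken from kaifu2x.

```kotlin
// Nearest-neighbour upscale of one channel stored row-major in a FloatArray.
// Every source pixel is repeated `scale` times horizontally and vertically.
fun scaleNearest(src: FloatArray, width: Int, height: Int, scale: Int = 2): FloatArray {
    require(src.size == width * height) { "src must hold width * height values" }
    require(scale >= 1)
    val outWidth = width * scale
    val outHeight = height * scale
    val out = FloatArray(outWidth * outHeight)
    for (y in 0 until outHeight) {
        val srcRow = (y / scale) * width // start of the source row for this output row
        for (x in 0 until outWidth) {
            out[y * outWidth + x] = src[srcRow + x / scale]
        }
    }
    return out
}
```

For example, a 2x2 input with values 1, 2, 3, 4 becomes a 4x4 output whose four 2x2 quadrants are filled with 1, 2, 3 and 4 respectively.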

YCbCr color space

In waifu2x we have to compute images component-wise. So we cannot process RGBA at once; we have to process each component separately.

With RGB, we have to process all three components to get a reasonable result. But with other color spaces we can reduce the number of components we process and also improve end quality.

With the YCbCr or YUV color spaces, the image is divided into a luminance component and chroma components. Having separated them, we can process just the luminance and keep the chrominance intact.

[Figure: YCbCr decomposition, with each component shown as grayscale]

So for waifu2x we can process just the luminance component, cutting processing time to roughly a third with a pretty acceptable result.

Also, to get good enough results we have to process the alpha channel too, when the image has alpha information.
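To make the luminance/chroma split concrete, here is a small sketch of an RGB to YCbCr conversion and its inverse. The coefficients are the full-range BT.601 ones used by JFIF; that choice is an assumption for illustration, and the conversion kaifu2x gets from its imaging library may differ in detail.

```kotlin
import kotlin.math.roundToInt

// Full-range BT.601 (JFIF) conversion. ASSUMPTION: these coefficients are used
// for illustration only and are not taken from the kaifu2x sources.
// Inputs are 0..255; Y, Cb and Cr are returned as floats in roughly 0..255.
fun rgbToYCbCr(r: Int, g: Int, b: Int): Triple<Float, Float, Float> {
    val y = 0.299f * r + 0.587f * g + 0.114f * b
    val cb = 128f - 0.168736f * r - 0.331264f * g + 0.5f * b
    val cr = 128f + 0.5f * r - 0.418688f * g - 0.081312f * b
    return Triple(y, cb, cr)
}

// Inverse conversion, rounding and clamping each component to 0..255.
fun yCbCrToRgb(y: Float, cb: Float, cr: Float): Triple<Int, Int, Int> {
    val r = y + 1.402f * (cr - 128f)
    val g = y - 0.344136f * (cb - 128f) - 0.714136f * (cr - 128f)
    val b = y + 1.772f * (cb - 128f)
    fun clamp(v: Float): Int = v.roundToInt().coerceIn(0, 255)
    return Triple(clamp(r), clamp(g), clamp(b))
}
```

With a split like this, only the Y plane (and alpha, when present) would go through the network, while Cb and Cr are simply scaled and merged back at the end.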

Waifu2x input

The input of waifu2x is the pixelated grayscale image produced by the nearest neighbour step (each source pixel repeated as a 2x2 block), represented as floats in the range [0f, 1f].
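A sketch of the conversion between 8-bit channel values and floats in [0f, 1f] is shown below. It assumes the channel is available as a ByteArray of unsigned 0..255 values; both function names are hypothetical.

```kotlin
import kotlin.math.roundToInt

// Converts unsigned 8-bit samples (0..255) to floats in [0f, 1f].
fun toUnitFloats(channel: ByteArray): FloatArray =
    FloatArray(channel.size) { (channel[it].toInt() and 0xFF) / 255f }

// Converts floats back to unsigned 8-bit samples, clamping to [0f, 1f] first
// because a network output can slightly overshoot the valid range.
fun toBytes(channel: FloatArray): ByteArray =
    ByteArray(channel.size) { (channel[it].coerceIn(0f, 1f) * 255f).roundToInt().toByte() }
```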

Optimizations done

Sliding memory reading for convolution kernel

Reduced from 30 seconds to 24 seconds on a [email protected]

The first optimization I made was to reduce memory reads in the convolution kernel. The waifu2x model uses 3x3 convolution matrices. For each component, it gathers the 3x3 neighbouring components, multiplies them by the weight matrix and sums the result. SIMD instructions and GPUs perform this task very well, but here I'm doing a scalar implementation in pure Kotlin. So for each element in the matrix, I read 9 neighbouring elements and multiply them by the weights. The weights are already locals, so potentially in registers. But what about the reads? Each component ends up being read 9 times, which can be optimized with a sliding window since we process the elements sequentially.

Consider this matrix:

0 1 2 3
4 5 6 7
8 9 a b
c d e f

If we process the convolution from left to right, then top to bottom, we would first read:

0 1 2
4 5 6
8 9 a

then

1 2 3
5 6 7
9 a b

In each consecutive step (from left to right), 6 values are the same as in the previous step, just shifted. So we only have to read 3 new values from memory in each step.

Sadly we can't do the same vertically, so we still read each pixel 3 times, but that is much better than 9 times. We cannot reuse partial results like in dynamic programming, so there is probably not much left to optimize here except for SIMD.
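The sliding window described above can be sketched as follows. This is an illustrative reimplementation, not the code from kaifu2x: it assumes a row-major FloatArray, a 9-element weight array, no padding (so the output is 2 smaller in each dimension) and no kernel flip, as is usual for CNN layers. A naive version is included for comparison.

```kotlin
// Naive 3x3 convolution without padding: 9 array reads per output value.
fun convolve3x3Naive(src: FloatArray, width: Int, height: Int, k: FloatArray): FloatArray {
    require(src.size == width * height && k.size == 9 && width >= 3 && height >= 3)
    val outW = width - 2
    val outH = height - 2
    val out = FloatArray(outW * outH)
    for (y in 0 until outH) {
        for (x in 0 until outW) {
            var sum = 0f
            for (ky in 0 until 3) for (kx in 0 until 3) {
                sum += src[(y + ky) * width + (x + kx)] * k[ky * 3 + kx]
            }
            out[y * outW + x] = sum
        }
    }
    return out
}

// Sliding-window variant: the 3x3 neighbourhood lives in local variables and
// only the 3 values of the new right-hand column are read at each step.
fun convolve3x3Sliding(src: FloatArray, width: Int, height: Int, k: FloatArray): FloatArray {
    require(src.size == width * height && k.size == 9 && width >= 3 && height >= 3)
    val outW = width - 2
    val outH = height - 2
    val out = FloatArray(outW * outH)
    // Weights copied to locals once, so they can stay in registers.
    val k00 = k[0]; val k01 = k[1]; val k02 = k[2]
    val k10 = k[3]; val k11 = k[4]; val k12 = k[5]
    val k20 = k[6]; val k21 = k[7]; val k22 = k[8]
    for (y in 0 until outH) {
        val r0 = y * width
        val r1 = r0 + width
        val r2 = r1 + width
        // Preload the two left-hand columns of the first window in this row.
        var a = src[r0]; var b = src[r0 + 1]
        var d = src[r1]; var e = src[r1 + 1]
        var g = src[r2]; var h = src[r2 + 1]
        for (x in 0 until outW) {
            // Only these three reads touch the source array in each step.
            val c = src[r0 + x + 2]
            val f = src[r1 + x + 2]
            val i = src[r2 + x + 2]
            out[y * outW + x] =
                a * k00 + b * k01 + c * k02 +
                d * k10 + e * k11 + f * k12 +
                g * k20 + h * k21 + i * k22
            // Shift the window one column to the right.
            a = b; b = c
            d = e; e = f
            g = h; h = i
        }
    }
    return out
}
```

On the 4x4 example matrix (taking 0..f as the values 0..15) with all weights set to 1, both functions produce the four window sums 45, 54, 81 and 90.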

Parallelize

Reduced from 24 seconds to 12 seconds on a [email protected]

This one is pretty straightforward: parallelize the work across threads. I tried parallelizing at several places to reduce the overhead. Since each of the 7 steps depends on the previous one, that part is not parallelizable, and it is not too big anyway. So I tried to parallelize the convolution work on a per-row basis, but there were too many calls to it, so the overhead was big. In the end I placed the parallelization in an earlier step.

Limit memory and unified implementation

At this point I unified the single-threaded and multithreaded implementations. I used a fixed thread pool and manually assigned tasks so that each thread requires just two additional arrays.
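One way to picture a fixed thread pool with manually assigned tasks and two scratch arrays per thread is the sketch below. It is an assumption about the general shape, not the actual kaifu2x code: the function name parallelForEach, the scratch array sizes and the strided task assignment are all hypothetical. With threads = 1 the same code path serves as the single-threaded implementation.

```kotlin
import java.util.concurrent.Executors

// Hypothetical sketch: runs body(index, scratchA, scratchB) for every index in
// 0 until count on a fixed pool. Each worker allocates its two scratch arrays
// once and reuses them for all the indices assigned to it.
fun parallelForEach(
    count: Int,
    threads: Int,
    scratchSize: Int,
    body: (index: Int, scratchA: FloatArray, scratchB: FloatArray) -> Unit
) {
    require(count >= 0 && threads >= 1 && scratchSize >= 0)
    val pool = Executors.newFixedThreadPool(threads)
    try {
        val futures = (0 until threads).map { worker ->
            pool.submit(Runnable {
                val scratchA = FloatArray(scratchSize)
                val scratchB = FloatArray(scratchSize)
                // Manual, static assignment: worker w handles w, w + threads, ...
                var index = worker
                while (index < count) {
                    body(index, scratchA, scratchB)
                    index += threads
                }
            })
        }
        // Wait for all workers; get() rethrows any failure from a worker.
        futures.forEach { it.get() }
    } finally {
        pool.shutdown()
    }
}
```

Because the number of tasks submitted equals the number of threads, the scheduling overhead stays constant regardless of how many indices are processed, which avoids the per-row call overhead mentioned earlier.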

Limit allocations

To avoid tons of allocations, copies and so on, I preallocated all the arrays required for each step at once. Then, instead of using immutable arrays, I changed the operations to be mutable and to work on existing arrays.
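The difference between the two styles can be sketched with a trivial operation. Both functions are hypothetical examples rather than kaifu2x code: the first allocates a new array on every call, the second writes into a destination that the caller preallocated and can reuse across steps.

```kotlin
// Allocating style: every call creates and returns a fresh array.
fun scaled(src: FloatArray, factor: Float): FloatArray =
    FloatArray(src.size) { src[it] * factor }

// Mutable style: writes into a caller-supplied, preallocated destination.
// dst may be the same array as src, in which case the update is in place.
fun scaleInto(src: FloatArray, factor: Float, dst: FloatArray) {
    require(dst.size >= src.size) { "dst is too small" }
    for (i in src.indices) dst[i] = src[i] * factor
}
```

In a loop over many steps, the second form keeps the garbage collector out of the hot path, at the cost of the caller having to manage which buffer holds which intermediate result.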

Future optimizations

We can't do SIMD optimizations manually in the JVM, and there are no guarantees that the JVM uses SIMD instructions, so our only option here is to use libraries that parallelize mathematical operations on the CPU or the GPU, or even with shaders (GLSL, for example). That's probably out of scope for this library (at least at this point), since the aim here is to illustrate how this works internally and to provide a portable implementation that works out of the box on Mac.

Comments
  • About trap 6 on macOS

    Hello. I ran v0.3.0 and it works well. But on v0.4 I have the following problem:

    Reading 235520.jpg...Ok
    Reading scale2.0x_model.json...Ok
    [scale2] %.1f%% - ELA: 0.0 - ETA: 00:00:01 - MEM: 596:31:23 Abort trap: 6
    

    I quickly looked at the code: it uses half precision, but on my MacBook Pro Mid 2013 with HD Graphics 4000 there is no half precision. Can this be the cause of the error?

    opened by misha-plus 4
  • Kernel and program error with AMD GPU

    Hi

    I have tested in various environments (macOS, EC2 GPU, GPUEater). It worked on macOS and EC2 GPU, but I got errors with an AMD GPU.

    This is the situation I faced.

    1. I got this error and modified the source code:
    /tmp/AMD_735_19/t_735_21.cl:12:13: error: access qualifier can only be used for pipe and image type
                __read_only global const half *krn,
                ^
    /tmp/AMD_735_19/t_735_21.cl:13:13: error: access qualifier can only be used for pipe and image type
                __read_only global const half *bias,
                ^
    /tmp/AMD_735_19/t_735_21.cl:14:13: error: access qualifier can only be used for pipe and image type
                __read_only global const float *in,
                ^
    /tmp/AMD_735_19/t_735_21.cl:15:13: error: access qualifier can only be used for pipe and image type
                __write_only global float *out,
                ^
    /tmp/AMD_735_19/t_735_21.cl:55:13: error: access qualifier can only be used for pipe and image type
                __read_only global const float *in,
                ^
    /tmp/AMD_735_19/t_735_21.cl:56:13: error: access qualifier can only be used for pipe and image type
                __write_only global half *out
                ^
    6 errors generated.
    

    I could fix this by removing __read_only and __write_only.

    2. Then it stopped in the Kotlin code. This is the error log:
    [scale2] %.1f%% - ELA: 0.0 - ETA: 00:00:03 - MEM: 596:31:23 com.jogamp.opencl.CLException$CLInvalidArgSizeException: error setting arg 7 to value java.nio.DirectByteBuffer[pos=0 lim=24 cap=24] of size 8 of CLKernel [id: 140470940142384 name: waifu2x] [error: CL_INVALID_ARG_SIZE]
    	at com.jogamp.opencl.CLException.newException(CLException.java:84)
    	at com.jogamp.opencl.CLKernel.setArgument(CLKernel.java:266)
    	at com.jogamp.opencl.CLKernel.setArg(CLKernel.java:194)
    	at com.jogamp.opencl.CLKernel.putArg(CLKernel.java:126)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2x(Kaifu2x.kt:285)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2x(Kaifu2x.kt:310)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2x(Kaifu2x.kt:327)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2x(Kaifu2x.kt:355)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2xChunkedYCbCr(Kaifu2x.kt:398)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2xChunkedRgba(Kaifu2x.kt:439)
    	at com.soywiz.kaifu2x.Kaifu2xOpencl.waifu2xChunkedRgbaFast(Kaifu2x.kt:455)
    	at com.soywiz.kaifu2x.Kaifu2x$scaleRgba$2.doResume(Kaifu2x.kt:551)
    	at com.soywiz.kaifu2x.Kaifu2x$scaleRgba$2.invoke(Kaifu2x.kt)
    	at com.soywiz.kaifu2x.Kaifu2x$scaleRgba$2.invoke(Kaifu2x.kt:513)
    	at com.soywiz.kaifu2x.Kaifu2xKt.processMeasurer(Kaifu2x.kt:568)
    	at com.soywiz.kaifu2x.Kaifu2x.scaleRgba(Kaifu2x.kt:550)
    	at com.soywiz.kaifu2x.Kaifu2x.scaleRgba$default(Kaifu2x.kt:545)
    	at com.soywiz.kaifu2x.Kaifu2xCli$main$1.doResume(Kaifu2x.kt:101)
    	at kotlin.coroutines.experimental.jvm.internal.CoroutineImpl.resume(CoroutineImpl.kt:54)
    	at kotlin.coroutines.experimental.jvm.internal.CoroutineImpl.resume(CoroutineImpl.kt:53)
    	at kotlin.coroutines.experimental.jvm.internal.CoroutineImpl.resume(CoroutineImpl.kt:53)
    	at kotlin.coroutines.experimental.SafeContinuation.resume(SafeContinuationJvm.kt:56)
    	at com.soywiz.korio.async.EventLoopJvmAndCSharp.step(EventLoopFactoryJvm.kt:105)
    	at com.soywiz.korio.async.EventLoopJvmAndCSharp.loop(EventLoopFactoryJvm.kt:122)
    	at com.soywiz.korio.async.EventLoop$Companion.main(EventLoop.kt:46)
    	at com.soywiz.korio.async.EventLoop$Companion.main(EventLoop.kt:51)
    	at com.soywiz.korio.KorioKt.Korio(Korio.kt:5)
    	at com.soywiz.kaifu2x.Kaifu2xCli.main(Kaifu2x.kt:46)
    

    Do you think these errors can be fixed? I am hoping kaifu2x will work well.

    opened by horitaku1124 0
Owner
Carlos Ballesteros Velasco