<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Peter’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://www.peternaf.com</link><image><url>https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png</url><title>Peter’s Substack</title><link>https://www.peternaf.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 10:22:27 GMT</lastBuildDate><atom:link href="https://www.peternaf.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Peter Naftaliev]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[peternaf@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[peternaf@substack.com]]></itunes:email><itunes:name><![CDATA[Peter Naftaliev]]></itunes:name></itunes:owner><itunes:author><![CDATA[Peter Naftaliev]]></itunes:author><googleplay:owner><![CDATA[peternaf@substack.com]]></googleplay:owner><googleplay:email><![CDATA[peternaf@substack.com]]></googleplay:email><googleplay:author><![CDATA[Peter Naftaliev]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[FFmpeg 8.0 (Part 2): How to use pad_cuda]]></title><description><![CDATA[Original blog post published in Rendi]]></description><link>https://www.peternaf.com/p/ffmpeg-8-0-part-2-how-to-use-pad-cuda</link><guid isPermaLink="false">https://www.peternaf.com/p/ffmpeg-8-0-part-2-how-to-use-pad-cuda</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Wed, 18 Mar 2026 11:03:02 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!4GPZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://www.rendi.dev/post/ffmpeg-8-0-part-2-how-to-use-pad_cuda?utm_source=peternaf.com">Original blog post published in Rendi</a></em></p><p>FFmpeg 8.0 introduces a major performance boost with <a href="https://ffmpeg.org/ffmpeg-filters.html#pad_005fcuda-1">pad_cuda</a>, an Nvidia GPU-accelerated padding filter that works with <a href="https://ffmpeg.org/ffmpeg-filters.html#scale_005fcuda-1">scale_cuda</a>. </p><p>Padding and scaling are used for adjusting resolution and aspect ratio, such as converting a horizontal clip to a vertical one, which is particularly useful for platforms like TikTok, YouTube Shorts, and Instagram Reels. </p><p>Because these operations modify every pixel, they require full re-encoding, which makes them computationally expensive in FFmpeg. In this post, I will demonstrate how to use the new pad_cuda and the computational gain it introduces.</p><p>This post is part of a series of posts about the new <a href="https://ffmpeg.org/index.html#pr8.0">FFmpeg 8.0</a> release:</p><ol><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg">FFmpeg 8.0 (Part 1): Using Whisper for Native Video Transcription in FFmpeg</a></p></li><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding">FFmpeg 8.0 (Part 3): Failed attempts to use Vulkan for AV1 Encoding &amp; VP9 Decoding</a></p></li></ol><p>To install FFmpeg 8.0, follow the instructions for <a href="https://peternaf.com/p/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg">&#8220;How to install FFmpeg 8&#8221;</a>. 
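</p><p>Once installed, it is worth confirming that your build actually ships the new filter. This quick check is my own suggestion (not from the FFmpeg docs); on Linux, pipe to grep instead of findstr:</p><pre><code># List all filters and look for pad_cuda, then print its options
./ffmpeg -hide_banner -filters | findstr pad_cuda
./ffmpeg -hide_banner -h filter=pad_cuda</code></pre><p>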
The setup used is a Windows 11 Lenovo laptop with Nvidia RTX 4060 - Driver version 581.29, CUDA version 13.0</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4GPZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4GPZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 424w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 848w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 1272w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4GPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png" width="1066" height="800" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1066,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:650220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.peternaf.com/i/191016644?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4GPZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 424w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 848w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 1272w, https://substackcdn.com/image/fetch/$s_!4GPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85a37c5b-0c16-4155-b8fd-e8e99ae01188_1066x800.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The following are two examples of how to convert a 1440x1080 <a href="https://storage.rendi.dev/sample/popeye_talking.mp4">Popeye talking</a> video into a vertical video with padding, first using the old method (CPU encoding) and second using the new pad_cuda (GPU encoding) filter. </p><p>The processing time improvement is significant: running on the GPU is about 3.3x faster than on the CPU (1.29s vs 4.22s elapsed).</p><h4>CPU:</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;317c24d1-bbd6-4ac0-b278-d2cc88c4d245&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">$INPUT_FILE="popeye_talking.mp4"
$OUTPUT_FILE="output_resized_pad_cpu.mp4"

$SCALE_WIDTH="1080"
$SCALE_HEIGHT="1920"
$FORCE_ASPECT="decrease"
$PAD_WIDTH="1080"
$PAD_HEIGHT="1920"
$PAD_X="(ow-iw)/2"
$PAD_Y="(oh-ih)/2"
$PAD_COLOR="black"

$SAR="1:1"

$VIDEO_ENCODER="libx264"

./ffmpeg -i $INPUT_FILE -vf "scale=w='$SCALE_WIDTH':h='$SCALE_HEIGHT':force_original_aspect_ratio='$FORCE_ASPECT',pad='$PAD_WIDTH':'$PAD_HEIGHT':'$PAD_X':'$PAD_Y':color='$PAD_COLOR',setsar='$SAR'" -c:v $VIDEO_ENCODER $OUTPUT_FILE

....
time=00:00:20.91 bitrate=3285.0kbits/s speed=4.95x elapsed=0:00:04.22</code></pre></div><h4>GPU-CUDA:</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;e2a15d2a-04c5-4f1e-bb21-ab388d3c41df&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">$OUTPUT_FILE="output_resized_pad_cuda.mp4"

$VIDEO_ENCODER="h264_nvenc" # Switching to Nvidia's hardware H.264 encoder (NVENC) for GPU support

./ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i $INPUT_FILE -vf "scale_cuda=w='$SCALE_WIDTH':h='$SCALE_HEIGHT':force_original_aspect_ratio='$FORCE_ASPECT',pad_cuda='$PAD_WIDTH':'$PAD_HEIGHT':'$PAD_X':'$PAD_Y':color='$PAD_COLOR',setsar='$SAR'" -c:v $VIDEO_ENCODER $OUTPUT_FILE

....
time=00:00:20.87 bitrate=2394.6kbits/s speed=16.1x elapsed=0:00:01.29</code></pre></div><blockquote><p><strong>Note</strong> that you cannot mix pad_cuda with the software scale filter: every filter in a chain must process frames on the same device, so CUDA filters can only be combined with other CUDA filters.</p></blockquote><h3>How the scale_cuda (and scale) filter works</h3><p>The scale filter resizes the video to match a new width and height. In this example:</p><pre><code>scale_cuda=w=1080:h=1920:force_original_aspect_ratio=decrease</code></pre><p>This resizes the video to fit <strong>within</strong> 1080&#215;1920 without stretching. FFmpeg will automatically reduce one dimension to preserve the original aspect ratio. For example, the 1440&#215;1080 sample video will be resized to <strong>1080&#215;810</strong> before padding.</p><h4>Helpful scale options</h4><table><tr><th>Expression</th><th>Purpose</th></tr><tr><td>scale=w=1080:h=-1</td><td>Automatically calculates the height that preserves the aspect ratio, fitting inside a width of 1080</td></tr><tr><td>scale=w=1080:h=-2</td><td>Like above, but forces dimensions divisible by 2 (required by many encoders)</td></tr><tr><td>force_original_aspect_ratio=decrease</td><td>Ensures neither dimension exceeds the requested size while preserving the ratio</td></tr><tr><td>force_original_aspect_ratio=increase</td><td>Expands the image to fill the requested size, then crops or pads as needed</td></tr></table><blockquote><p><strong>Note:</strong> You cannot use scale=w=-1:h=1920 here &#8212; FFmpeg would choose a width larger than 1080, which breaks the target output resolution.</p></blockquote><h3>How the pad_cuda (and pad) filter works in FFmpeg 8</h3><p>After scaling, the video may not fill the target resolution completely. 
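</p><p>As a sanity check, the scale-and-pad geometry can be computed by hand. The sketch below is my own (plain shell arithmetic, assuming the 1440x1080 sample and the 1080x1920 target) and reproduces what force_original_aspect_ratio=decrease plus a centered pad will do:</p><pre><code># Target canvas and source size (the 1440x1080 Popeye sample)
TW=1080; TH=1920
IW=1440; IH=1080

# decrease: shrink so that both dimensions fit inside the target
if [ $((IW * TH)) -gt $((TW * IH)) ]; then
  SW=$TW; SH=$(( IH * TW / IW ))   # width is the limiting dimension
else
  SH=$TH; SW=$(( IW * TH / IH ))   # height is the limiting dimension
fi

# pad x:y = (ow-iw)/2:(oh-ih)/2 centers the scaled video on the canvas
PX=$(( (TW - SW) / 2 ))
PY=$(( (TH - SH) / 2 ))

echo "scaled: ${SW}x${SH}, pad offset: ${PX},${PY}"
# prints: scaled: 1080x810, pad offset: 0,555</code></pre><p>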
The pad filter centers the video and adds borders around it so the final output matches the desired dimensions.</p><pre><code>pad=1080:1920:(ow-iw)/2:(oh-ih)/2:color=black</code></pre><p>Values are <strong>width:height:x:y</strong>, where x:y is the position of the input video&#8217;s top-left corner on the padded canvas.</p><p>This means:</p><table><tr><th>Parameter</th><th>Meaning</th></tr><tr><td>1080:1920</td><td>Final output resolution</td></tr><tr><td>(ow-iw)/2:(oh-ih)/2</td><td>ow/oh are the output width/height and iw/ih are the input width/height. This places the input&#8217;s top-left corner so the video sits in the middle, with the padding around it</td></tr><tr><td>color=black</td><td>Border background color</td></tr></table><p>You can also use negative offsets, which auto-center the video for the same effect:</p><pre><code>pad=1080:1920:-1:-1:color=black</code></pre><h3><strong>setsar</strong></h3><pre><code>setsar=1:1</code></pre><p>SAR = Sample Aspect Ratio. Setting it to 1:1 ensures pixel proportions remain correct and prevents stretched output. (1, 1.0, and 1/1 are equivalent.)</p><h3>Encoding/Decoding parameters</h3><h5>libx264</h5><p>The software H.264 encoder; it runs on the CPU and is the encoder most commonly used in FFmpeg commands.</p><h5>h264_nvenc</h5><p>NVIDIA&#8217;s hardware-accelerated H.264 encoder. 
It uses the GPU&#8217;s dedicated video encoding unit instead of the CPU&#8217;s general-purpose cores.</p><h5>-hwaccel cuda </h5><p>Enables GPU hardware acceleration using CUDA.</p><h5>-hwaccel_output_format cuda</h5><p>Keeps the decoded frames in GPU memory to avoid expensive CPU&#8596;GPU memory copies.</p><p>Read more about scaling, padding, and encoding in our <a href="https://github.com/rendi-api/ffmpeg-cheatsheet/tree/main">FFmpeg cheat sheet</a></p>]]></content:encoded></item><item><title><![CDATA[FFmpeg 8.0 (Part 3): Failed attempts to use Vulkan for AV1 Encoding & VP9 Decoding]]></title><description><![CDATA[Original blog post published in Rendi]]></description><link>https://www.peternaf.com/p/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding</link><guid isPermaLink="false">https://www.peternaf.com/p/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Wed, 18 Mar 2026 11:03:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!m-eV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://www.rendi.dev/post/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding?utm_source=peternaf.com">Original blog post published in Rendi</a></em></p><p>FFmpeg 8.0 adds Vulkan support for AV1 encoding and VP9 decoding. 
Below is an explanation of all these terms, along with my personal experience using these new features in FFmpeg.</p><p>This post is part of a series of posts about the new <a href="https://ffmpeg.org/index.html#pr8.0">FFmpeg 8.0</a> release:</p><ol><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg">FFmpeg 8.0 (Part 1): Using Whisper for Native Video Transcription in FFmpeg</a></p></li><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-2-how-to-use-pad-cuda">FFmpeg 8.0 (Part 2): How to use pad_cuda</a></p></li></ol><h2>Vulkan</h2><p>Vulkan is a cross-platform, open-standard set of APIs that allows programs to use GPU hardware. FFmpeg is built against Vulkan 1.3.</p><p>FFmpeg <a href="https://ffmpeg.org/index.html#pr7.1">7.1</a> and <a href="https://ffmpeg.org/index.html#pr6.1">6.1</a> already supported Vulkan for H264 and HEVC encoding and decoding, as well as AV1 decoding.</p><h4>Codecs</h4><ul><li><p><a href="https://en.wikipedia.org/wiki/Advanced_Video_Coding">H264</a> and <a href="https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding">HEVC</a> (also known as H265) are two widespread codecs, typically carried in the ubiquitous MP4 container.</p></li><li><p><a href="https://en.wikipedia.org/wiki/AV1">AV1</a> is an open-source codec that is gaining popularity.</p></li><li><p><a href="https://en.wikipedia.org/wiki/VP9">VP9</a> is an open-source codec developed by Google and used in many Google products (including YouTube).</p></li></ul><p>The new FFmpeg 8.0 Vulkan version supports:</p><ul><li><p>Hardware-accelerated Vulkan AV1 encoding</p></li><li><p>Hardware-accelerated Vulkan VP9 decoding</p></li><li><p>Compute-based Vulkan FFV1 encoding/decoding and ProRes RAW decoding</p></li></ul><p>Hardware-accelerated means that Vulkan utilizes the underlying hardware&#8217;s capabilities to process specific video commands, while compute-based means that Vulkan uses compute shaders to perform commands (hardware-agnostic).</p><p>To 
install FFmpeg 8.0 with Vulkan, follow the instructions for <a href="https://www.rendi.dev/post/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg">&#8220;How to install FFmpeg 8&#8221;</a>. The setup used is a Windows 11 Lenovo laptop with an Nvidia RTX 4060 - Driver version 581.29, CUDA version 13.0.</p><h3>Why is Vulkan interesting?</h3><p>For developers building video applications that should be supported across different hardware and platforms (for example, FFmpeg&#8217;s devs), Vulkan is a way to standardize code so you write it once and it works on Linux, Windows, Nvidia, AMD, Intel, and the rest. You don&#8217;t need to write special code for Nvidia&#8217;s CUDA or for VA-API (used by AMD and Intel on Linux). Vulkan will help FFmpeg&#8217;s maintainers ship cross-platform code faster and more reliably.</p><p>If you are an FFmpeg user and want to decode and encode on specialized hardware, you will usually need a different FFmpeg command for each hardware platform (Nvidia CUDA, AMD VA-API, Intel QSV, etc.). With Vulkan, if your platform and hardware support it, you can write one command and use it across different devices. In practice the gain is modest, since adapting an FFmpeg command to each hardware platform is rarely a major effort.</p><h3>What are the limitations of Vulkan?</h3><p>Vulkan is not guaranteed to work on your platform. The hardware manufacturer needs to provide the required software drivers for the designated platform to support Vulkan, and you need to ensure the drivers you have installed are correct. 
For example, Nvidia&#8217;s drivers that support Vulkan for Linux and Windows, including their release notes, are available here: <a href="https://developer.nvidia.com/vulkan-driver">https://developer.nvidia.com/vulkan-driver</a></p><p>Here are a few reported issues with Nvidia&#8217;s 57x drivers and Vulkan:</p><ul><li><p><a href="https://forums.developer.nvidia.com/t/ffmpeg-vulkan-decoding-doesnt-work-rtx-5060-ti-575-57-08/335569">https://forums.developer.nvidia.com/t/ffmpeg-vulkan-decoding-doesnt-work-rtx-5060-ti-575-57-08/335569</a></p></li><li><p><a href="https://forums.developer.nvidia.com/t/575-release-feedback-discussion/330513/55">https://forums.developer.nvidia.com/t/575-release-feedback-discussion/330513/55</a></p></li></ul><p>There is also a good <a href="https://github.com/mpv-player/mpv/discussions/13909">GitHub discussion</a> about Vulkan stability issues with FFmpeg and the MPV player.</p><p>To check which hardware and drivers support Vulkan, you can use <a href="https://vulkan.gpuinfo.org/listdevices.php">this community database</a>.</p><p>Vulkan doesn&#8217;t add functionality that does not exist on the underlying hardware, so older hardware will not support the new encoding/decoding. To check which Vulkan features your local system supports, use the <a href="https://vulkan.lunarg.com/doc/view/latest/windows/vulkaninfo.html">vulkaninfo</a> tool found in the <a href="https://vulkan.lunarg.com/sdk/home">Vulkan SDK</a>. Make sure to use Vulkan 1.3, as that is the version FFmpeg is built against.</p><p>For example, to check for AV1 encoding and decoding support on an Nvidia GPU:</p><pre><code>vulkaninfo.exe
....
GPU id : 0 (NVIDIA GeForce RTX 4060 Laptop GPU):
....
VK_KHR_video_decode_av1                       : extension revision 1
VK_KHR_video_encode_av1                       : extension revision 1</code></pre><h4>Vulkan performance</h4><p>Do not expect a performance gain; there may even be a small degradation, because your encoding and decoding commands go through another abstraction layer before reaching the hardware.</p><p>Vulkan is another API standard for connecting application code to underlying GPU hardware, like VAAPI and others, as xkcd put it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6058!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6058!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 424w, https://substackcdn.com/image/fetch/$s_!6058!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 848w, https://substackcdn.com/image/fetch/$s_!6058!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 1272w, https://substackcdn.com/image/fetch/$s_!6058!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6058!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png" width="999" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62c240fe-5e80-4608-8974-29560d147459_999x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:999,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6058!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 424w, https://substackcdn.com/image/fetch/$s_!6058!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 848w, https://substackcdn.com/image/fetch/$s_!6058!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 1272w, https://substackcdn.com/image/fetch/$s_!6058!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c240fe-5e80-4608-8974-29560d147459_999x567.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>My experience with FFmpeg and Vulkan</h3><p>I tried utilizing the new Vulkan AV1 encoding and VP9 decoding, but unfortunately, these features did not work. Below is the full description of what worked and what didn&#8217;t.</p><p>All our experiments used the <a href="https://storage.rendi.dev/sample/big_buck_bunny_720p.mp4">Big Buck Bunny video</a> and its <a href="https://storage.rendi.dev/sample/big_buck_bunny_720p_16sec.mp4">16-second snippet</a>. 
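</p><p>If you want to reproduce such a test clip yourself, a stream-copied cut works; note that with -c copy the cut snaps to keyframes, so the duration may be slightly off. This command is my own addition, not from the original post:</p><pre><code># Cut a ~16-second snippet without re-encoding (cut points snap to keyframes)
./ffmpeg -i big_buck_bunny_720p.mp4 -t 16 -c copy big_buck_bunny_720p_16sec.mp4</code></pre><p>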
</p><p>FFmpeg Vulkan commands have the following structure:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;8ff18a7a-3a5a-48bc-a3e9-f7abf1e079f9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">ffmpeg -init_hw_device "vulkan=vk:1" -hwaccel vulkan -hwaccel_output_format vulkan ....</code></pre></div><p>To find the correct index of the GPU (<strong>vk:1</strong>), use the following command:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;962232a7-c353-450d-9ee6-420ad370e82b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device "vulkan" -v verbose
...
[Vulkan @ 000001bdec619e00] Supported layers:
[Vulkan @ 000001bdec619e00]     VK_LAYER_NV_optimus
[Vulkan @ 000001bdec619e00]     VK_LAYER_NV_present
[Vulkan @ 000001bdec619e00] GPU listing:
[Vulkan @ 000001bdec619e00]     0: Intel(R) Arc(TM) Graphics (integrated) (0x7d55)
[Vulkan @ 000001bdec619e00]     1: NVIDIA GeForce RTX 4060 Laptop GPU (discrete) (0x28a0)
[Vulkan @ 000001bdec619e00] Device 0 selected: Intel(R) Arc(TM) Graphics (integrated) (0x7d55)
...</code></pre></div><h4>Encode AV1 with FFmpeg Vulkan</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;9d6cf599-0e6f-404a-8a8d-3cb1910fa162&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device "vulkan=vk:1" -hwaccel vulkan -hwaccel_output_format vulkan -i big_buck_bunny_720p_16sec.mp4 -c:v av1_vulkan output_av1.mkv
...
[vost#0:0/av1_vulkan @ 0000029a719635c0] Non-monotonic DTS; previous: 125, current: 42; changing to 125. This may result in incorrect timestamps in the output file.
[vost#0:0/av1_vulkan @ 0000029a719635c0] Non-monotonic DTS; previous: 125, current: 83; changing to 125. This may result in incorrect timestamps in the output file.
Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[h264 @ 0000029a6c81e540] [vk @ 0000029a755b9e80] Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[h264 @ 0000029a6c82d240] hardware accelerator failed to decode picture
Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[h264 @ 0000029a74db53c0] get_buffer() failed
[h264 @ 0000029a74db53c0] thread_get_buffer() failed
[h264 @ 0000029a74db53c0] no frame!
Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[h264 @ 0000029a74db5000] get_buffer() failed
[h264 @ 0000029a74db5000] thread_get_buffer() failed
[h264 @ 0000029a74db5000] no frame!
Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[h264 @ 0000029a74db5780] get_buffer() failed
[h264 @ 0000029a74db5780] thread_get_buffer() failed
[h264 @ 0000029a74db5780] no frame!</code></pre></div><p>The command hangs and doesn&#8217;t create the desired output.</p><p>I was not able to re-encode the H264 video to AV1 using Vulkan, no matter what I tried. I opened this <a href="https://code.ffmpeg.org/FFmpeg/FFmpeg/issues/20540">ticket</a> in FFmpeg Forgejo based on Gyan&#8217;s instructions.</p><p>Running nvenv to re-encode to AV1 worked fine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;3c4e7586-403f-4cca-83a1-10d53ee394cb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -i big_buck_bunny_720p.mp4 -c:v av1_nvenc output.mkv</code></pre></div><p>If you&#8217;re already using av1_nvenc, just keep using it; there&#8217;s no reason to switch to Vulkan.</p><p>Decoding AV1 with Vulkan (supported in the previous FFmpeg version) and encoding to h264 with CUDA also worked fine:</p><pre><code>./ffmpeg -init_hw_device &#8220;vulkan=vk:1&#8221; -hwaccel vulkan -i big_buck_bunny_720p_av1.mkv -c:v h264_nvenc -c:a aac output.mp4 
...
[out#0/mp4 @ 000001e789517940] video:147136KiB audio:9423KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.259145% frame=14315 fps=769 q=17.0 Lsize=  156965KiB time=00:09:56.33 bitrate=2156.3kbits/s speed=  32x elapsed=0:00:18.61
[aac @ 000001e78efaa100] Qavg: 541.575</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8shY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8shY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 424w, https://substackcdn.com/image/fetch/$s_!8shY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 848w, https://substackcdn.com/image/fetch/$s_!8shY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!8shY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8shY!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png" width="1200" height="712.9120879120879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:381759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.peternaf.com/i/191017209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8shY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 424w, https://substackcdn.com/image/fetch/$s_!8shY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 848w, https://substackcdn.com/image/fetch/$s_!8shY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!8shY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdd885c-6698-49cb-959c-e863c923c657_3071x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Decode VP9 with FFmpeg Vulkan</h4><p><em>&#8220;Hardware-accelerated VP9 decoding support nowadays is ubiquitous, as most GPUs and SoCs support it natively. Hardware encoding is present in Intel&#8217;s Kaby Lake processors and above.&#8221;</em><br><a href="https://en.wikipedia.org/wiki/VP9">https://en.wikipedia.org/wiki/VP9</a></p><p>VP9 Vulkan decoding and re-encoding with CUDA:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;37e1bc83-518d-4195-b2da-b757cda1e316&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device "vulkan=vk:1" -hwaccel vulkan -i big_buck_bunny_720p_16sec.webm -c:v h264_nvenc -c:a aac output.mp4
...
[opus @ 0000020467207f00] Error parsing Opus packet header.0 bitrate=2207.6kbits/s speed=18.6x elapsed=0:00:00.51
[out#0/mp4 @ 000002046178d840] video:3846KiB audio:256KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.299111%
frame=  384 fps=0.0 q=15.0 Lsize=    4115KiB time=00:00:15.87 bitrate=2123.3kbits/s speed=23.5x elapsed=0:00:00.67
[aac @ 00000204674ffac0] Qavg: 541.625</code></pre></div><p>VP9 Vulkan decoding and re-encoding with CPU:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;6f7ec420-6463-422d-a238-4e9063cf3d9f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device "vulkan=vk:1" -hwaccel vulkan -i big_buck_bunny_720p_16sec.webm -c:v libx264 -c:a aac output.mp4
...
[opus @ 000001fe5ed87200] Error parsing Opus packet header.1 bitrate=1336.3kbits/s speed=9.18x elapsed=0:00:01.02
[out#0/mp4 @ 000001fe5990e040] video:2401KiB audio:256KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.450978%
frame=  384 fps=267 q=-1.0 Lsize=    2670KiB time=00:00:15.91 bitrate=1374.0kbits/s speed=11.1x elapsed=0:00:01.43
...</code></pre></div><p>Both commands above created a malformed output video with artifacts:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m-eV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m-eV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 424w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 848w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m-eV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3069997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.peternaf.com/i/191017209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m-eV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 424w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 848w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!m-eV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3adb68-deb6-468b-81dc-75427e4c0fac_2123x1195.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Re-encoding VP9 to H264 with CUDA worked well, without creating artifacts:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;592b1258-dbfb-4e06-b7d9-3ee7a9b4c400&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device cuda -hwaccel cuda -i big_buck_bunny_720p_16sec.webm -c:v h264_nvenc -c:a aac output.mp4
...
[opus @ 0000021cf6b15780] Error parsing Opus packet header.2 bitrate=2178.9kbits/s speed=19.1x elapsed=0:00:00.50
[out#0/mp4 @ 0000021ce78e8c00] video:3718KiB audio:256KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.308708%
frame=  384 fps=0.0 q=12.0 Lsize=    3987KiB time=00:00:15.87 bitrate=2057.5kbits/s speed=24.4x elapsed=0:00:00.65
[aac @ 0000021cf6bb9240] Qavg: 541.625</code></pre></div><p>If you were using CUDA previously, just continue using that and don&#8217;t switch to Vulkan.</p><h4>About my runtime environment:</h4><p>My setup is a Windows 11 Lenovo laptop with an Nvidia RTX 4060 (driver version 581.29, CUDA version 13.0).</p><ul><li><p>Based on <a href="https://developer.nvidia.com/vulkan/video/get-started">this link</a>, you need Nvidia driver 550.23 at minimum.</p></li><li><p>Based on the <a href="https://vulkan.gpuinfo.org/listreports.php?devicename=NVIDIA+GeForce+RTX+4060+Laptop+GPU">Vulkan community</a>, the minimum Nvidia driver version for Vulkan 1.3 is 527.</p></li><li><p>I tried downgrading my Nvidia driver to version 577, but saw the same results.</p></li></ul><h5>FFmpeg Vulkan on WSL</h5><p>To test FFmpeg 8.0 on WSL, I downloaded the N-121064-g424d844534-linux64-gpl build from FFmpeg-Builds, running it on WSL Ubuntu 24.04.</p><p>FFmpeg Vulkan did not recognize the Nvidia GPU:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;e7878911-d498-428d-a1f2-582e5404bf56&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -init_hw_device "vulkan" -v verbose
...
[Vulkan @ 0x55f0b44205c0] Supported layers:
[Vulkan @ 0x55f0b44205c0]       VK_LAYER_MESA_device_select
[Vulkan @ 0x55f0b44205c0]       VK_LAYER_MESA_overlay
[Vulkan @ 0x55f0b44205c0]       VK_LAYER_INTEL_nullhw
[Vulkan @ 0x55f0b44205c0] GPU listing:
[Vulkan @ 0x55f0b44205c0]     0: llvmpipe (LLVM 19.1.1, 256 bits) (software) (0x0)
[Vulkan @ 0x55f0b44205c0] Device 0 selected: llvmpipe (LLVM 19.1.1, 256 bits) (software) (0x0)
...</code></pre></div><p>nvidia-smi did recognize the Nvidia GPU:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t97g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t97g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 424w, https://substackcdn.com/image/fetch/$s_!t97g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 848w, https://substackcdn.com/image/fetch/$s_!t97g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 1272w, https://substackcdn.com/image/fetch/$s_!t97g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t97g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png" width="1456" height="648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51516,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.peternaf.com/i/191017209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t97g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 424w, https://substackcdn.com/image/fetch/$s_!t97g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 848w, https://substackcdn.com/image/fetch/$s_!t97g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 1272w, https://substackcdn.com/image/fetch/$s_!t97g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdef3f6d-28ff-4b68-a151-38d61c905050_1756x782.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Running nvenc encoding on the GPU worked fine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;820df066-b440-4ec5-8993-0a887ea7f2b9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -i big_buck_bunny_720p.mp4 -c:v av1_nvenc output.mkv</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!efTg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!efTg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 424w, https://substackcdn.com/image/fetch/$s_!efTg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 848w, https://substackcdn.com/image/fetch/$s_!efTg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 1272w, https://substackcdn.com/image/fetch/$s_!efTg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!efTg!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png" width="1200" height="709.6153846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:861,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:389162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.peternaf.com/i/191017209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!efTg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 424w, https://substackcdn.com/image/fetch/$s_!efTg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 848w, https://substackcdn.com/image/fetch/$s_!efTg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 1272w, https://substackcdn.com/image/fetch/$s_!efTg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59870d73-0b2a-4f06-997c-d6b9f9b6747d_3071x1817.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[FFmpeg 8.0 (Part 1): Using Whisper for Native Video Transcription in FFmpeg]]></title><description><![CDATA[Original blog post published in Rendi]]></description><link>https://www.peternaf.com/p/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg</link><guid isPermaLink="false">https://www.peternaf.com/p/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Wed, 18 Mar 2026 11:03:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3400328b-6ab0-4060-bf91-ebffa6562e16_1455x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a 
href="https://www.rendi.dev/post/ffmpeg-8-0-part-1-using-whisper-for-native-video-transcription-in-ffmpeg?utm_source=peternaf.com">Original blog post published in Rendi</a></em></p><p>The most exciting feature in FFmpeg 8.0 is native support for Whisper, a free and open-source speech recognition library developed by OpenAI. FFmpeg&#8217;s Whisper integration enables you to use a single tool for transcribing video, adding subtitles, or automatically extracting highlights. It&#8217;s fast enough that you can even do it in real time on a streaming video.</p><p>FFmpeg is a free and powerful tool that enables you to easily convert, compress, or transcode nearly any video or audio format with a single command.</p><p>This post is part of a series of posts about the new <a href="https://ffmpeg.org/index.html#pr8.0">FFmpeg 8.0</a> release:</p><ol><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-2-how-to-use-pad-cuda">FFmpeg 8.0 (Part 2): How to use pad_cuda</a></p></li><li><p><a href="https://peternaf.com/p/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding">FFmpeg 8.0 (Part 3): Failed attempts to use Vulkan for AV1 Encoding &amp; VP9 Decoding</a></p></li></ol><p>This post covers:</p><ul><li><p>FFmpeg + Whisper Demo</p></li><li><p>Installing FFmpeg 8.0 with Whisper on Windows</p></li><li><p>Explaining the new FFmpeg 8.0 Whisper filter</p></li><li><p>Review and benchmarks of Whisper transcription with FFmpeg</p></li><li><p>Real-time video stream transcription with FFmpeg</p></li><li><p>Voice activity detection (VAD) in FFmpeg</p></li></ul><h2>Adding subtitles to a video with two FFmpeg commands</h2><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;4f8a3eed-2d17-4701-b23f-6548020df0de&quot;,&quot;duration&quot;:null}"></div><p>These are the two FFmpeg commands used to add subtitles to the video (<a href="https://storage.rendi.dev/sample/popeye_meets_sinbad.mp4">link to source 
video</a>):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;ea3e4766-e52d-47e9-8013-391c99e2636f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -i popeye_meets_sinbad.mp4 -vn -af "whisper=model=ggml-medium.en.bin:language=en:queue=30:destination=popey_whisper_medium.srt:format=srt" -f null -

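# The first command writes popey_whisper_medium.srt; before burning, a quick
# optional sanity check (a sketch) that transcription produced a non-empty file:
if test -s popey_whisper_medium.srt; then echo "SRT ready for burning"; fi
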
./ffmpeg -i popeye_meets_sinbad.mp4 -vf "subtitles=popey_whisper_medium.srt:force_style='BackColour=&amp;H80000000,BorderStyle=4,Outline=0,Shadow=0,Fontsize=24,MarginV=25'" -c:a copy popeye_meets_sinbad_subtitled.mp4</code></pre></div><p>In this post, I will explain how you can do it yourself.</p><h2>Why is the Whisper FFmpeg filter interesting?</h2><ul><li><p>You can use one tool for transcription and subtitle burning by utilizing the Whisper filter to create an SRT file, a standard subtitle format, which you can then write to a video (burn) with FFmpeg.</p></li><li><p>Whisper supports WAV and MP3 files. You usually need to install FFmpeg along with Whisper to support different video and audio formats. Having Whisper ship as part of FFmpeg automatically creates support for the various media formats.</p></li><li><p>It is straightforward to transcribe video streams in near real-time (see example below).</p></li><li><p>You can use the output from FFmpeg-Whisper, run it through your favorite LLM to extract highlight timestamps from the original video, and use FFmpeg again to trim out clips based on these highlights.</p></li></ul><h2>How to install FFmpeg 8 with Whisper on Windows</h2><h3>Getting FFmpeg 8.0</h3><p>I used the November 15 pre-compiled GPL version of FFmpeg 8.0 with Whisper (and Vulkan) for Windows from <a href="https://github.com/BtbN/FFmpeg-Builds">FFmpeg-Builds</a>:</p><ol><li><p>Log in to GitHub</p></li><li><p>Go to <a href="https://github.com/BtbN/FFmpeg-Builds/actions">https://github.com/BtbN/FFmpeg-Builds/actions</a></p></li><li><p>For the latest build, pick the file below named &#8220;ffmpeg-win64-gpl&#8221;</p></li><li><p>To check which FFmpeg version this build corresponds to, you can take the seven characters of the commit code, after the &#8216;g&#8217; character in the build&#8217;s version, and insert them into this URL:</p></li></ol><pre><code><a 
href="https://git.ffmpeg.org/gitweb/ffmpeg.git/blob_plain/1ce88d2:/RELEASE">https://git.ffmpeg.org/gitweb/ffmpeg.git/blob_plain/1ce88d2:/RELEASE</a>

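# A bash sketch of that lookup; the build name below is taken from the
# FFmpeg-Builds naming scheme used elsewhere in this series. Extract the seven
# characters after '-g' and assemble the URL (fetching the URL above returns
# the release string below):
BUILD="N-121064-g424d844534-linux64-gpl"
COMMIT="${BUILD#*-g}"; COMMIT="${COMMIT%%-*}"; COMMIT="${COMMIT:0:7}"
echo "https://git.ffmpeg.org/gitweb/ffmpeg.git/blob_plain/${COMMIT}:/RELEASE"
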
8.0.git</code></pre><p><em><strong>For those interested in compiling FFmpeg 8.0 themselves, here are some resources to get you started:</strong></em></p><ul><li><p>FFmpeg 8.0 is built by linking against a separately compiled Whisper library (version 1.7.5 or later). Because Whisper updates frequently, this lets you upgrade to future, improved Whisper versions while keeping the same FFmpeg version. <a href="https://www.reddit.com/r/ffmpeg/comments/1my2laj/using_whisper_filter_in_ffmpeg_8/">Good Reddit thread about compiling FFmpeg 8 with Whisper</a></p></li><li><p><a href="https://www.reddit.com/r/ffmpeg/comments/1my2laj/using_whisper_filter_in_ffmpeg_8/">Helpful Reddit thread with compilation issues and instructions</a></p></li><li><p>Official compilation docs (a bit outdated): <a href="https://trac.ffmpeg.org/wiki/CompilationGuide">https://trac.ffmpeg.org/wiki/CompilationGuide</a> <a href="https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT">https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT</a></p></li><li><p>All the build information and tutorials I found online were inconsistent, especially given how new the FFmpeg 8.0 + Whisper build is. It&#8217;s best to reverse-engineer the FFmpeg-Builds compilation script; the code is clear, well-organized, and works.</p></li></ul><h3>Get the OpenAI Whisper models</h3><p>Whisper.cpp, the C/C++ implementation of OpenAI&#8217;s open-source Whisper project, includes language models for transcribing audio to text. It is hosted in the ggml organization&#8217;s repository; ggml is an open-source machine learning library written in C/C++ with a focus on Transformer inference. 
</p><p>To download the Whisper models, get this <a href="https://github.com/ggml-org/whisper.cpp/blob/master/models/download-ggml-model.cmd">script</a> from Whisper.cpp:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;11084edb-1521-4509-ac53-2b266c7742b9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Download the base.en, medium.en, and large-v3 models
./download-ggml-model.cmd base.en
...
ggml-base.en.bin

./download-ggml-model.cmd medium.en
...
ggml-medium.en.bin

./download-ggml-model.cmd large-v3
...
ggml-large-v3.bin</code></pre></div><h4>About the models</h4><p><em><a href="https://github.com/ggml-org/whisper.cpp/blob/master/models/README.md">&#8220;Models are multilingual unless the model name includes .en. Models ending in -q5_0 are quantized. Models ending in -tdrz support local diarization (marking of speaker turns) using tinydiarize.&#8221;</a></em></p><p>I attempted to run FFmpeg with the Whisper ggml-small.en-tdrz.bin model for speaker recognition, but it did not work as expected. Therefore, I omit this model from the rest of the post.</p><blockquote><p>All the experiments were run on a Windows 11 Lenovo laptop with an Nvidia RTX 4060 (driver version 581.29, CUDA version 13.0)</p></blockquote><h2>Transcribing a video with FFmpeg and Whisper</h2><p>For testing, I used the video <a href="https://storage.rendi.dev/sample/popeye_sinbad.mp4">Popeye the Sailor Meets Sinbad the Sailor,</a> released to the public domain by the <a href="https://www.loc.gov/item/2023602008/">Library of Congress</a>.</p><blockquote><p>More about the whisper filter and its parameters in <a href="https://ffmpeg.org/ffmpeg-filters.html#whisper-1">FFmpeg&#8217;s documentation</a></p></blockquote><p>Below, I run an FFmpeg Whisper command with the Base English model, sending Whisper audio chunks (queue) of 30 seconds and writing SRT-format subtitles to the destination file:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;c6f944f1-0fb1-444f-9483-d6cfda7e8fa2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">INPUT_FILE="popeye_sinbad.mp4"
OUTPUT_FILE="popey_whisper_base.srt"

MODEL="ggml-base.en.bin" # whisper model
LANGUAGE="en"
QUEUE="30" # size of the audio chunks sent to whisper (seconds)
FORMAT="srt" # other possible output formats are json and text

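# Note (assumption): the $VAR="..." assignments above follow PowerShell syntax;
# in bash, assign without the leading $ (e.g., MODEL="ggml-base.en.bin") and
# keep the $ only when expanding (e.g., $MODEL).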
./ffmpeg -i $INPUT_FILE -vn -af "whisper=model='$MODEL':language='$LANGUAGE':queue='$QUEUE':destination='$OUTPUT_FILE':format='$FORMAT'" -f null -</code></pre></div><p><a href="https://storage.rendi.dev/blog/popey_whisper_base.srt">popey_whisper_base.srt</a> snippet:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;60950c56-f823-4589-8272-a1b75eef367d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">0
00:00:00,000 --&gt; 00:00:02,980
[MUSIC PLAYING]

1
00:00:29,994 --&gt; 00:00:32,494
(tense music)

2
00:00:59,988 --&gt; 00:01:06,988
[screaming]

3
00:01:06,988 --&gt; 00:01:08,988
[grunting]

4
00:01:08,988 --&gt; 00:01:11,988
[grunting]

5
00:01:11,988 --&gt; 00:01:15,988
I&#8217;m sitting down to say this so hardy and hail.

6
00:01:15,988 --&gt; 00:01:18,988
I live on an island on the back of a whale.

7
00:01:18,988 --&gt; 00:01:22,988
It&#8217;s a whale of an island. That&#8217;s not a bad joke.</code></pre></div><p>You can see immediately that Whisper annotates non-speech audio, such as music, screaming, and other sounds. This holds for both the Base and Medium English models. The multilingual large model does not annotate sounds.</p><blockquote><p>To burn the SRT subtitles into the video using FFmpeg, you can <a href="https://github.com/rendi-api/ffmpeg-cheatsheet/blob/main/README.md#%EF%B8%8F-add-subtitles-to-a-video">follow our cheat sheet</a> or use GPT.</p></blockquote><h4>More useful flags:</h4><ul><li><p><strong>format=json </strong>creates a JSON-style file with timestamps and transcription, similar to the SRT output.</p></li><li><p><strong>format=text</strong> transcribes the speech to plain text without timestamps.</p></li><li><p>By default, Whisper uses your system&#8217;s GPU. To disable it, specify <strong>use_gpu=false</strong>. This results in much slower processing.</p></li></ul><p>People online report that Whisper models can hallucinate speech that does not exist in the original audio. I did not see hallucinations in my tests. I also tested the <a href="https://storage.rendi.dev/sample/big_buck_bunny_720p.mp4">Big Buck Bunny video</a>, which does not contain any speech, and the results with the ggml-base.en.bin model were very clean, with no hallucinations. 
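</p><p>For reference, a minimal sketch of how the flags above combine into the -af filter string; the output filenames (out.json, out.txt, out.srt) are hypothetical placeholders, not files from this post:</p>

```shell
# Sketch: -af filter strings for the flag variants discussed above.
# Output filenames here are hypothetical placeholders.
BASE="whisper=model='ggml-base.en.bin':language='en':queue='30'"
JSON_FILTER="$BASE:destination='out.json':format='json'"             # timestamps + text, JSON-style
TEXT_FILTER="$BASE:destination='out.txt':format='text'"              # plain text, no timestamps
CPU_FILTER="$BASE:destination='out.srt':format='srt':use_gpu=false"  # force CPU (much slower)
echo "$JSON_FILTER"
```

Each string is passed unchanged to -af, e.g. ./ffmpeg -i input.mp4 -vn -af "$JSON_FILTER" -f null -.
<p>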
<a href="https://storage.rendi.dev/blog/bigbunny_whisper.json">Output: JSON file</a>.</p><p>If you do see transcription hallucinations, you can use the VAD model (explained below).</p><h3>Performance benchmarks of the three whisper models</h3><p>As expected, the larger the model, the more GPU resources it utilizes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YRNq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YRNq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 424w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 848w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 1272w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YRNq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png" width="728" height="455.9381443298969" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8277595-cf98-497a-8621-44e533e47088_776x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:486,&quot;width&quot;:776,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;% of GPU utilization by Whisper model&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="% of GPU utilization by Whisper model" title="% of GPU utilization by Whisper model" srcset="https://substackcdn.com/image/fetch/$s_!YRNq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 424w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 848w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 1272w, https://substackcdn.com/image/fetch/$s_!YRNq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8277595-cf98-497a-8621-44e533e47088_776x486.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">% of GPU utilization by Whisper model</figcaption></figure></div><p>I also tested the total processing time over the full Popeye video (15:50 minutes) and reported the results as ratios of processing time to video time for GPU-enabled (GPU speed) and GPU-disabled (CPU speed) runs. 
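</p><p>The ratio itself is simple arithmetic; here is a minimal sketch with hypothetical timings (not the measured benchmark numbers):</p>

```shell
# Sketch: processing time expressed as a multiple of video duration.
# The processing time below is hypothetical, not a benchmark result.
VIDEO_SECONDS=$((15 * 60 + 50))   # the Popeye video is 15:50 long
PROCESSING_SECONDS=95             # e.g. a transcription run that took 1:35
awk -v p="$PROCESSING_SECONDS" -v v="$VIDEO_SECONDS" 'BEGIN { printf "%.2fx\n", p / v }'
# prints 0.10x
```

<p>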
1x means processing took 15:50 minutes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pHIS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pHIS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 424w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 848w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 1272w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pHIS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png" width="796" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:796,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Processing time as a multiplier of video duration&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Processing time as a multiplier of video duration" title="Processing time as a multiplier of video duration" srcset="https://substackcdn.com/image/fetch/$s_!pHIS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 424w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 848w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 1272w, https://substackcdn.com/image/fetch/$s_!pHIS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d271fd-5376-4160-9364-a0a7e54092bf_796x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Processing time as a multiplier of video duration</figcaption></figure></div><p>The base and medium English models annotate different sounds. The large model, which is multilingual, doesn&#8217;t have such annotations; instead, music notes mark the differing sounds. JSON output of the large model: <a href="https://storage.rendi.dev/blog/popey_whisper_largev3.json">popey_whisper_largev3.json</a></p><h2>Real-time video stream transcription</h2><p>You can use FFmpeg to transcribe microphone input, HTTP Live Streaming (HLS), and Secure Reliable Transport (SRT) streams.</p><blockquote><p><strong>HLS </strong>is a widely used media streaming protocol that delivers video and audio content over standard HTTP web servers. 
Developed by Apple, it works by breaking a media file or live stream into a sequence of small, downloadable segments (typically a few seconds long) and creating an index file (with a .m3u8 extension) that lists the order and location of these segments.</p></blockquote><blockquote><p><strong>SRT </strong>is a protocol used to deliver video and audio with high quality and low latency over unreliable networks like the public internet.<br>The term is ambiguous: SRT can mean either the subtitles format or the streaming protocol. In this post, I use it to mean the subtitles format.</p></blockquote><p>Following is a screen recording of real-time SRT transcription of a live HLS video stream:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;1c5af2b3-6e90-4ad5-868a-ede566c85d9b&quot;,&quot;duration&quot;:null}"></div><p>I started playing the video with FFplay (FFmpeg&#8217;s media player) at the same time as running the FFmpeg command:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;6c848ee2-26e4-46ad-8a82-fa2512443a6d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./ffmpeg -live_start_index -1 -i https://livecmaftest1.akamaized.net/cmaf/live/2099281/abr6s/master.m3u8 -vn -af "whisper=model=ggml-base.en.bin:language=en:queue=3:destination=-:format=srt" -f null -</code></pre></div><p>The FFplay command:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;7441c970-edfe-4d7b-bcd9-8e6e4ee2a7a0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">ffplay -live_start_index -1 -i https://livecmaftest1.akamaized.net/cmaf/live/2099281/abr6s/master.m3u8</code></pre></div><p>Both commands start from the last TS chunk of the HLS stream due to the &#8220;-live_start_index -1&#8221; 
parameter. </p><p>I added &#8220;destination=-:format=srt&#8221; to output the SRT transcription directly to the terminal.</p><p>With queue=3, I experience a 3-second delay on top of the HLS stream&#8217;s own latency. The transcription processing time for the base model, ggml-base.en.bin, is below 3 seconds, so it is not the delay factor. Because transcription is faster than real time, even with the 3-second batch size, the transcription keeps up with the video.</p><h2>Voice activity detection - vad_model</h2><p>VAD can be beneficial for:</p><ol><li><p>Handling hallucinations - If you find that the speech-to-text model hallucinates text where there is no speech in the video, you can use the VAD model to ensure that only audio containing speech is passed to the model.</p></li><li><p>Pre-processing the audio before sending it to the speech-to-text model, for better transcription results. Judging by what I saw, though, even the base model is good enough that the VAD model is not required.</p></li></ol><p>You can download the Whisper VAD model using this <a href="https://github.com/ggml-org/whisper.cpp/blob/master/models/download-vad-model.cmd">script</a>.</p><p>I attempted to use the VAD model to speed up video stream transcription via audio chunking; however, I observed no increase in transcription speed. With queue=30, the queue parameter still took effect, and each batch still took 30 seconds to return transcriptions.</p><p>Thanks to <a href="https://medium.com/@vpalmisano/run-whisper-audio-transcriptions-with-one-ffmpeg-command-c6ecda51901f">Vittorio Palmisano</a> for a nice first overview of FFmpeg and Whisper capabilities.</p><h2>Closing remarks</h2><p>FFmpeg 8.0 includes more filters, encoders, and decoders, as well as security updates. 
You can read more about them in the <a href="https://ffmpeg.org/index.html#pr8.0">official release message</a> and the version 8 <a href="https://git.ffmpeg.org/gitweb/ffmpeg.git/blob_plain/refs/tags/n8.0:/Changelog">changelog</a>.</p><p>If you want more info, want us to check something else, or if anything was unclear in the post, let us know in the comments below.</p>]]></content:encoded></item><item><title><![CDATA[From 2D to 3D Using Neural Nets technical online lecture]]></title><description><![CDATA[In this talk we present a new artificial intelligence implementation which takes as input a 2D image and automatically reconstructs a 3D model.]]></description><link>https://www.peternaf.com/p/from-2d-to-3d-using-neural-nets-technical-online-lecture</link><guid isPermaLink="false">https://www.peternaf.com/p/from-2d-to-3d-using-neural-nets-technical-online-lecture</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Thu, 18 Jun 2020 08:05:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/u7j4f6U7FJ4" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div id="youtube2-u7j4f6U7FJ4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;u7j4f6U7FJ4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/u7j4f6U7FJ4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In this talk we present a new artificial intelligence implementation which takes as input a 2D image and automatically reconstructs a 3D model. The reconstruction can happen in any resolution. 
We see how this same architecture, combined with a generative adversarial network (GAN) similar in type to the networks used for deep-fakes, can be used to generate new 3D models.</p><p>We discuss some of the challenges with 3D modelling and AI, and present cool implementations of AI in visualization, texture analysis, and 3D modelling.</p><p>PDF of the talk:</p><div class="file-embed-wrapper" data-component-name="FileToDOM"><div class="file-embed-container-reader"><div class="file-embed-container-top"><image class="file-embed-thumbnail-default" src="https://substackcdn.com/image/fetch/$s_!0Cy0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack.com%2Fimg%2Fattachment_icon.svg"></image><div class="file-embed-details"><div class="file-embed-details-h1">2d To 3d Technical Lecture Public</div><div class="file-embed-details-h2">5.76MB &#8729; PDF file</div></div><a class="file-embed-button wide" href="https://www.peternaf.com/api/v1/file/749e2624-f129-45ed-95b4-f5b1ad7f973e.pdf"><span class="file-embed-button-text">Download</span></a></div><a class="file-embed-button narrow" href="https://www.peternaf.com/api/v1/file/749e2624-f129-45ed-95b4-f5b1ad7f973e.pdf"><span class="file-embed-button-text">Download</span></a></div></div><p>During the talk, I followed the implicit decoder research:<br>Open source code of the research (including trained network and datasets)<br>https://github.com/czq142857/implicit-decoder</p><p>My own two blog posts about the research:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d6243e62-0b49-4520-8fc5-3d8b4e5856e5&quot;,&quot;caption&quot;:&quot;Back again with another AI and 3D reconstruction post for you This time, a special article, with many cool discoveries, I might write following posts about it. This is the highest quality 3D reconstruction from 1 image research I have seen yet. 
An encoding-decoding type of neural network to encode the 3D structure of a shape from a 2D image and then de&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Implicit-Decoder part 1 &#8211; 3D reconstruction&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:57204161,&quot;name&quot;:&quot;Peter Naftaliev&quot;,&quot;bio&quot;:&quot;Software engineer, entrepreneur | \&quot;The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.\&quot; Michelangelo [not the turtle] &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b34314b-8678-4d52-a950-76a8e7d5cc9c_2779x2779.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2019-10-11T12:34:30.000Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!29HK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.peternaf.com/p/implicit-decoder-part-1-3d-reconstruction&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:173411815,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Peter&#8217;s 
Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!bgK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7342014e-5a1f-4978-89ba-11439da362dd_144x144.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;cf5dfc1e-dacd-470e-86ff-abe248b8165b&quot;,&quot;caption&quot;:&quot;Intro After Implicit-Decoder part 1 &#8211; 3D reconstruction this time talking about 3D generation and limitations for deep learning and 3D.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Implicit-Decoder part 2 &#8211; 3D generation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:57204161,&quot;name&quot;:&quot;Peter Naftaliev&quot;,&quot;bio&quot;:&quot;Software engineer, entrepreneur | \&quot;The sculpture is already complete within the marble block, before I start my work. 
It is already there, I just have to chisel away the superfluous material.\&quot; Michelangelo [not the turtle] &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b34314b-8678-4d52-a950-76a8e7d5cc9c_2779x2779.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2019-11-16T17:19:47.000Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!w7UI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.peternaf.com/p/implicit-decoder-part-2-3d-generation&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:173411811,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Peter&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!bgK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7342014e-5a1f-4978-89ba-11439da362dd_144x144.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Overview of Human Pose Estimation Neural Networks – HRNet + HigherHRNet, Architectures and FAQ]]></title><description><![CDATA[High Resolution Net (HRNet) is a state of the art neural network for human pose estimation &#8211; an image processing task which finds the configuration of a subject&#8217;s joints and body parts in an image.]]></description><link>https://www.peternaf.com/p/human-pose-estimation-hrnet</link><guid isPermaLink="false">https://www.peternaf.com/p/human-pose-estimation-hrnet</guid><dc:creator><![CDATA[Peter 
Naftaliev]]></dc:creator><pubDate>Sat, 13 Jun 2020 23:06:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/-RZW1jJzA7c" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://arxiv.org/pdf/1902.09212.pdf">High Resolution Net (HRNet)</a> is a state of the art neural network for human pose estimation &#8211; an image processing task which finds the configuration of a subject&#8217;s joints and body parts in an image. The novelty in the network is to maintain the high resolution representation of the input data and combine it in parallel with high to low resolution sub-networks, while keeping efficient computation complexity and parameters count.</p><div id="youtube2--RZW1jJzA7c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-RZW1jJzA7c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-RZW1jJzA7c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>In this post we will cover:</h3><ul><li><p><a href="#why-hrnet">Why HRNet?</a></p></li><li><p><a href="#hrnet-architecture">HRNet and Architecture</a></p></li><li><p><a href="#higherhrnet">HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation</a></p></li><li><p><a href="#demo">Demo video</a></p></li><li><p><a href="#code">Code FAQ</a></p></li></ul><h2>Why HRNet?</h2><ul><li><p>Good well documented and maintained open source (<a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch">link</a>) . 
2490 stars on GitHub &#8211; among the highest rated of all human pose estimation repositories.</p></li><li><p>It is used as the backbone for recent new architectures in the same research space (example in this <a href="https://jingdongwang2017.github.io/Projects/HRNet/PoseEstimation.html">project</a>).</p></li><li><p>A top competitor in many pose estimation challenges (<a href="https://paperswithcode.com/paper/deep-high-resolution-representation-learning">reference</a>):</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://paperswithcode.com/sota/keypoint-detection-on-coco-test-dev" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cOr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 424w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 848w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 1272w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cOr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png" width="1024" height="449" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://paperswithcode.com/sota/keypoint-detection-on-coco-test-dev&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cOr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 424w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 848w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 1272w, https://substackcdn.com/image/fetch/$s_!cOr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc2cd12a-14d0-4f2e-a589-c5ecad05943f_1024x449.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">#1 on COCO</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://paperswithcode.com/sota/pose-estimation-on-coco-test-dev" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nkjp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 424w, https://substackcdn.com/image/fetch/$s_!nkjp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 848w, 
https://substackcdn.com/image/fetch/$s_!nkjp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 1272w, https://substackcdn.com/image/fetch/$s_!nkjp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nkjp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png" width="1154" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1154,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://paperswithcode.com/sota/pose-estimation-on-coco-test-dev&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nkjp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 424w, https://substackcdn.com/image/fetch/$s_!nkjp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 848w, 
https://substackcdn.com/image/fetch/$s_!nkjp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 1272w, https://substackcdn.com/image/fetch/$s_!nkjp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a8556f-f2fc-4e51-9d1d-5195a8dae15c_1154x499.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">#1 on COCO</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://paperswithcode.com/sota/pose-tracking-on-posetrack2017" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gQBG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 424w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 848w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 1272w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gQBG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png" width="1024" height="448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://paperswithcode.com/sota/pose-tracking-on-posetrack2017&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gQBG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 424w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 848w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 1272w, https://substackcdn.com/image/fetch/$s_!gQBG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f24770d-1029-4b31-8519-dd1e91424ba7_1024x448.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">#2 on PoseTrack2017</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://paperswithcode.com/sota/pose-estimation-on-mpii-human-pose" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uZoK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 424w, https://substackcdn.com/image/fetch/$s_!uZoK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 848w, 
https://substackcdn.com/image/fetch/$s_!uZoK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 1272w, https://substackcdn.com/image/fetch/$s_!uZoK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uZoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png" width="1024" height="449" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7b059b8-8261-487b-a09d-25930cfde641_1024x449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://paperswithcode.com/sota/pose-estimation-on-mpii-human-pose&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uZoK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 424w, https://substackcdn.com/image/fetch/$s_!uZoK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 848w, 
https://substackcdn.com/image/fetch/$s_!uZoK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 1272w, https://substackcdn.com/image/fetch/$s_!uZoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b059b8-8261-487b-a09d-25930cfde641_1024x449.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">#6 on MPII</figcaption></figure></div><h2>HRNet Explained</h2><p>When tackling human pose estimation, we need to be 
able to detect each person in the image and estimate the configuration of their joints (or keypoints). Two main approaches exist for pose estimation:</p><h3>Top-down and bottom-up pose estimation</h3><p>The bottom-up approach first finds all keypoints and then assigns them to the different people in the image, while the top-down approach first detects the people in the image, places a bounding box around each person instance, and then estimates the keypoint configuration within each box.</p><p>Top-down methods rely on a separate person-detection network and must estimate keypoints individually for each person, so they are normally computationally intensive and not truly end-to-end. By contrast, bottom-up methods start by localizing identity-free keypoints for all persons in the input image by predicting heatmaps of the different anatomical keypoints, and then group them into person instances; this makes them considerably faster.</p><p>The top-down approach is more prevalent and currently achieves better prediction accuracy, both because it splits the problem into two tasks with a dedicated neural network trained for each, and because the bottom-up approach struggles to predict keypoints under the scale variation of different people in an image (that is, until HigherHRNet appeared &#8211; <a href="#higherhrnet">below</a>). This scale variation does not exist in top-down methods because all person instances are normalized to the same scale.</p><p>HRNet uses the top-down method: the network estimates keypoints from person bounding boxes that are detected by another network (FasterRCNN) during inference/testing.
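The top-down flow can be sketched in a few lines of Python. Everything here is a stand-in for illustration: detect_persons plays the role of FasterRCNN and estimate_keypoints plays the role of HRNet, both with hard-coded outputs so the sketch runs; only the detect-crop-decode structure is the point.

```python
import numpy as np

def detect_persons(frame):
    # Stand-in for the person detector (FasterRCNN in the demo);
    # returns (x0, y0, x1, y1) boxes, hard-coded so the sketch runs.
    return [(40, 20, 120, 220), (200, 30, 290, 230)]

def estimate_keypoints(crop, num_joints=17):
    # Stand-in for the HRNet forward pass, which would return one
    # heatmap per joint at 1/4 of the crop resolution.
    h, w = crop.shape[:2]
    heatmaps = np.zeros((num_joints, h // 4, w // 4))
    heatmaps[:, 5, 5] = 1.0  # synthetic peak so decoding has a maximum
    return heatmaps

def decode(heatmaps, box):
    # Argmax of each joint heatmap, mapped back to frame coordinates.
    x0, y0, x1, y1 = box
    _, hh, hw = heatmaps.shape
    points = []
    for hm in heatmaps:
        py, px = np.unravel_index(np.argmax(hm), hm.shape)
        points.append((x0 + px * (x1 - x0) / hw, y0 + py * (y1 - y0) / hh))
    return points

frame = np.zeros((480, 640, 3))
poses = []
for box in detect_persons(frame):
    x0, y0, x1, y1 = box
    poses.append(decode(estimate_keypoints(frame[y0:y1, x0:x1]), box))
```

Because each detected box is processed separately, the cost grows with the number of people in the frame, which is exactly the computational drawback of top-down methods noted above.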
During training, HRNet uses the annotated bounding boxes of the given dataset.</p><h4>Two datasets are used for training and evaluating the network</h4><ul><li><p>COCO &#8211; over 200K images and 250K person instances labeled with 17 keypoints. COCO evaluation also requires evaluating the person bounding boxes; this is done using the FasterRCNN network. The evaluation metric is <a href="http://cocodataset.org/#keypoints-eval">object keypoint similarity (OKS)</a> &#8211; a standard keypoint-detection accuracy metric.</p></li><li><p>MPII Human Pose &#8211; around 25K images with 40K subjects. MPII evaluation is done with the annotated bounding boxes from the dataset.</p></li></ul><h3>Architecture</h3><p>Below is a diagram of the neural network based on the code in the <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch">git project</a>, followed by the diagram of the network as depicted in the research paper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3nTy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3nTy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 424w, https://substackcdn.com/image/fetch/$s_!3nTy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 848w,
https://substackcdn.com/image/fetch/$s_!3nTy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 1272w, https://substackcdn.com/image/fetch/$s_!3nTy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3nTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png" width="2644" height="3588" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3588,&quot;width&quot;:2644,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3nTy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 424w, https://substackcdn.com/image/fetch/$s_!3nTy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 848w, 
https://substackcdn.com/image/fetch/$s_!3nTy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 1272w, https://substackcdn.com/image/fetch/$s_!3nTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67d5eec4-a418-4a5b-bf1c-222433ab93e4_2644x3588.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HRNet Network Architecture based on the published open-source</figcaption></figure></div><div 
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JiV-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JiV-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 424w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 848w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 1272w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JiV-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JiV-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 424w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 848w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 1272w, https://substackcdn.com/image/fetch/$s_!JiV-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f53643-c798-43bd-8870-b6d2075b2fb5_1147x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">HRNet Network Architecture as presented in the paper</figcaption></figure></div><p>The important structure to notice is that the network calculates the high resolution sub-network (Branch 1) in parallel with lower resolution sub-networks (Branch 2-4). 
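The parallel multi-resolution layout with repeated cross-branch fusion can be sketched with plain numpy. Note the simplifications: real HRNet uses stride-2 3x3 convolutions to go down, bilinear upsampling plus 1x1 convolutions to go up, and changes channel counts along the way; here plain resampling and addition stand in for all of that.

```python
import numpy as np

def upsample(x, factor):
    # Nearest-neighbour repetition stands in for HRNet's
    # bilinear-upsample + 1x1-conv path (a simplification).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def downsample(x, factor):
    # Strided subsampling stands in for the stride-2 3x3 convolutions.
    return x[::factor, ::factor]

def exchange(branches):
    # Each branch receives the (resized) contribution of every other
    # branch - the repeated fusion HRNet performs between stages.
    fused = []
    for i, bi in enumerate(branches):
        out = bi.copy()
        for j, bj in enumerate(branches):
            if i == j:
                continue
            f = 2 ** abs(i - j)
            out += upsample(bj, f) if j > i else downsample(bj, f)
        fused.append(out)
    return fused

b1 = np.ones((8, 8))       # high-resolution branch (Branch 1)
b2 = np.ones((4, 4)) * 2   # 1/2-resolution branch
b3 = np.ones((2, 2)) * 3   # 1/4-resolution branch
f1, f2, f3 = exchange([b1, b2, b3])
```

After the exchange, every branch keeps its own resolution but carries information from all the others, which is what "rich high-resolution representations" refers to.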
The sub-networks are fused through the fuse layers such that each of the high-to-low resolution representations receives information from the other parallel representations again and again, leading to rich high-resolution representations.</p><p>The input image is either 256 x 192 or 384 x 288, with a corresponding heatmap output size of 64 x 48 or 96 x 72. The first two convolutions reduce the input size according to the expected heatmap size. The network output has the heatmap size and 17 channels &#8211; one value per heatmap pixel for each of the 17 keypoints.</p><p>The open-source architecture depicted is the 32-channel configuration. For 48 channels, change every layer from the first transition layer onward to 48 channels and its corresponding multiples of 2.</p><p>The exchange block in the paper is a module in the open source, and the exchange unit is the fuse layer in the open source. In the paper's diagram the transition layer looks like an independent fusion of the sub-networks, while in the code, when a lower-resolution (higher-channel) sub-network is created, the transition leading to it builds on the fusion that leads to the previously lowest-resolution sub-network, with another convolution layer on top. Also, in the open source the fusion of the last layer is calculated only for the high-resolution branch (branch 1), and not for all the branches as shown in the paper's diagram.</p><p>Down-sampling &#8211; the stride-2 convolutions that transfer from high-resolution branches to lower-resolution branches in the fusion part (or exchange unit) &#8211; enlarges the number of channels only in the last down-sample when down-sampling twice or three times. This is either a mistake in the code or something that is not explicitly explained in the paper.
It is most probably a mistake in the code, since information from the larger resolution is not mapped into the deeper channels for the first down-samples &#8211; <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/187">open issue in git</a>.</p><p><em>If in doubt, use the diagram that is based on the open source &#8211; this is the one used when running the trained network.</em></p><h4>Network training</h4><ul><li><p>To initialize the weights, the authors trained the same network with a different output layer on the ImageNet classification dataset, and used the resulting weight values as the initialization for pose-estimation training.</p></li><li><p>Training 210 epochs of HRNet-W32 on the COCO dataset takes about 50-60 hours with 4 P100 GPUs &#8211; <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/3">reference</a>.</p></li></ul><h2><a href="https://arxiv.org/abs/1908.10357">HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation</a></h2><p>This is the same research team&#8217;s newer network for bottom-up pose estimation, using HRNet as the backbone. The authors tackled the problem of scale variation in bottom-up pose estimation (<a href="#top-bottom">stated above</a>) and state that they were able to solve it by outputting multi-resolution heatmaps and using the high-resolution representations HRNet provides.</p><p>HigherHRNet outperforms all other bottom-up methods on the COCO dataset, with especially large gains for medium persons. HigherHRNet also achieves state-of-the-art results on the CrowdPose dataset.
The authors state that this suggests bottom-up methods are more robust to crowded scenes than top-down methods, yet there was no comparison against the regular top-down HRNet results on the same dataset.</p><p>The backbone for this network is the regular HRNet, with a part added at the end for outputting higher-resolution heatmaps:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B97j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B97j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 424w, https://substackcdn.com/image/fetch/$s_!B97j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 848w, https://substackcdn.com/image/fetch/$s_!B97j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 1272w, https://substackcdn.com/image/fetch/$s_!B97j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B97j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Illustrating the architecture of the proposed Higher-HRNet&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Illustrating the architecture of the proposed Higher-HRNet" title="Illustrating the architecture of the proposed Higher-HRNet" srcset="https://substackcdn.com/image/fetch/$s_!B97j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 424w, https://substackcdn.com/image/fetch/$s_!B97j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 848w, https://substackcdn.com/image/fetch/$s_!B97j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 1272w, https://substackcdn.com/image/fetch/$s_!B97j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16c51d72-3f93-4f00-a54d-e803fe3e906f_1332x698.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The right part of the architecture outputs two heatmaps &#8211; one for low resolution and one for high &#8211; the resolutions are 128 x 128 and 256 x 256.
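Combining the two output resolutions can be sketched in numpy with the sizes from the text. One assumption for brevity: nearest-neighbour repetition stands in for the bilinear interpolation a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(0)
low = rng.random((17, 128, 128))    # low-resolution output heatmaps
high = rng.random((17, 256, 256))   # high-resolution output heatmaps

# Upsample the low-resolution maps to 256 x 256 and mean-aggregate
# them with the high-resolution maps.
low_up = low.repeat(2, axis=1).repeat(2, axis=2)
agg = (low_up + high) / 2.0

# The highest-valued location in each joint's aggregated map becomes
# that joint's keypoint candidate.
peaks = [np.unravel_index(np.argmax(agg[j]), agg[j].shape) for j in range(17)]
```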
During inference, both heatmaps are mean-aggregated at the higher resolution, and the highest-valued points are chosen as keypoint detections. The trapezoid is a deconvolution layer that outputs a 2-times-higher resolution, followed by 4 residual blocks. Also, for each keypoint an output scalar tag is calculated: close tag values form a group of keypoints that belongs to a specific person, while distant tag values indicate keypoints belonging to different persons. The tags are calculated according to the &#8220;Associative Embedding&#8221; method described in <a href="https://arxiv.org/abs/1611.05424">this paper</a>. The tag values are trained and predicted only for the lowest-resolution heatmap, because the authors found empirically that tag values of higher-resolution heatmaps do not learn to predict well and do not even converge.</p><p>During training, the loss function is a weighted average of the heatmap-prediction loss and the tag-value loss (per the associative embedding method, a small distance between tags of the same group lowers the loss, as does a large distance between tags of different groups). The loss of each heatmap resolution is calculated independently against the ground truth, and the losses are sum-aggregated.</p><p>Checking the <a href="https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation">open-source code</a> of HigherHRNet, there is no inference code available yet for creating demo pose-estimation videos from the trained network.</p><h2>Demo</h2><p>The demo video is based on the inference script in HRNet (this is an altered script that draws sticks between joints and doesn&#8217;t pop open images while running &#8211; <a href="https://drive.google.com/drive/folders/1KR462gnw05sB0nN9sicR8evyrsy1zwKs?usp=sharing">script link</a>).
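Going back to the scalar tags for a moment, the grouping idea can be illustrated with a short greedy sketch. The threshold value and the greedy first-fit assignment are made up for illustration; the actual associative-embedding grouping is more involved.

```python
import numpy as np

def group_by_tags(tags, threshold=0.5):
    # Greedy grouping sketch: a keypoint joins the first group whose
    # anchor tag is within `threshold`; otherwise it opens a new group.
    groups = []  # each entry: [anchor_tag, list_of_keypoint_indices]
    for i, t in enumerate(tags):
        for g in groups:
            if abs(t - g[0]) < threshold:
                g[1].append(i)
                break
        else:
            groups.append([t, [i]])
    return [members for _, members in groups]

# Tags for six detected keypoints; two persons well apart in tag space.
tags = np.array([0.1, 0.12, 2.0, 0.09, 2.1, 1.95])
groups = group_by_tags(tags)  # → [[0, 1, 3], [2, 4, 5]]
```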
Credit to <a href="https://www.youtube.com/watch?v=DI211vz6SUw&amp;t=1s">Ross Smith&#8217;s YouTube channel</a>.</p><h3>Video characteristics</h3><ul><li><p>1920x1080 pixels, 25 frames per second, 56 seconds (1400 frames).</p></li><li><p>Good examples of multi-person, challenging scenes &#8211; both homogeneous and heterogeneous backgrounds, changing backgrounds, different camera angles including zoom in and zoom out, and a dwarf in awesome poses.</p></li></ul><h3>Runtime information</h3><ul><li><p>FasterRCNN with Resnet50 is used for person detection.</p></li><li><p>HRNet with 48 channels and a 384x288 input image resolution is used.</p></li><li><p>A Dell laptop with a Core i5-7200, 32GB RAM, a GeForce 940MX, and Ubuntu 18.04 was used. The GPU reached 100% utilization during inference.</p></li><li><p>Average time to track all bounding boxes in a frame: 1.14 sec</p></li><li><p>Average time for all pose estimations in a frame: 0.43 sec</p></li><li><p>Average total time to parse one frame: 1.62 sec</p></li><li><p>Total time to run inference over the entire video: 2586.09 sec</p></li></ul><h3>Issues in the demo</h3><p>When evaluating the results of an image processing algorithm, it is important to note where the algorithm did not perform well; this gives clues into its inherent issues:</p><ol><li><p>Shirtless people against a wooden background are not detected well by FasterRCNN &#8211; this might be a training data issue for the FasterRCNN network: not enough shirtless samples, or not enough samples where the background color is similar to the person&#8217;s color.</p></li><li><p>A big yellow trampoline is detected as a person (minute 00:11) &#8211; this might show an inherent problem of FasterRCNN with homogeneous scenes.</p></li><li><p>17 keypoints are detected in bounding boxes even if there is no person inside the box or not all the joints are showing &#8211; HRNet is built in a way that all 17 joints must be predicted, even if they are not visible.</p></li><li><p>It is worth noting that pose estimation stays nice even under occlusion &#8211; see the beginning of the video. Handling information missing from the image due to occlusion is tricky, and HRNet is able to tackle this well.</p></li><li><p>Also worth mentioning: the stick the dwarf holds is not estimated as one of the limbs, which is also a positive sign.</p></li></ol><h2>Code FAQ</h2><ol><li><p>Pose tracking is done in RGB (<a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/41">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/41</a>) while the person detection baseline network was trained in BGR (<a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/15">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/15</a>).</p></li><li><p>The COCO dataset API, pycocotools, is not fully compatible with Python 3 (<a href="https://github.com/cocodataset/cocoapi/issues/49">https://github.com/cocodataset/cocoapi/issues/49</a>). HRNet mostly works, but once you start playing around with pycocotools, there might be exceptions.</p></li><li><p>You have to use numpy 1.17: <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/177">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/177</a></p></li><li><p>How to use your own dataset to train the network: <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/68">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/68</a></p></li><li><p>During inference, consider wrapping the forward pass in torch.no_grad() to speed up performance and lower memory usage (I haven&#8217;t tested it).</p></li><li><p>The third joint parameter seems to always be zero, and for the joints_3d_vis object the first two parameters always carry the same visibility flag while the third is also zero &#8211; from coco.py -&gt; _load_coco_keypoint_annotation_kernal(). 
Joints are of size 3 in preparation for the affine transform in JointsDataset -&gt; __getitem__() -&gt; affine_transform, but the third parameter is never used (maybe it is legacy, or it was put in place for later use in HigherHRNet). The same seems to happen for the MPII dataset.</p></li><li><p>During validation/test the annotated joints are not used (even though they are saved in the dataloader pipeline), so the accuracy results printed during the test run are not correct. The entire pipeline of accuracy calculation during the test run is redundant; at the end of the run the COCO API is used to calculate the correct accuracy measures.</p></li><li><p>Inference is configured with 384x288 (but the Readme says to use 256x192).</p></li></ol><h4>Image and joints transforms</h4><ul><li><p>demo/inference &#8211; box_to_center_scale() scales the image according to the boxes, but it is not clear what pixel_std=200 does. There are several open issues about it:<br><a href="https://github.com/microsoft/human-pose-estimation.pytorch/issues/26">https://github.com/microsoft/human-pose-estimation.pytorch/issues/26</a><br><a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/23">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/23</a><br><a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/9">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/9</a><br><a href="https://github.com/microsoft/human-pose-estimation.pytorch/issues/94">https://github.com/microsoft/human-pose-estimation.pytorch/issues/94</a> &#8211; &#8220;<em>I think It is just a hyper parameter representing the default w/h of the bounding box. Just leave it alone.</em>&#8221;</p></li><li><p>Center and scale are derived from the position of the detected annotated bbox within the original image. 
Center is the center of the bbox on the original image, and scale should be the size of the bbox relative to the original image &#8211; from coco.py -&gt; _load_coco_person_detection_results(). The bbox is constructed from x, y, w, h = box[:4] (x, y, width, height). When calculating scale, the aspect ratio is adjusted and normalization is applied based on the pre-configured pixel_std and a 1.25 scale factor.</p></li><li><p>inference -&gt; get_pose_estimation_prediction returns coords on the original image (there is no rotation, just the center and scale of each bounding box).</p></li><li><p>JointsDataset -&gt; __getitem__() -&gt; get_affine_transform gets a transformation which enlarges the scale of the original image according to how much larger it is than the bbox, and then centers the image at the center of the bbox.</p></li><li><p>Then, warpAffine transforms the original image to the provided center and scale, meaning we should see the area of the bbox in the output image. The output image is cropped: its 0,0 point corresponds to the point on the original image which, after the transform, lands on the 0,0 coordinate; the cropping is done moving right and down from that point.</p></li><li><p>During training the affine transform also applies random rotation, scaling, and flipping &#8211; see class JointsDataset -&gt; __getitem__().</p></li><li><p>Objects in self.db in JointsDataset are changed by reference. 
self.db is populated in line 246 of class COCODataset -&gt; _load_coco_person_detection_results().</p></li><li><p>The transformation calculation is: (x_new, y_new) = T * (x_old, y_old, 1), where x_new corresponds to x_old and y_new to y_old.<br>A good place to see an example: <a href="https://docs.opencv.org/master/dd/d52/tutorial_js_geometric_transformations.html">https://docs.opencv.org/master/dd/d52/tutorial_js_geometric_transformations.html</a></p></li><li><p>Joint positions can be negative after the transform &#8211; they are transformed with the same transformation matrix as the image, and since the transformation moves towards the center and enlarges the scale according to the bounding box, some joints can land outside the box.</p></li><li><p>Center and scale annotations for MPII are not completely clear &#8211; <a href="https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/51">https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/issues/51</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Machine Learning project management – A decision makers’ guide]]></title><description><![CDATA[Working on many machine learning (ML) projects for many different clients, and discussing the nature of ML project management with other peers and ML specialists we recognized there is sometimes a gap between the expectations of the decision makers who are interested in implementing ML in their business and what can actually be done, at what time range and how much effort and cost it might take.]]></description><link>https://www.peternaf.com/p/machine-learning-project-management-a-decision-makers-guide</link><guid isPermaLink="false">https://www.peternaf.com/p/machine-learning-project-management-a-decision-makers-guide</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Mon, 13 Apr 2020 12:32:54 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!uMPm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Working on many machine learning (ML) projects for many different clients, and discussing the nature of ML project management with other peers and ML specialists, we recognized that there is sometimes a gap between the expectations of the decision makers who are interested in implementing ML in their business and what can actually be done, in what time range, and at what effort and cost. So, we decided to write this guide for managers, CEOs, VP Products, business analysts, startup founders, and, in general, anyone who is thinking of hiring in-house or outside help to develop ML algorithms to solve a problem.</p><p><em>In this guide you will learn:</em></p><ul><li><p>What to expect when embarking on a machine learning project in your company?</p></li><li><p>What should you be wary of?</p></li><li><p>Where are the opportunities in using machine learning?</p></li><li><p>What efforts will be required on your team&#8217;s part to make it succeed?</p></li><li><p>How much is a machine learning project going to cost you?</p></li><li><p>How to recognize good ML engineers?</p></li></ul><h1>Some definitions which we will require</h1><ul><li><p>Forms of machine learning &#8211; Industry trends these days define several different forms of machine learning:</p><ul><li><p>Deep learning, or neural networks &#8211; a form in which a computer is programmed to run in a similar fashion to neuron cells in a biological brain. 
There is a network of computer-programmed neurons connected to each other, creating a graph; on one end the network receives an input and on the other end it emits an output.</p></li></ul><ul><li><p>Statistical analysis &#8211; These are the old-school techniques, for example regression or ANOVA. Today, in the industry, they are usually considered part of ML.</p></li></ul><ul><li><p>Machine learning often refers to more sophisticated methods of statistical analysis, such as SVM, decision trees, clustering algorithms, and more. You do not need to know these specific keywords in order to understand the rest of this guide</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l2zZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l2zZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 424w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 848w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 1272w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!l2zZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&#128522;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="&#128522;" title="&#128522;" srcset="https://substackcdn.com/image/fetch/$s_!l2zZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 424w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 848w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 1272w, https://substackcdn.com/image/fetch/$s_!l2zZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cbbc4aa-2dd0-4791-8b73-cdfea436fd15_72x72.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div></li></ul></li><li><p>Feature &#8211; Single data point of a sample, or, in other words a specific characteristic of a 
data sample. Examples:</p><ul><li><p>Size of an object &#8211; width in meters, height in centimeters, etc.</p></li></ul><ul><li><p>Categorical measure of an object &#8211; Male\Female, Car\Bus\Bike\Truck, etc.</p></li></ul><ul><li><p>Price, for example the price of a sale in dollars.</p></li></ul><ul><li><p>The color value of one pixel (0,0,0) &#8211; RGB with 3 features</p></li></ul><ul><li><p>Signal measure at one time point &#8211; amplitude of a sound signal (1 dB), etc.</p></li></ul></li></ul><h1>Technical considerations</h1><h2>The TL/DR version</h2><ul><li><p>Rule of thumb &#8211; if a human looking at the data can&#8217;t recognize a pattern, ML probably won&#8217;t either</p></li><li><p>Two types of algorithms &#8211; those that require training and those that are pre-trained</p></li><li><p>Unsupervised or anomaly detection algorithms rarely work, unless you have very clean data</p></li><li><p>On the other hand, there are simple implementations for group separation for labeled groups</p></li><li><p>The more training data the better; the minimal amount of data varies with project requirements and the algorithms implemented</p></li><li><p>Data formatting, examination and transformation is roughly 70% of the work</p></li><li><p>Deep learning won&#8217;t solve your problems, unless maybe you do vision\signal processing</p></li><li><p>A killer feature is more important than the algorithm</p></li></ul><h2>Rule of thumb &#8211; if a human looking at the data can&#8217;t recognize a pattern, ML probably won&#8217;t either</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O11f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!O11f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 424w, https://substackcdn.com/image/fetch/$s_!O11f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 848w, https://substackcdn.com/image/fetch/$s_!O11f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 1272w, https://substackcdn.com/image/fetch/$s_!O11f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O11f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png" width="375" height="268" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:268,&quot;width&quot;:375,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!O11f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 424w, https://substackcdn.com/image/fetch/$s_!O11f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 848w, https://substackcdn.com/image/fetch/$s_!O11f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 1272w, https://substackcdn.com/image/fetch/$s_!O11f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba811987-cf92-43bd-8fd4-34c9a99aa070_749x536.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Before running a machine learning algorithm, it helps if you can visualize the data and see repeating patterns with your own eyes. This could be in the form of graphs showing a clearly visible trend line, such as on the right.</p><p>Some of our clients develop an in-house rule-based decision machine. For example, if a customer bought a dining table, they would recognize he might also be interested in a chair, and put a rule in the software to offer chairs to customers buying tables. This is good &#8211; it means there are indeed repeating patterns in the data.</p><p>Machine learning could help you find more patterns you have missed, or refine the pattern definitions you found, making the pattern recognition more accurate and more actionable for you.</p><h2>Train or use a pre-trained algorithm</h2><p>ML is known for the training period of the algorithm &#8211; you supply your own data, or some other existing dataset, and you <em><strong>train the algorithm</strong></em> to recognize patterns in the data you are interested in. Sometimes people use the phrases &#8220;supervised&#8221; and &#8220;unsupervised&#8221;:</p><p><em>Supervised</em> means that your training data is classified into different groups. For example, if you develop an algorithm to distinguish between photos of cats and dogs, in the supervised method you will have photos of cats labeled as cats, and photos of dogs labeled as dogs, and you will train an ML algorithm to recognize cats and dogs based on this training data.</p><p><em>Unsupervised</em> means you have training data, but it is not classified. 
In the cats and dogs example, you will have photos of cats and dogs but no labeling of which is a cat and which is a dog. You only know there are 2 possibilities for the label of the photo. In this case, you will train the algorithm to distinguish two groups in your training data.</p><p>Notice that both supervised and unsupervised methods require training an algorithm.</p><p>The second option is using a <em><strong>pre-trained algorithm</strong></em>, or an algorithm that does not require training. These are pre-prepared algorithms, ready to use. For example, there are already existing algorithms for image recognition, which were trained using huge academic datasets to recognize different objects in an image. Another example &#8211; not exactly machine learning, but an AI method &#8211; is textual indexing and search software. These programs come prepared with the ability to analyze text in different languages, without requiring you to supply training text samples.</p><p>The pre-trained approach is more generic and could be implemented in every company quite easily. The issues with this approach are:</p><ol><li><p>Licensing &#8211; sometimes you are unable to use a pre-trained algorithm, because it was trained using proprietary data that has license limitations on usage.</p></li><li><p>The prediction\estimation\classification quality of the algorithm on your data might be worse, because the algorithm was not trained on your data.</p></li><li><p>Pre-trained algorithms exist only for a specific set of problems and specific constructs of data; many times you might not be able to find a pre-trained algorithm that fits your exact needs. In contrast, training your own algorithm is very generic and can be used for any required data analysis question.</p></li></ol><p>It is always best to train the algorithm using your own specific and custom data. 
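The supervised vs. unsupervised distinction above can be made concrete with a small NumPy-only sketch on synthetic two-group data (an illustration only; the data, function names, and the simple farthest-points initialization are ours, not from any library):

```python
import numpy as np

def supervised_centroids(X, y):
    """Supervised: the labels tell us which samples form each group,
    so we can compute one centroid per class directly."""
    return np.stack([X[y == k].mean(axis=0) for k in np.unique(y)])

def unsupervised_two_groups(X, iters=20):
    """Unsupervised: only the number of groups (2) is known.
    Plain Lloyd's k-means, initialized from the two samples farthest
    apart along the first feature (assumes both groups stay non-empty)."""
    centers = X[[X[:, 0].argmin(), X[:, 0].argmax()]]
    for _ in range(iters):
        # Distance of every sample to each center, then nearest assignment.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels, centers
```

On well-separated data both paths recover the same two groups; the supervised one just gets there directly because the labels are given.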
But, many times, companies don&#8217;t have enough data for training, so they are forced to use pre-trained algorithms.</p><p>It is worth mentioning that even if an implementation of an algorithm worked for another company or in other research, doing the same for your company might require a completely different project &#8211; it all depends on the dataset you have.</p><h2>Unsupervised or anomaly detection algorithms rarely work, unless you have very clean data</h2><p>These methods are a sort of &#8216;machine learning magic&#8217; and should be treated as such. Taking a bunch of data and throwing it into an algorithm in the hope that something good will come out rarely works.</p><p><em><strong>Unsupervised learning</strong></em> can work if there is a true difference between the groups we are trying to identify, and this difference clearly shows in the features (refer to the rule of thumb above about a human looking at the data). Also, we usually have to know in advance how many different groups we expect to encounter in the data.</p><p>An interesting example of clustering images of handwritten digits into different groups in an unsupervised manner can be seen here:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uMPm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uMPm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 424w, 
https://substackcdn.com/image/fetch/$s_!uMPm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 848w, https://substackcdn.com/image/fetch/$s_!uMPm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 1272w, https://substackcdn.com/image/fetch/$s_!uMPm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uMPm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png" width="1024" height="671" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:671,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uMPm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 424w, 
https://substackcdn.com/image/fetch/$s_!uMPm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 848w, https://substackcdn.com/image/fetch/$s_!uMPm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 1272w, https://substackcdn.com/image/fetch/$s_!uMPm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40d9753-f466-4bc3-aaa9-424e72c49ca2_1024x671.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Taken from this <a href="http://projector.tensorflow.org/">tensorflow link</a></p><p><em><strong>Anomaly detection</strong></em> could work if there are enough samples of the unusual occasions we are trying to identify, and these samples are indeed clearly different from the standard (non-anomalous) situation.</p><h2>On the other hand, there are simple implementations for group separation for labeled groups</h2><p>Sometimes it can be quite easy to take samples from different groups and train a machine learning algorithm to recognize which group a new, unlabeled sample belongs to. For example, in psychological behavior studies these techniques have been used by statisticians for many years in order to recognize correlations between different behaviors, combinations of behaviors, and group belonging.</p><h2>How much data will you require for training?</h2><p>More is always better in this case. Still, to be more concrete, it really changes with the problem you are trying to solve. This is usually one of the parts of the algorithm implementation &#8211; examining the specific data at hand and the specific problem to be solved, and seeing what algorithms can work and how much data is required.</p><p>A common issue arises when there is a large data set, but it is not spread evenly. For example, when a company wants to recognize what features lead to more customer sales conversions, the company might have information about thousands of potential customers but only a handful of customers who bought. In this case, it might be hard if not impossible to run any type of machine learning algorithm and get meaningful results. 
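To make the imbalance problem concrete, here is a tiny NumPy illustration with hypothetical numbers: a do-nothing model that always predicts "did not buy" looks highly accurate, yet finds none of the buyers we actually care about.

```python
import numpy as np

# Hypothetical, heavily imbalanced data: 1,000 potential customers,
# of whom only 10 actually bought (label 1).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A trivial "model" that always predicts "did not buy" (label 0).
y_pred = np.zeros(1000, dtype=int)

# Accuracy looks great because the majority class dominates...
accuracy = (y_pred == y_true).mean()

# ...but recall on the buyers (the rare, interesting class) is zero.
recall = y_pred[y_true == 1].mean()
```

This is why, with skewed data, accuracy alone says almost nothing about whether the model has learned the pattern you care about.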
In another example, recognizing heavy machinery malfunctions based on sound, pressure measures, temperature or other physical measures might be impossible if there are only a handful of samples from when the machine was not working right, and the rest of the samples are from when the machine was working well.</p><p>Some benchmarks from our own personal experience:</p><ul><li><p>In forecasting and trend analysis, to be able to recognize a seasonal (yearly) trend, a minimum of 2 years of samples is required. This is because broadly 1 year is used for baseline estimation and another year for trend estimation.</p></li><li><p>In 3D modeling (you can check out <a href="https://medium.com/ai%C2%B3-theory-practice-business/implicit-decoder-3d-reconstruction-9c0f87e80f11">our medium post</a> as an example) &#8211; at least 5,000 3D models of a specific object are required (e.g., 5,000 models of chairs).</p></li><li><p>In video analysis &#8211; when we were working on our lip-reading startup, we saw that we needed a minimum of 70,000 hours of video of people talking (about 10 terabytes of data) to get our neural network to learn anything.</p></li></ul><h2>It&#8217;s not just how much data, it is also how it is formatted</h2><p>Many times, an ML project starts off with cleaning the provided data, changing it, and simplifying its structure. At this stage a lot of bugs and issues are found in the data. There might have been an unexpected issue with how the data was originally prepared or saved, or another issue with how the data was exported. All of this takes time and effort and has to be done very carefully. 
Otherwise, we might train an algorithm on completely incorrect information, not get any good results and blame it on the algorithm instead of on the original training data.</p><p>In projects, 70% of the actual work is at this stage of data re-formatting and testing.</p><h2>Deep learning won&#8217;t solve your problems, unless perhaps you do vision/signal processing</h2><p>Clients sometimes come to us talking about different things they saw online with deep learning and neural networks that they would like us to implement for them. Deep learning is just another type of machine learning algorithm to try out. It usually takes more time and effort to construct and optimize a deep learning algorithm to solve a problem than to use something simpler, such as logistic regression or regular regression (depending on the question at hand). In most cases, it is overkill to implement deep learning. The cases for which deep learning is a must are technically hard ones, such as image analysis, text analysis, signal processing, biological data analysis or other types of projects in which features are complex and there are usually thousands of features per data sample.</p><h2>A killer feature is more important than the algorithm</h2><p>Many times, it is more worthwhile to work on the features, test them out and try to come up with new ones. Usually, if you have a good feature, the simplest algorithm will be enough. 
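</p><p>To make this concrete: with a single strong feature such as the category of the last item sold, even a plain conditional-frequency table (never mind logistic regression) can capture a purchase pattern. A toy sketch with made-up purchase pairs:</p>

```python
from collections import Counter, defaultdict

# Hypothetical (last_category, next_category) purchase pairs.
history = [
    ("table", "chair"), ("table", "chair"), ("table", "lamp"),
    ("sofa", "cushion"), ("sofa", "cushion"), ("table", "chair"),
]

def most_likely_next(pairs):
    """Predict the next purchase category from the last one alone."""
    by_last = defaultdict(Counter)
    for last, nxt in pairs:
        by_last[last][nxt] += 1
    return {last: counts.most_common(1)[0][0] for last, counts in by_last.items()}

model = most_likely_next(history)
print(model["table"])  # chair
```

<p>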
For example, in the earlier scenario of predicting that a customer who buys a table will also buy a chair, if you have the feature &#8211; the category of the last item sold (in our case, table) &#8211; then even a simple logistic regression model might be able to recognize that the next category sold will be chair.</p><p>This means that for an ML engineer, it is more important to be very good at simple data analysis and data engineering than to know all the different ML algorithms and how to implement them.</p><h1>Timelines, pricing and recruiting considerations</h1><p>Initial data examination takes anywhere from 1 hour for the simplest of cases to two weeks of full-time work.</p><p>The time it takes to research specific algorithms or new algorithms released from academia:</p><ol><li><p>At least 4 days to go over the most relevant research papers and information.</p></li><li><p>Between half a day and 2 weeks to implement basic open source code, if it exists.</p></li><li><p>Customizing the algorithms or training on your data might take months, depending on the complexity of the problem at hand, the quality of the training data (or lack of it) and the required KPIs.</p></li></ol><p>In some cases, such as anomaly detection, it is impossible to define actionable technical KPIs because it is never clear how accurate even the best algorithm will turn out to be.</p><p>Custom-made projects can be expensive; prices range from $120 to $300 per hour of work.</p><p>Starting off small, with quick, simple wins, and then progressing once you see the value in implementing machine learning, is usually the way to go. It is advisable not to spend months developing before you see any progress. Try 1-2 months, focus on an achievable short-term goal, maybe even a simple report, with simple tools. 
If this works out, then advance to something more sophisticated.</p><h2>How to recognize the right ML company/consultant/hire</h2><ul><li><p>They tell you the same things written above.</p></li><li><p>They start off by showing you graphs and dashboards over your data instead of diving into development.</p></li><li><p>They say no if they recognize the dataset is not good enough, and they give you tips on how you could still do machine learning if you are interested, and what you need to focus on.</p></li><li><p>They are expensive.</p></li><li><p>They explain basic concepts in simple, easy-to-understand language to help you understand the project, its scope and its limitations.</p></li></ul><p>If you have any questions or interest in doing a machine learning project, feel free to contact us in the section to the left of this page.</p><p>Also, we have launched <a href="https://datask.co">datask.co</a> &#8211; you ask the data, we answer. This is a machine-learning-as-a-service product for those who are interested in implementing machine learning but don&#8217;t have the time or resources to do it themselves. 
Be sure to check it out!</p>]]></content:encoded></item><item><title><![CDATA[Tensorflow 2 Internals – Lessons learned from creating a 50 hours course]]></title><description><![CDATA[I was asked to teach a course about a new version of Google&#8217;s deep learning framework &#8211; Tensorflow version 2 &#8211; to a company of highly technical people creating smart acoustic and microphone systems for recording studios.]]></description><link>https://www.peternaf.com/p/tensorflow-2-internals-lessons-learned-from-creating-a-50-hours-course</link><guid isPermaLink="false">https://www.peternaf.com/p/tensorflow-2-internals-lessons-learned-from-creating-a-50-hours-course</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Mon, 17 Feb 2020 08:36:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was asked to teach a course about a new version of Google&#8217;s deep learning framework &#8211; Tensorflow version 2 &#8211; to a company of highly technical people creating smart acoustic and microphone systems for recording studios. Tensorflow 2 came out during the middle of 2019 and is very different from the previous version, so there was a lot of learning to do and trial and error in the face of unknown code, APIs and bugs. This took me about 3 months of almost full-time work to complete. After the course finished I also asked the students to fill in a review about it so I would have their input.&nbsp;During the process I learned a lot about learning, creative work and presentation. 
I am sharing this here both for my own sake and for anyone else who might benefit from this knowledge.</p><h2><em>How to learn new material</em></h2><ol><li><p>Map information sources</p><ol><li><p>Go online and filter and map all quality sources of material. When the subject is quite new, there are not so many quality information sources online. Making a first map of all the good-quality sources you found gives you both access to information and confidence that you are not going to miss out on information because you didn&#8217;t know of another source. For example, these sources can be:</p><ol><li><p>Youtube videos</p></li><li><p>Official documentation</p></li><li><p>Open source side projects</p></li><li><p>Books</p></li><li><p>Online lectures</p></li><li><p>Conferences/events which have a written online record of what was taught</p></li><li><p>Stack Overflow questions/comments</p></li></ol></li><li><p>There are sources of sources online &#8211; pages that reference other quality information sources, a sort of dynamic online index. These can be Reddit posts, YouTube lectures with references (by the Google Tensorflow team in my case), good GitHub pages and more. 
Go over all of them and map the most relevant and highest-quality information sources you will use.</p></li><li><p>It is also possible to ask friends for reference materials &#8211; in my case, the subject was so new that almost none of my tech friends could help</p></li><li><p>While you do this initial mapping, you will inevitably encounter the most significant topics to learn (and teach). Start writing down recurring topic names/chapters, or topics which seemed extra interesting or relevant even though they are not the main topics.</p></li><li><p>You will also encounter knowledgeable people or specific quality sources; check all their content, you might find more hidden gems there.&nbsp;</p></li><li><p>Mark specific references which you might use in whole or in part while teaching &#8211; either as teaching material or exercise material.</p></li><li><p>In my tensorflow learning I found these resources to be extremely helpful:</p><ol><li><p>Google TensorFlow guides and tutorials</p></li><li><p>Everything by&nbsp;Aur&#233;lien G&#233;ron: <a href="https://github.com/ageron/tf2_course">https://github.com/ageron/tf2_course</a> and his book:&nbsp;<a href="https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/">https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/</a></p></li><li><p>Stanford class&nbsp;<a href="https://web.stanford.edu/class/cs20si/">https://web.stanford.edu/class/cs20si/</a></p></li><li><p>Tensorflow YouTube channel, especially the &#8220;Inside Tensorflow&#8221; playlist &#8211;&nbsp;<a href="https://www.youtube.com/watch?v=yTJ8QydIgVQ&amp;list=PLQY2H8rRoyvzIuB8rZXs7pfyjiSUs8Vza">https://www.youtube.com/watch?v=yTJ8QydIgVQ&amp;list=PLQY2H8rRoyvzIuB8rZXs7pfyjiSUs8Vza</a>&nbsp;and videos from TF conferences.</p></li><li><p>These Reddit posts</p><ol><li><p><a href="https://www.reddit.com/r/MachineLearning/comments/7u1hki/d_recommendations_for_tutorials_on_tensorflow/">https://www.reddit.com/r/MachineLearning/comments/7u1hki/d_recommendations_for_tutorials_on_tensorflow/</a></p></li><li><p><a href="https://www.reddit.com/r/tensorflow/comments/bliecp/best_way_to_learn_tensorflow_20/">https://www.reddit.com/r/tensorflow/comments/bliecp/best_way_to_learn_tensorflow_20/</a></p></li></ol></li><li><p>This GitHub page with a huge amount of resources and references&nbsp;<a href="https://github.com/Amin-Tgz/Awesome-TensorFlow-2/blob/master/README.md">https://github.com/Amin-Tgz/Awesome-TensorFlow-2/blob/master/README.md</a></p></li></ol></li></ol></li><li><p>Filtering</p><ol><li><p>Briefly check each information source to see if it is good or just someone posting content more for self-promotion/marketing and less for giving real value and insight about a topic</p></li><li><p>Make sure you only have the best references for each sub-topic. If a subtopic has more than 3 information sources, save all of them, use only the best/main 3, and if you need extra info you will know where to look.</p></li><li><p>While you are filtering you will start to learn the basic concepts of what you are learning, and you will start to get a feel for what you should focus more on later and what is extra interesting.</p></li><li><p>Going over tensorflow materials I found that a lot of them might say they are for tensorflow 2 when actually they are for tensorflow 1. Also, a lot of the materials were brief and shallow explanations without real, proper content and value. This filtering step took me about two days to finish.<br></p></li></ol></li><li><p>First reading</p><ol><li><p>White papers/architecture reviews &#8211; read thoroughly. I started off my tensorflow learning process by reading Google&#8217;s 2015 whitepaper for the framework. 
Interestingly, I also finished learning tensorflow with this white paper, after I went over all the other material and presentations. When I started learning, the white paper helped me understand which important/interesting/complicated topics I would encounter and what I should teach. I ended with the white paper because it had in-depth explanations of the specifics of the framework which were hard to find elsewhere.</p></li><li><p>Official documentation &#8211; skim. The purpose is:</p><ol><li><p>Check what will be hard and tricky to learn (which means it will consume a lot of time and effort) and what will be easy</p></li><li><p>Check what are the most important things to focus on first &#8211; the basics of the main subject on which all other sub-topics depend</p></li></ol></li><li><p>Do at least one significant, non-trivial tutorial, to start &#8220;feeling&#8221; the topic better and raising more questions.<br></p></li></ol></li><li><p>In-depth reading</p><ol><li><p>Once you have listed, in order, the important things to focus on, start going over them one by one, using the references which you found.</p></li><li><p>This stage can be unexpectedly long and hard, but it is important for your overall familiarity with what you are learning and for your confidence in what you learned. Do not skip and/or rush it.</p></li><li><p>Write down short summaries and the most significant sub-topics you encounter, along with links to the reference materials where you learned of these sub-topics. This will be the basis of what you will be teaching later.</p></li><li><p>If you find a specific sub-topic is taking you longer than expected, or the sources you find don&#8217;t cover it well, this is a good sign. 
It means you have found an area where you can learn something new that not many people online know, and you can teach it later.</p><ol><li><p>If it is an immediately important sub-topic &#8211;&nbsp;focus on it, understand it and, if required, read the open source code until you familiarize yourself with the topic. Take notes during your learning; these will be your source of information (and your audience&#8217;s).</p></li><li><p>If it is not immediately important &#8211; write this subtopic down in your open items and make sure you have a task to come back and research it later</p></li></ol></li><li><p>While you are learning, you will encounter questions and things you don&#8217;t understand. If it is part of your learning flow, go ahead and explore them. If not, write the open questions you have in your notes and continue learning the main thing you set out to learn. These questions are important. Your real understanding comes from asking these questions (which means you understand what you are learning and you try to transfer it to something new that wasn&#8217;t completely explained). Also, when you later answer these questions, both the search for the answer and the final answer will greatly enlarge your overall understanding and your confidence in your own understanding.</p></li><li><p>Check back on previously opened questions which might be related to a subtopic you are now learning; chances are you will be able to answer them. If you see you are still stuck even after the in-depth reading, this might be related to point 4 above, or you might need to learn another sub-topic before you can answer the question.</p></li><li><p>Practice and recall what you learned &#8211; Try out things by yourself (your own code) based on what you learned, and see if they work or not. 
Try out special use-cases or esoteric implementations which you thought of but are not written in your references &#8211; this will both help you remember the subject better and raise issues you are not sure about in your knowledge. Also, these could be good class examples or exercises later.</p></li><li><p>While you are learning, if you encounter open questions people posted which you find answers to &#8211; post the answers online; it will help your own understanding, will help others and will promote you.<br></p></li></ol></li><li><p>Points 1-3 can be executed simultaneously, interleaving them until you cover all the interesting references. Point 4 needs to be handled by itself in a focused manner.</p></li><li><p>Summarize, log or write down things you are not sure of. Later, when you learn more deeply, you can delete these summaries or change them according to your newfound knowledge. You can mark things you are unsure of (I use triple question marks &#8211; ???). Also, writing it down allows you to think it through, understand better and remember what you learned (even if partially). Don&#8217;t be afraid to &#8220;waste&#8221; your time writing down something you might delete later.</p></li><li><p>Bugs in the framework can be very frustrating and time-consuming. When code is not working as expected, first confirm it is actually broken by minimizing it as much as you can and checking the documentation to see whether it should work. Then, search Google and GitHub to see if anyone else wrote anything about it. If there is nothing written, you will need to double-check yourself to make sure it is a bug.</p></li><li><p>Most of my personal tensorflow learning was through encountering bugs when I was trying to run my own code, from having the class ask me in-depth questions and from just experimenting with different graph structures and different ways to write similar functionalities. 
For example, everything I was running in eager mode I also tried running in graph mode, and vice versa. One of the trickiest parts to learn was understanding what goes on under the hood with tf.data (I relied heavily on the Inside Tensorflow YouTube video about it &#8211; <a href="https://www.youtube.com/watch?v=kVEOCfBy9uY&amp;list=PLQY2H8rRoyvzIuB8rZXs7pfyjiSUs8Vza&amp;index=6">https://www.youtube.com/watch?v=kVEOCfBy9uY&amp;list=PLQY2H8rRoyvzIuB8rZXs7pfyjiSUs8Vza&amp;index=6</a>). One of the things that wasted a lot of my time was a bug with TF2.1 (which was released while my course was running) and TensorBoard &#8211; it wasn&#8217;t profiling GPU usage. It took me a full day to accept that this was a bug and not something I was doing wrong, and an extra day (and $300 in cloud server charges) to set up a working environment with TF2.0 instead of TF2.1.</p></li></ol><h2><em>What I learned about preparing presentations for each topic</em></h2><p>The cool thing is that if the learning part was done right, this next part is straightforward. You already have most of the materials you want to talk about and present. You also have references to interesting things to show in class, and you have a breakdown of the information dependencies between the different subtopics. You have also already gone through many presentations on the same topics, so you can follow the lines of the good ones. Along with powerpoint presentations, I use a jupyter notebook to run python code in class. 
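</p><p>One kind of notebook snippet that reliably sparked discussion was tf.function tracing &#8211; asking the class to predict how many times the Python body runs. A minimal sketch (assuming TensorFlow 2 is installed; the counter is just for illustration):</p>

```python
import tensorflow as tf

trace_count = [0]

@tf.function  # compiles the Python function into a graph on first call
def square_sum(x, y):
    trace_count[0] += 1  # Python side effect: runs only while tracing
    return tf.reduce_sum(x * x + y * y)

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])
square_sum(a, b)  # first call: traces the function, then runs the graph
square_sum(b, a)  # same input signature: the cached graph is reused
print(trace_count[0])  # 1 -- the Python body was traced only once
```

<p>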
The jupyter notebook allows you to show how code runs in real time, and when exceptions occur and why, and also allows an interactive way of teaching in which you can challenge the class with questions about how specific code will run and see the results live.<br>Guidelines:</p><ol><li><p>Put all your summaries and reference materials in front of you in a notebook or in powerpoint</p></li><li><p>Add any other points you think are relevant for the presentation</p></li><li><p>At this stage you will see roughly how long a presentation for one topic will take</p></li><li><p>Separate all the material into:</p><ol><li><p>Presentation &#8211; Powerpoint/slides</p></li><li><p>Presentation &#8211; colab/jupyter notebook code</p></li><li><p>Exercises</p></li><li><p>Bonus information/practice</p></li></ol></li><li><p>Fully create the presentation slides&nbsp;</p><ol><li><p>Write notes to yourself which you will use in class</p></li><li><p>Especially write notes if you are going to present something that connects to something later on in the presentation</p></li><li><p>Explain every keyword you talk about</p></li></ol></li><li><p>Fully create jupyter code notebooks</p><ol><li><p>Try to keep text to a minimum, but keep enough of it so that you will remember what to talk about</p></li><li><p>Make sure all dependencies are installed as part of the notebook execution flow &#8211; good both for teaching about dependencies and for making sure your code will run on any machine.</p></li><li><p>Separate the topic into sections &#8211; this helps organize thoughts and manage time while presenting (maybe you will want to skip some sections if you run out of time)</p></li><li><p>Along with the jupyter notebook, you should have a small reference notebook with things you should talk about in class while presenting the notebook.</p></li></ol></li><li><p>Repeat 1-6 until you feel that you have covered everything for the topic you wanted</p></li><li><p>Points 1-2 should be done separately from points 3-5. 
Points 1-2 require imagination, abstract thinking and experimentation; if you combine them with the technical details of creating and editing the presentation you will get frustrated and won&#8217;t be able to do an effective job.</p></li><li><p>While you build your presentation you might find out you are missing knowledge about a specific sub-topic; go back to the in-depth part above and research that topic.</p></li><li><p>Create buffer material which you will teach if there is extra time. Your original time estimates will never be completely accurate. If you run out of time you can start off the next class from where you stopped. If you finish early, the extra buffer material will help you fill that void.</p></li></ol><h2><em>What I learned about presenting</em></h2><ol><li><p>The focus should be on having people understand what you are teaching, not on covering all the materials. It is ok (and advised) to skip/miss parts if it means people will understand what you teach better</p></li><li><p>Take pauses when presenting, laugh with the audience, tell stories. Actively think in advance about what stories you will tell during the presentation &#8211; it fills time, gives the audience a rest and adds fun</p></li><li><p>Along with regular break times, add mini breaks during your presentation. In these mini breaks you talk about or show something cool which relates to the subject being taught. This both helps to refresh people and interests them further in what you are teaching. For example, I showed cool research videos of the latest things in deep learning and deepfakes. Later, in the reviews, the students said this was a cool addition to the course.</p></li><li><p>If something interests you, it will interest the audience, both because it is an interesting topic and because of your personal interest and excitement</p></li><li><p>About making the presentation perfect &#8211; error-free, interesting and smooth: the audience is extremely focused on trying to grasp and understand new concepts. 
They don&#8217;t notice the imperfections in your presentation; they notice the imperfections in their understanding of the material. Don&#8217;t worry about making it perfect &#8211; worry about making it, and making it interesting for you, while putting as much effort as you can into bringing the audience value.</p></li><li><p>In my course, I thought I did a poor job of explaining everything. But the students told me that the tools I gave them would be very valuable for their work, and that the code samples were very interesting for them.</p></li></ol><h2><em>About time management in the face of unknown materials or tasks</em></h2><p>When you start working on an unfamiliar subject, there are many things you are not familiar with &#8211; many things that can suck up your time without you being prepared. Some guidelines on how to manage that:</p><ol><li><p>Briefly skim the material you need to learn/present beforehand to have a rough estimate of how complicated it is. This is why the first stage of mapping your data sources when learning is so important.</p></li><li><p>If there are many tutorials and explanations about a subject, you can estimate time quite accurately. The risk is in covering subjects which don&#8217;t have a lot of easily accessible information &#8211; these usually consume most of the time.</p></li><li><p>It is sometimes scary or demotivating to work only to find out you have much more work than expected. To handle that, keep a very elaborate task list with priorities and always focus on the small next step &#8211; just on the most important next small thing. Start gaining small wins; these will help you get the motivation to cover an unexpectedly large topic.</p></li><li><p>If you catch yourself studying deeply into a subject that might not be so important, stop. Write it down to check later and address the next subject on the list. 
You might find a subject is not so important only after you dive into it and find out it is raw, buggy or not the main focus of what you are studying/teaching.</p></li><li><p>Always remember, you don&#8217;t have to know all the exact details of everything. You need to know enough to get by and enough to teach your class.</p></li><li><p>If you are stuck and can&#8217;t progress, a helpful tip is to write down a question for yourself: &#8220;why am I stuck?&#8221; and answer it with at least 5 different answers. You will find out exactly what&#8217;s blocking your path and will be able to tackle it head-on and &#8220;unstuck&#8221; yourself.</p></li></ol><h2><em>Archiving and references</em></h2><ol><li><p>You should use a notebook or another method of writing and archiving for:</p><ol><li><p>Summaries of important, or at first seemingly important, subtopics and subjects, with references to where you learned them</p></li><li><p>Reference materials</p></li><li><p>Ideas to explore/open questions</p></li><li><p>Tasks</p></li></ol></li><li><p>I used 3 Evernote notes:</p><ol><li><p>&#8220;resources&#8221; &#8211; All the references I found and my initial raw ideas</p></li><li><p>&#8220;tasks&#8221; &#8211; The next tasks I should handle and any open items/unclear things I want to learn</p></li><li><p>&#8220;lecture&#8221; &#8211; The structure of what I will present, the subtopics, main concepts, exercises and bonus references for class</p></li></ol></li></ol>]]></content:encoded></item><item><title><![CDATA[Implicit-Decoder part 2 &#8211; 3D generation]]></title><description><![CDATA[Intro After Implicit-Decoder part 1 &#8211; 3D reconstruction, this time we talk about 3D generation and the limitations of deep learning for 3D.]]></description><link>https://www.peternaf.com/p/implicit-decoder-part-2-3d-generation</link><guid isPermaLink="false">https://www.peternaf.com/p/implicit-decoder-part-2-3d-generation</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Sat, 16 Nov 2019 17:19:47 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!w7UI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Intro</h2><p>After <a href="https://2d3d.ai/index.php/2019/10/11/implicit-decoder-part-1-3d-reconstruction/">Implicit-Decoder part 1 &#8211; 3D reconstruction</a>, this time we talk about 3D generation and the limitations of deep learning for 3D.<br></p><h2>3D generation</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nqxx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nqxx!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 424w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 848w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 1272w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nqxx!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif" width="320" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;3D Airplane Generation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="3D Airplane Generation" title="3D Airplane Generation" srcset="https://substackcdn.com/image/fetch/$s_!nqxx!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 424w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 848w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 1272w, https://substackcdn.com/image/fetch/$s_!nqxx!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b5d362-c75e-400b-8ad2-c35c57a7ecbd_250x250.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"></button></div></div></div></a><figcaption class="image-caption">3D Airplane Generation</figcaption></figure></div><p>Remember <a href="https://2d3d.ai/index.php/2019/11/11/the-deep-learning-dictionary/#What-are-generative-networks?">GANs</a>? Well, this same technique can be used to generate the airplanes shown in the animation above.</p><p>How does it happen? The trick is to use the same decoder network seen below &#8211; specifically, the same decoder that was trained along with the encoder. We train a GAN network to generate fake z-vectors.</p><p>The discriminator gets as input real z-vectors from the encoder-decoder network along with fake z-vectors from the generator network. The generator network is trained to produce new z-vectors based only on a random input. 
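</p><p>The morphing shown in the gif comes down to blending two such z-vectors before decoding. A sketch of that linear interpolation with numpy, where z_start and z_end stand in for two real encoded latent vectors:</p>

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps):
    """Blend z_new = alpha*z_start + (1-alpha)*z_end for alphas from 1 down to 0.
    Each z_new would then be fed to the trained decoder for an interim model."""
    return [alpha * z_start + (1 - alpha) * z_end
            for alpha in np.linspace(1.0, 0.0, steps)]

z_start, z_end = np.zeros(3), np.ones(3)  # toy stand-ins for latent codes
blends = interpolate_latents(z_start, z_end, steps=5)
print(blends[2])  # halfway between the two shapes: [0.5 0.5 0.5]
```

<p>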
Since the decoder knows how to reconstruct a 3D model from an input z-vector, and the generator is trained to produce z-vectors which resemble real ones, new 3D models can be reconstructed using both networks combined.</p><p>Also, we can see that the gif shows one airplane model morphing into a new one. This is done by taking the z-vectors for the first and last 3D models &#8211; let&#8217;s call these vectors z_start and z_end &#8211; and calculating new z-vectors as a linear combination of z_start and z_end. Specifically, a number (let&#8217;s say alpha) between 0 and 1 is picked and then a new z &#8211; z_new &#8211; is calculated: z_new = (z_start*alpha + z_end*(1-alpha)). Then z_new is fed into the decoder network and the interim 3D models can be calculated.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DhZ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DhZ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 424w, https://substackcdn.com/image/fetch/$s_!DhZ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 848w, https://substackcdn.com/image/fetch/$s_!DhZ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DhZ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DhZ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DhZ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 424w, https://substackcdn.com/image/fetch/$s_!DhZ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 848w, https://substackcdn.com/image/fetch/$s_!DhZ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DhZ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc889cd78-36b9-45a5-99b9-3dd2b72aec24_716x275.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Encoder-Decoder</figcaption></figure></div><p>The reason that there is such a smooth transition between the different 3D models is that the implicit-decoder network is trained to recognize underlying 3D constructs of models based on z-vector, and more specifically, models in a specific model category. Therefore, a small change in z_vector will lead to a small change of the 3D model but still keep the 3D structure of the model category, that way it is possible to continuously change the model from z_start to z_end.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w7UI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w7UI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 424w, https://substackcdn.com/image/fetch/$s_!w7UI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 848w, https://substackcdn.com/image/fetch/$s_!w7UI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w7UI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w7UI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png" width="790" height="332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!w7UI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 424w, https://substackcdn.com/image/fetch/$s_!w7UI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 848w, https://substackcdn.com/image/fetch/$s_!w7UI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w7UI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa21d1c2-525b-4abe-a427-f654fa0ecd03_790x332.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Generating z-vectors</figcaption></figure></div><h3>Limitations of deep learning and 3D reconstruction/generation</h3><p>Results like these, shown here and elsewhere, make neural networks seem all-capable, easy to use and able to generalize to other scenarios, use cases and products. Sometimes that is the case, but often it is not. 
In 3D generation and reconstruction, neural networks come with limitations. To name a few:</p><h4><em>Dataset limitations</em></h4><p>As we have demonstrated, the neural network requires separate training for each model category. There need to be enough models&nbsp;(usually hundreds at minimum) in each category, and enough categories, to allow any real-life application of this type of network. ShapeNet is doing this work for the academic world, and even there the number of categories and of models per category is limited. To make it commercially viable, we will need more models and more categories. Also, each model needs to be labeled with its exact category, aligned, translated and scaled accurately, and saved in the right format. In addition, for each model we need images from different angles, with different lighting positions and camera parameters, and at different scales, alignments and translations. Again, ShapeNet and other research initiatives help build this infrastructure to support scientific progress, but it also means there is a lot of dataset creation and processing overhead in turning this research into a product.</p><h4><em>Accuracy Measures</em></h4><p>A recurring question is how accurate the 3D reconstruction or generation is. The counter-question is: how do you measure accuracy in 3D reconstruction? Say a human 3D designer reconstructs a 3D model from an image &#8211; how can we tell whether their work is accurate? 
Even if we have the original 3D model, how can we say that two 3D models are similar, or that the reconstructed model is similar to the original, and how can we quantify this similarity? The old-school methods, such as MSE, IoU, F1 score, Chamfer and Normal distance (see <a href="https://2d3d.ai/index.php/2019/10/09/3d-scene-reconstruction-from-single-image/">3D scene reconstruction from a single image</a>), are straightforward measures that don&#8217;t account for the 3D structure of the object. For example, IoU measures how much of the volume of the reconstructed 3D shape overlaps with the original shape, relative to the combined volume of both shapes. If the reconstructed shape is translated to a different region of space, the IoU might be zero (because there is no overlap) even if the shapes are identical.</p><p>In the implicit decoder paper, the authors use a different measure of 3D shape similarity &#8211; <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-8659.00669">LFD</a>. This measure is invariant to the model&#8217;s scale, alignment and position (translation). 
The basic idea is to take&nbsp;10 silhouette images of the model from angles on a dodecahedron and 10 different&nbsp;dodecahedrons per model.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tQee!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tQee!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 424w, https://substackcdn.com/image/fetch/$s_!tQee!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 848w, https://substackcdn.com/image/fetch/$s_!tQee!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 1272w, https://substackcdn.com/image/fetch/$s_!tQee!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tQee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png" width="530" height="213" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:213,&quot;width&quot;:530,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tQee!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 424w, https://substackcdn.com/image/fetch/$s_!tQee!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 848w, https://substackcdn.com/image/fetch/$s_!tQee!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 1272w, https://substackcdn.com/image/fetch/$s_!tQee!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd768ed0c-bba7-4c37-8989-10e2f49825e6_530x213.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">10 dodecahedrons</figcaption></figure></div><p>Then, when comparing between two models, compare the visual similarity of the images from these 10&nbsp;dodecahedrons using&nbsp;<a href="https://en.wikipedia.org/wiki/Fourier_series">Fourier </a>and <a href="https://en.wikipedia.org/wiki/Zernike_polynomials">Zernike 
</a>coefficients.</p><h3>References</h3><ul><li><p>Implicit Decoder: Chen, Zhiqin, and Hao Zhang. &#8220;Learning implicit fields for generative shape modeling.&#8221; <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>. 2019.</p></li><li><p>Shapenet: <a href="https://www.shapenet.org/">https://www.shapenet.org/</a></p></li><li><p>LFD: Chen, Ding&#8208;Yun, et al. &#8220;On visual similarity based 3D model retrieval.&#8221; <em>Computer graphics forum</em>. Vol. 22. No. 3. Oxford, UK: Blackwell Publishing, Inc, 2003.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The deep learning dictionary]]></title><description><![CDATA[Important and recurring phrases in neural networks and what do they mean]]></description><link>https://www.peternaf.com/p/the-deep-learning-dictionary</link><guid isPermaLink="false">https://www.peternaf.com/p/the-deep-learning-dictionary</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Mon, 11 Nov 2019 11:27:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cdeafc84-c7c6-4a4b-b288-45478de4d2b7_1024x342.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In previous and future posts I am referring to different terms from the AI and deep learning world, for example, encoding-decoding in:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;df646b32-6033-4f9b-952d-d3bfd2dc1a54&quot;,&quot;caption&quot;:&quot;Back again with another AI and 3D reconstruction post for you This time, a special article, with many cool discoveries, I might write following posts about it. This is the highest quality 3D reconstruction from 1 image research I have seen yet. 
An encoding-decoding type of neural network to encode the 3D structure of a shape from a 2D image and then de&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Implicit-Decoder part 1 &#8211; 3D reconstruction&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:57204161,&quot;name&quot;:&quot;Peter Naftaliev&quot;,&quot;bio&quot;:&quot;Software engineer, entrepreneur | \&quot;The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.\&quot; Michelangelo [not the turtle] &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b34314b-8678-4d52-a950-76a8e7d5cc9c_2779x2779.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2019-10-11T12:34:30.000Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7e4f6a4-b7f1-4432-a170-5b14fb7f4feb_716x275.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.peternaf.com/p/implicit-decoder-part-1-3d-reconstruction&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:173411815,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Peter&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!bgK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7342014e-5a1f-4978-89ba-11439da362dd_144x144.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In this post we will explain their meaning. 
This post will be updated over time to cover terms that are missing here and new terms that become standard in the industry.</p><h3>What is encoder-decoder?</h3><ul><li><p>Encoder &#8211; Maps input data (features) into a different representation. Usually the representation lives in a lower-dimensional space, allowing both compression of the input data and a more efficient representation of its important parts.</p></li><li><p>Decoder &#8211; Maps encoded data into output data. The decoder is trained to understand the underlying representation of the original, pre-encoded input based on the encoded features, and can produce output based on this underlying representation.</p></li></ul><h4>What is autoencoder?</h4><p>An autoencoder is a type of encoder-decoder network where the decoder maps the encoded data back to its original input structure. Why is this an interesting type of network?</p><ol><li><p>It allows compressing data (for example, images) and then reconstructing the original data with low loss.</p></li><li><p>Training an autoencoder is a good way to turn the encoder network into a dimensionality reduction tool (much like PCA) that represents the input data as a lower-dimensional vector while keeping the information about its important parts.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2_y6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2_y6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2_y6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2_y6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2_y6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2_y6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg" width="299" height="168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:168,&quot;width&quot;:299,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2_y6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2_y6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2_y6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2_y6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdfd9bdf-2fdd-4937-96ce-97a66a5f1c56_299x168.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Autoencoder</figcaption></figure></div><h3>What is RNN (recurrent neural network)?</h3><p>Neural networks are usually considered as a single input-output calculation. You put input information, the network runs and brings out an output. But, what if we have an input with uncertain size or length. For example, what if the input is textual, and can be constructed from 10 words or another time from 1000 words? And what if the output size is also uncertain? For example, the input text is in English and we want a French translation output which can be constructed from 10 words or another time from 1000 words?</p><p>This is where recurrent neural networks come to play. In recurrent neural network the structure of the network allows it to get one piece of input at a time and save history from previous calculations of the network. This can allow our translation example to work as follows: The network gets as an input only one word each time in English and can output only one word each time in French, keeping history of calculations for previous words. This allows for dynamically changing length of input\output while also using the context of previous input\output to understand what should be next. 
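The word-at-a-time loop described here can be sketched as a minimal recurrent step in NumPy. Everything concrete below (the 8-dim word vectors, 16-dim hidden state, random weights) is illustrative only, not a real translation model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 8-dim word vectors, 16-dim hidden state.
EMB, HID = 8, 16
W_xh = rng.normal(scale=0.1, size=(HID, EMB))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(HID, HID))  # hidden-to-hidden (the "history")
b_h = np.zeros(HID)

def rnn_step(x_t, h_prev):
    """One recurrent step: combine the current input with the saved history."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Feed a variable-length "sentence" of word vectors one word at a time.
sentence = [rng.normal(size=EMB) for _ in range(5)]
h = np.zeros(HID)
for x_t in sentence:
    h = rnn_step(x_t, h)  # h carries context from all previous words

print(h.shape)  # (16,)
```

Each call to `rnn_step` folds one more word into the hidden state `h`, which is the saved history of previous calculations that lets input and output lengths vary freely.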
Language sentences usually have a grammatical and semantic structure that depends on the previous and following words in the sentence. Other applications of RNNs include time series prediction, video analysis, movement tracking, robotic sensory input\output, and any type of data that comes as a temporally changing sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mrW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mrW1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 424w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 848w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 1272w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mrW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png" width="1024" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mrW1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 424w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 848w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 1272w, https://substackcdn.com/image/fetch/$s_!mrW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96dcd54c-9db3-4d7b-b1c8-7a3e9feefe5a_1024x342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">RNN</figcaption></figure></div><h4>What is LSTM (long short term memory)?</h4><p>LSTM is a specific implementation of an RNN. LSTM keeps the state of the network, along with output and history of previous calculation. 
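A single LSTM step, in the diagram's Xt / Ht-1 / Ct-1 notation, can be sketched with NumPy; the sizes and random weights below are illustrative only, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 4, 6  # illustrative input and state sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on [x_t, h_prev] concatenated.
W_i, W_f, W_o, W_c = (rng.normal(scale=0.1, size=(H, D + H)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W_i @ z)            # input gate: how much new content to add
    f = sigmoid(W_f @ z)            # forget gate: how much of Ct-1 to keep
    o = sigmoid(W_o @ z)            # output gate: how much of Ct to expose
    c_tilde = np.tanh(W_c @ z)      # candidate new content
    c_t = f * c_prev + i * c_tilde  # input and forget gates combine into Ct
    h_t = o * np.tanh(c_t)          # new output Ht
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
for _ in range(3):
    h, c = lstm_step(rng.normal(size=D), h, c)
print(h.shape, c.shape)
```

The state `c` is the cell memory carried between steps, and `h` is the output that is also fed back as history.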
In an LSTM there are 3 gates that decide how data flows:</p><ol><li><p>Input gate &#8211; Calculates how much of the new data (Xt in the diagram) and old output (Ht-1) to add to the new state (Ct)</p></li><li><p>Forget gate &#8211; Calculates&nbsp;how much of the old state (Ct-1) to keep, based on the new data (Xt in the diagram) and old output (Ht-1)</p></li><li><p>Output gate &#8211; Calculates how much of the new state (Ct) to output and also keep as history in this step of the network calculation, based on the&nbsp;new data&nbsp;(Xt in the diagram) and old output (Ht-1)</p></li></ol><p>The input and forget gates are combined to calculate the new state (Ct).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BHNj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BHNj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 424w, https://substackcdn.com/image/fetch/$s_!BHNj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 848w, https://substackcdn.com/image/fetch/$s_!BHNj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BHNj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BHNj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png" width="1024" height="563" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!BHNj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 424w, https://substackcdn.com/image/fetch/$s_!BHNj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 848w, https://substackcdn.com/image/fetch/$s_!BHNj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BHNj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed785c7d-3bbf-4f5a-88a5-8ea18122fb0e_1024x563.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LSTM</figcaption></figure></div><h3>What is CNN (convolutional neural network)?</h3><p>When we look at a visual scene with our eyes, a similar scanning process occurs in our mind whether we are looking at something large, something narrow, a specific part of the scene, or the entire scene. 
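As an aside, the three LSTM gates just described can be sketched in a few lines of code. This is a toy cell with scalar inputs and made-up weight values; a real LSTM uses learned weight matrices over vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM cell step with scalar inputs (toy sizes for clarity).

    w maps each gate name to an illustrative (weight_x, weight_h, bias)
    triple; these names and values are made up for this sketch.
    """
    # Input gate: how much of the candidate state to add
    i = sigmoid(w["i"][0] * x_t + w["i"][1] * h_prev + w["i"][2])
    # Forget gate: how much of the old state C_{t-1} to keep
    f = sigmoid(w["f"][0] * x_t + w["f"][1] * h_prev + w["f"][2])
    # Output gate: how much of the new state to expose as H_t
    o = sigmoid(w["o"][0] * x_t + w["o"][1] * h_prev + w["o"][2])
    # Candidate state, computed from new data X_t and old output H_{t-1}
    g = math.tanh(w["g"][0] * x_t + w["g"][1] * h_prev + w["g"][2])
    c_t = f * c_prev + i * g   # forget and input gates combine into C_t
    h_t = o * math.tanh(c_t)   # output gate produces H_t
    return h_t, c_t

w = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, w=w)
```

Note how the forget and input gates jointly produce the new state C_t, while the output gate decides how much of it is exposed as H_t: exactly the three-gate split described above.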
Our brain takes a segment from our field of vision and analyzes it in a repeating manner &#8211; searching for patterns, recognizing constructs, recognizing familiar shapes, etc.</p><p>CNNs are a method of reconstructing that same process in a computer neural network. Instead of treating each input feature as a unique part of the entire input, the network is constructed to look at segments of the input in the same manner, no matter which segment, while at the same time understanding the structure of the entire input and how the different features interact with each other.</p><p>A common use of CNNs is visual input analysis (image, video, 3D model, etc.). When analyzing an RGB image, a CNN scans the input image using a moving window with a predefined size and shared neural network weights. Each application of the window condenses the information it sees, so scanning the entire image with the same window creates a smaller representation of the image. Another window with fixed size and weights is then scanned over that more compact representation, and so on, until we decide the representation is compact enough and add layers that perform a specific function we choose. 
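The moving-window step described above can be sketched as a strided 2D convolution. This is a minimal pure-Python illustration; the kernel values, stride, and toy 4x4 image are made up, and real CNNs use many learned kernels and channels:

```python
def conv2d(image, kernel, stride=2):
    """Slide a fixed-weight window over the image. The same weights are
    reused at every position, which is what gives shift invariance.
    Toy version: single channel, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            # Weighted sum of the image patch under the kernel
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 0, 2, 1],
         [0, 1, 3, 1],
         [1, 2, 0, 0],
         [2, 1, 1, 0]]
kernel = [[1, 0],
          [0, 1]]            # one 2x2 window of shared weights
smaller = conv2d(image, kernel)   # 4x4 input -> 2x2 representation
```

Running this shrinks the 4x4 input to a 2x2 representation; stacking such layers yields the progressively more compact representations described above.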
One of the common examples is identifying what is seen in the image, in which case a simple classification neural network structure can be added.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AdDV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AdDV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 424w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 848w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 1272w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AdDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png" width="1024" height="315" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AdDV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 424w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 848w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 1272w, https://substackcdn.com/image/fetch/$s_!AdDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4e89718-40ca-4bb4-a80b-5c4a307e240e_1024x315.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CNN on image</figcaption></figure></div><p>CNNs are important because they allow for a faster and more efficient method to analyze a large sized input (such as large image or video), while also allowing for shift and scale invariance &#8211; no matter in which part of the image an object appears, nor it size (as long as it is seen), it can be identified with the same CNN.</p><h3>What are generative networks?</h3><p>Generative neural networks are networks which are trained to generate new data, new images, new signals, many times based on a random input. Imagine generating a sentence in French (similarly to what we discussed above), but instead of having an English sentence as input, the input can be anything (even random characters in whatever language) and the output will be a clear sentence in French. 
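A sketch of this random-input-to-French-sentence idea. Here the "generator" is hand-written rather than learned (the phrase lists and function names are invented for illustration), but it shows the shape of the mapping: arbitrary random input in, well-formed output sentence out:

```python
import random

random.seed(42)

# Hand-written stand-in for a generative network. A real network would
# learn this mapping from data; here the decoding is hard-coded purely
# to illustrate random input -> structured output.
SUBJECTS = ["Le chat", "Le chien", "La fille"]
VERBS = ["mange", "regarde", "aime"]
OBJECTS = ["la pomme", "le livre", "le soleil"]

def generate(noise):
    # The random input merely selects a path through the output space.
    s = SUBJECTS[noise[0] % len(SUBJECTS)]
    v = VERBS[noise[1] % len(VERBS)]
    o = OBJECTS[noise[2] % len(OBJECTS)]
    return f"{s} {v} {o}."

noise = [random.randrange(100) for _ in range(3)]
sentence = generate(noise)   # a well-formed French sentence
```

A trained generative network learns such a decoding from data instead of having it written by hand.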
By the way, the architecture of this French-generating network can be exactly the same as the architecture of a network that translates English to French.</p><h4>What is GAN (generative adversarial network)?</h4><p>What do the 3 images below have in common? They are all images of non-existent people, generated by a GAN and taken from this website:&nbsp;<a href="https://thispersondoesnotexist.com/">https://thispersondoesnotexist.com/</a></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dd2f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dd2f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 424w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 848w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 1272w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Dd2f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png" width="1024" height="236" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Dd2f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 424w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 848w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 1272w, https://substackcdn.com/image/fetch/$s_!Dd2f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40095930-53f1-4346-a6e8-6e135dc4d4d7_1024x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">GANs &#8211; This person does not exist</figcaption></figure></div><p>GANs are a method of 
creating high-quality generative neural networks. In a GAN we train two neural networks &#8211; a generative network and a discriminator network. The generative network is trained to produce results that can &#8220;fool&#8221; the discriminator into thinking they are real data, while the discriminator is trained to recognize which data is fake and which is real. During training, the discriminator is fed real data labeled as such and fake data labeled as such. The generator receives positive reinforcement every time it manages to &#8220;fool&#8221; the discriminator and negative reinforcement every time it fails. Below is a general overview of a GAN trained to generate fake images.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cDOB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cDOB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 424w, https://substackcdn.com/image/fetch/$s_!cDOB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 848w, https://substackcdn.com/image/fetch/$s_!cDOB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cDOB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cDOB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png" width="790" height="332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cDOB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 424w, https://substackcdn.com/image/fetch/$s_!cDOB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 848w, https://substackcdn.com/image/fetch/$s_!cDOB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cDOB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb56f9dc9-82ff-4be1-bdf7-06992b54b0ba_790x332.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GANs &#8211; Fake image architecture</figcaption></figure></div>]]></content:encoded></item><item><title><![CDATA[Survival guide for writing a Provisional Patent Yourself]]></title><description><![CDATA[Things I learned about provisional patenting while drafting one for an AI software 
algorithm]]></description><link>https://www.peternaf.com/p/survival-guide-for-writing-a-provisional-patent-yourself</link><guid isPermaLink="false">https://www.peternaf.com/p/survival-guide-for-writing-a-provisional-patent-yourself</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Sat, 09 Nov 2019 23:10:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b7217b20-f659-46fa-8eef-f08b5b4e7a0b_1200x643.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post describes the things I learned about patent writing while writing a provisional patent for an AI algorithm. It is based on my own personal notes from writing the provisional, as well as my experience of the process. The post is directed at people who want to defend their invention/IP (intellectual property), people who want to understand where to start and end with provisional patent writing, and people who are wondering whether to hire a patent writer or do it themselves. I am not a patent lawyer, so do not take my advice as authoritative &#8211; consult a patent lawyer for true practitioner input.</p><h3>Prerequisites (all of these are straightforward, with explanations easy to find online):</h3><ol><li><p>Basic knowledge of provisional patents &#8211; what they are good for, why they are used.</p></li><li><p>Know the structure required for a provisional patent (summary, description, claims, etc.)</p></li><li><p>Know the difference between a provisional patent and a patent</p></li><li><p>Do your own patent research and find similar patents in a similar field. This is good both for reference and to understand how you need to write and style your patent</p></li></ol><h3>Mystical phrases and what they mean</h3><h4>Someone who is skilled in the art</h4><p>This phrase is used to describe someone with knowledge of the domain you are writing the patent for. The phrase is used to explain that a certain term you are discussing is familiar in the patent's field. 
For example, the sentence &#8220;a neural network gradient descent learning process will be used as is familiar to someone who is skilled in the art&#8221; indicates that gradient descent is a standard procedure and you didn&#8217;t see the benefit of elaborating on it further in your patent application.</p><h4>Prior art</h4><p>Anything relevant to the subject of your patent that was a previous invention. It could come in the form of a previous patent, a scientific publication, a media publication, a tutorial, a YouTube video and more.</p><h4>An embodiment</h4><p>An embodiment is one realization of the concepts you describe in your patent. For example, if your patent is an algorithm, then an embodiment of this algorithm can be one specific physical implementation of this algorithm on a PC machine, including how it interacts with the input and output devices of the machine. Another embodiment can be an implementation of the same algorithm in the cloud, including how the different cloud components and edge devices communicate to make this algorithm work and be consumed.</p><p>Or, if the algorithm has several different parts or options available, an embodiment will be a specific description of which parts are used. For example, consider a 3D reconstruction algorithm that works from images and can either receive the camera parameters as input or calculate them itself. One embodiment will be a 3D reconstruction algorithm that gets camera parameters as input; another embodiment will be one that calculates the camera parameters itself. 
A third embodiment will be a 3D reconstruction algorithm that can receive the camera parameters, checks whether they were received and are valid, and, in case they were not received or are invalid, calculates them itself.</p><p>In your patent application you want at least one embodiment, preferably the one that is best in your eyes (usually this means the broadest one) &#8211; called the &#8220;preferred embodiment&#8221; or &#8220;best mode embodiment&#8221;. A preferred embodiment demonstrates that the inventor had a specific implementation of the invention in mind at the time of the patent filing, and was not just making up something that cannot feasibly work. If it is a software patent, make sure you have at least one embodiment that specifies the physical implementation of the software &#8211; how it runs on machines, what type of machines, and what the required machine characteristics are. Preferably, put this embodiment in your claims as well.</p><h4>Priority date</h4><p>The date on which you file the provisional &#8211; this is the date that will be used as your patent&#8217;s application date if you ever submit a non-provisional patent. It is important that your priority date be earlier than the competition&#8217;s, since this is what decides who the invention belongs to. The nice thing is that you can file the provisional in one jurisdiction (for example, the US) and then file the patent based on this provisional in many jurisdictions (Europe, China, etc.), keeping that first priority date. You have an extra 60 days after filing the provisional to add any material missing from the provisional and file another provisional, keeping the original priority date. 
This can reportedly be extended to 90 days, but exactly how is unclear and it is not guaranteed.</p><h3>Goals of a patent (provisional or not)</h3><p>The formally stated goals of a patent are:</p><ol><li><p>Securing a priority date for your new invention.</p></li><li><p>Defending your invention from copies in two possible ways:</p><ol><li><p>If you do find someone who copied your invention, you can sue them. If they are unable to prove that their copy is actually an original invention of theirs, you will win the lawsuit and be entitled to compensation.</p></li><li><p>Defending yourself from getting sued by earlier patent holders. If someone has a patent with an earlier priority date for an invention that is similar to yours but not identical, they might try to sue you by challenging your patent. If your patent indeed has something new over the prior art of the suing entity (this novelty has to appear in your claims) and the court confirms this to be so, you are entitled to your invention and can enjoy its fruits.</p></li></ol></li></ol><p>Notice &#8211; your patent and its content do not have any other real tangible meaning. So if no one copies your invention, or someone copies it and you decide not to take action, or you don&#8217;t get sued for infringing on someone else&#8217;s IP, everything written in your patent does not really hold any value. In this case, the patent is much like a security alarm on a private house &#8211; it shows you thought about defense, but doesn&#8217;t mean anything until a real burglar tries to break in. When the burglar does decide to break in, the alarm is truly tested to see if it works; in the patent case, the patent is tested to see if it is defensible.</p><p>Once a (non-provisional) patent is published, everyone has access to it and can see exactly what your invention is and how it works. 
But, presumably, no one is allowed to copy your invention until the patent expires &#8211; that is, if the patent is defensible.</p><p>In the startup world, many people will tell you investors want to see that you have patents as an IP asset, which does seem to be true when seeking investment. Meanwhile, many entrepreneurs and business people will tell you that patents are not interesting, for three main reasons:</p><ol><li><p>No one will sue you while you&#8217;re a small startup. The only real patent suits happen between big companies (the Googles, Samsungs and Apples of the world), and if you&#8217;ve reached a state where you&#8217;re big enough for patent lawsuits, you&#8217;ve done alright for yourself and aren&#8217;t really a startup anymore.</p></li><li><p>Often patents aren&#8217;t really defensible, especially software and chemistry patents, which are easy to tweak a bit to get around the patent&#8217;s defenses &#8211; leaving your invention visible to all and undefended. By the way, legend says this is why Coca-Cola never published its recipe as a patent and keeps it secret instead.</p></li><li><p>When you&#8217;re a startup, you have a million other things to worry about, usually immediate business. Thinking 5-10 years ahead about what could happen if your invention becomes truly big and successful and you end up in an IP battle &#8211; instead of working to make it big and successful &#8211; is a waste of resources.</p></li></ol><h4>Wait, so why write a software patent?</h4><ol><li><p>If your patent is defensible, you have a good IP asset.</p></li><li><p>You can use the patent as a warning flag for competitors (much like a house alarm). Patent lawsuits can be expensive and energy-consuming, so people might be deterred from copying you just because of this threat.</p></li><li><p>It helps when talking to investors.</p></li></ol><h3>Writing guidelines</h3><h4>Claims</h4><p>The claims section is where you distinguish your invention from the rest. 
When your patent is challenged, your claims are checked against the claims of the challenger; if you have a claim that the challenger doesn&#8217;t have, your patent can be defended. When writing a claim it is important to describe how the various components are structured and how they interact and connect. It is necessary to describe the invention so that it is complete, so that it works, but also so that it is different from what is known in the prior art. To define an invention that is new and non-obvious, you must include something in the claim that is not found in the prior art.</p><p>A claim might be:</p><ol><li><p>An improvement of prior art, adding something to an existing invention.</p></li><li><p>An improvement of prior art, removing something from a previous invention to make it simpler, cheaper, lighter, etc.</p></li><li><p>A completely new invention, never existing in part or in whole in prior art.</p></li></ol><p>Your aim should be to have some patent claims you think are unique, but which are exceptionally broad. You should also use dependent claims &#8211; narrower claims that present a specific version of the invention. Use dependent claims to describe all the different options of the invention, and make sure to write dependent claims which you think represent the best version of the invention. This way, you have both broad and narrow definitions of your invention, making it more defensible. When you file a non-provisional, this will force the patent office to consider your invention more seriously.</p><p>At the beginning of the claims section, start with &#8220;I claim,&#8221; or &#8220;The invention claimed is&#8221; and only then start listing the claims. Each claim must begin with a capital letter and end with a period. Periods may not be used elsewhere in the claims except for abbreviations. This means each claim can be only one sentence. 
This is true regardless of how tortured the sentence structure is and how incomprehensible the sentence may be to those not trained in patent claim drafting. When drafting a claim start with something like this: 1. A {insert title} comprising: {list the parts one by one} {then explain how each are connected}</p><p>Where a claim sets forth a plurality of elements or steps, each element or step of the claim can be separated by a line indentation. It is possible to enumerate the claims with numbers, and reference claims in their dependent-claims using these numbers. Reference characters and numbers from the description and drawings can be used in the claims also, enclosed within parentheses.</p><blockquote><p>Example of the first two claims (regular and dependent) from a patent by google for 3D search (US 8,686,992 B1)</p><p>What is claimed is:<br><br>1. A computer implemented method of 3D shape retrieval from a query 3D model, comprising: extracting, by one or more processing device, a plurality of features of the query 3D model; generating, by the one or more processing devices, a representation of the query 3D model; calculating, by the one or more processing devices, a first correlation by combining first coefficients associated with the representation of the query 3D model and second coefficients associated with representations of 3D models in the repository to obtain a first output and calculating an inverse rotational Fourier transform of the first output to obtain the first correlation, wherein a number of the first and second coefficients depends on a specified first bandwidth associated with the transform; calculating, by the one or more processing devices, a first similarity score based on the correlation; ranking, by the one or more processing devices, the 3D models based on the first similarity score; calculating, by the one or more processing devices, a second correlation by combining third coefficients associated with the representation of the query 3D 
model and fourth coefficients associated with representations of 3D models in the repository to obtain a second output and calculating an inverse rotational Fourier transform of the second output to obtain the second correlation, wherein a number of the third and fourth coefficients depends on a specified second bandwidth associated with the transform, the second bandwidth being higher than the first bandwidth; calculating, by the one or more processing devices, a second similarity score based on the second correlation; ranking, by the one or more processing devices, the 3D models used in the second correlation based on the second similarity score; and returning, by the one or more processing devices, one or more 3D models. <br><br>2. The computer implemented method of claim 1, further comprising: determining, by the one or more processing devices, a plurality of matching scores between the query 3D model and the 3D models in the repository for each rotational alignment of the query 3D model; and selecting, by the one or more processing devices, the highest score from the plurality of matching scores based on the determining step.</p></blockquote><h4>Description</h4><p>Following are things to consider, write, note and remark on when writing the description part of your patent.</p><p>Specification of how the invention/software works:</p><ul><li><p>How does the software operate from the perspective of the computer, not the perspective of the user? Describe the overall computer architecture of the system within which the software will exist. Define the invention in terms of an overall system that has tangible components. Explain how things will run and how the process will be implemented (hardware, processor, software architecture). Describe as many tangible things as possible. What are those tangible components? 
Databases, servers, receivers, transmitters, memory?</p></li><li><p>Explain each technical detail of achieving the goals of the invention in its own section.</p></li><li><p>Describe the desired functionality, including the different paths the process can take (things not working as expected), and then describe how to reach that desired functionality.</p></li><li><p>How are things connected, and how do they interact? What are the alternatives for making, connecting, and interacting?</p></li><li><p>The description of a software\algorithm patent should be enough for someone who is skilled in the art &#8211; a code developer &#8211; to be able to write the code that implements the invention.</p></li><li><p>If possible, add code\pseudocode samples.</p></li></ul><p>General description writing guidelines:</p><ul><li><p>Write as if the invention is complete and everything was tested and validated.</p></li><li><p>Explain how the goal of the invention is achieved.</p></li><li><p>Write simple explanations that a reasonably educated person can understand.</p></li><li><p>Describe every possible version of the patent, even those that make less sense, as long as they can work in any way:</p><ul><li><p>Describe the single best and most complete way to make your invention, including any and all options, preferences, constructs, processes and more.</p></li><li><p>Describe how to make your invention in a way that leaves out all options, constructs, and processes except those that are absolutely necessary for the invention to work.</p></li><li><p>Add the best mode embodiment.</p></li></ul></li><li><p>Define any term you use exactly, so there won&#8217;t be any possible ambiguity. 
The specification should serve as a glossary to the claim terms so that those who read the patent can clearly ascertain the meaning of the claim terms.</p></li><li><p>Explain any non-obvious or counter-intuitive steps, connections or limitations.</p></li><li><p>Pay particular attention to any preparations that may be necessary prior to beginning the making or using process.</p></li><li><p>Explain how to use the invention. Think of other ways the invention can be used, even if they are inferior. What are the functions or features that consumers will identify as an advantage?</p></li></ul><p>References to prior art:</p><ul><li><p>Add examples of different previous techniques.</p></li><li><p>Explain what is specifically unique compared to the prior art.</p></li></ul><h4>Photos and diagrams</h4><ul><li><p>Flowcharts\Diagrams to prepare:</p><ul><li><p>A single flowchart that depicts the overall working of the software.</p></li><li><p>A series of flow charts that show with painstaking detail the various routines and subroutines that together connect to create and deliver the complete functionality of the computer system as enabled by the software.</p></li></ul></li><li><p>25-30 mm spacing from left and right of page.</p></li><li><p>20-25 mm spacing from top and bottom of page.</p></li><li><p>Number the blocks in the diagram.</p></li><li><p>Number everything (1/10&#8230;).</p></li><li><p>It is better to have black-and-white images and diagrams than colored ones.</p></li></ul><h3>General patent drafting guidelines</h3><ul><li><p>Do not publish anything about your invention on any medium (online, YouTube, lecture, document to customer, news article, social media post, etc.) before you have filed the provisional; otherwise, if your patent is challenged, it might not be defensible, since the invention was already public knowledge.</p></li><li><p>A provisional patent is kept private until the final patent application is filed. 
Even then, no one reviews a provisional patent unless the patent is challenged. Therefore, write as much as you can and as deeply as you can in the provisional. You can later decide whether or not to file the regular patent based on it. You can choose whether to keep the information private, but if you do file the final patent, an elaborate provisional will help you defend it better.</p></li><li><p>Anything you write in the provisional patent will be considered in case the patent gets challenged. Especially in the European and Chinese patent offices, if you didn&#8217;t specify something, or didn&#8217;t elaborate enough, in the provisional, they will consider what you wrote in the provisional patent, even if you later specified it in the final patent application.</p></li><li><p>A single provisional patent can be the source of several filed non-provisional patents, all using the priority date of that first provisional patent.</p></li><li><p>Try to avoid relative terms (such as approximately, closely, substantially). If you have to use them, make sure you define what the relation means.</p></li><li><p>Don&#8217;t be too definite. For example, instead of saying something like &#8220;the <strong>only thing</strong> that makes the present invention unique is&#8230;&#8221;, it is better to say something like &#8220;<strong>one of the things</strong> that makes the present invention unique is&#8230;&#8221;</p></li><li><p>Use the phrasing &#8220;one or more&#8221; for things that can be singular or plural.</p></li><li><p>Do not describe your invention as simple; instead, describe it as elegant.</p></li><li><p>Before writing the patent description, start with writing all your claims. 
This will help you understand later what you should elaborate on and explain more in the description.</p></li><li><p>Make sure fonts are embedded in the PDF, or use an image PDF instead of a textual one.</p></li><li><p>AI patents in computer vision are trendy nowadays, getting higher approval rates than other software patents. So, if you can set your field of invention to something related to this, it might help later with your patent approval. (<a href="https://www.kilpatricktownsend.com/en/Insights/Publications/2019/4/PatentingTrendsStudy">https://www.kilpatricktownsend.com/en/Insights/Publications/2019/4/PatentingTrendsStudy</a>)</p></li><li><p>This blog has a lot of good information, but it is very repetitive and there is a lot of marketing content in between, so it takes time to find the valuable pieces: <a href="https://www.ipwatchdog.com/2017/05/27/invention-to-patent-101-everything-know-get-started/id=83792/">https://www.ipwatchdog.com/2017/05/27/invention-to-patent-101-everything-know-get-started/id=83792/</a></p></li></ul><h3>Concluding note</h3><p>It&#8217;s better to submit something quickly and fix it or add to it later than to spend weeks drafting a perfect provisional (which won&#8217;t be perfect anyway, since you are not an accomplished professional patent writer). 
Set a specific date to finish writing the provisional and stick to it. I would recommend up to 12 days for the entire process if it is your first time (less if you&#8217;ve already done this before).</p>]]></content:encoded></item><item><title><![CDATA[3D search engine]]></title><description><![CDATA[This time a post about something we developed &#8211; a 3D search engine.]]></description><link>https://www.peternaf.com/p/3d-search-engine</link><guid isPermaLink="false">https://www.peternaf.com/p/3d-search-engine</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Sun, 13 Oct 2019 08:55:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This time a post about something we developed &#8211; a 3D search engine.</p><p>The user puts in an image of an object, and the engine finds, in an existing repository of 3D objects, the objects most similar to the one in the photo. In this demo&#8217;s repository there are 5000 models of chairs.</p><h2>How does it work?</h2><p>The technology works by mapping a 2D image into a representation of the underlying 3D features of the object within the image. Similar features are extracted from the 3D objects. Then, a similarity engine matches the 3D features from the image against the 3D features from the 3D objects.</p><p>The demo works for chairs, but this is just a demo; the same technology could, in principle, be applied to any type of object category. 
As long as the 3D objects have a geometric construct (mesh\stl\obj\choose your format), the engine can find them.</p><h4>Future ideas</h4><p>Connect this to Thingiverse or GrabCAD or any other online 3D repository and allow for visual search over these repositories.</p>]]></content:encoded></item><item><title><![CDATA[Implicit-Decoder part 1 &#8211; 3D reconstruction]]></title><description><![CDATA[Back again with another AI and 3D reconstruction post for you. This time, a special article with many cool discoveries; I might write follow-up posts about it.]]></description><link>https://www.peternaf.com/p/implicit-decoder-part-1-3d-reconstruction</link><guid isPermaLink="false">https://www.peternaf.com/p/implicit-decoder-part-1-3d-reconstruction</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Fri, 11 Oct 2019 12:34:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!29HK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Back again with another AI and 3D reconstruction post for you.&nbsp;This time, a special article with many cool discoveries; I might write follow-up posts about it. This is the highest-quality research on 3D reconstruction from a single image that I have seen yet. It uses an encoder-decoder type of neural network to encode the 3D structure of a shape from a 2D image and then decode this structure to reconstruct the 3D shape.</p><h4>Some details</h4><ul><li><p>Input image: 128X128 pixels</p></li><li><p>Transparent image background</p></li><li><p>Training and generation are done based on categories of similar objects</p></li><li><p>Output voxels: base resolution is 64X64X64 voxels, but the network can produce output in any required resolution (!) 
without retraining the neural network</p></li></ul><h4>Neural Network structure:&nbsp;</h4><ul><li><p>2D encoder&#8202;&#8212;&#8202;based on ResNet18; generates an encoding vector of size 128 (the z-vector) from an input image</p></li><li><p>Decoder&#8202;&#8212;&#8202;a simple stack of 6 fully connected layers with 1 classification output neuron. It receives as input the z-vector and &#8211;<strong>one</strong>&#8211; 3D coordinate in space and classifies whether or not the coordinate belongs within the mass of the object.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3L7s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3L7s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 424w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 848w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 1272w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3L7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4dc4995-a9c1-4084-90b2-103668746b18_716x275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3L7s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 424w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 848w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 1272w, https://substackcdn.com/image/fetch/$s_!3L7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4dc4995-a9c1-4084-90b2-103668746b18_716x275.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Neural Network Structure</figcaption></figure></div><h4>How does reconstruction occur from 
this&nbsp;network?</h4><p>To reconstruct the entire structure of the object, all 3D coordinates in space are sent to the decoder (in the paper&#8217;s case there were 64X64X64 coordinates per object), along with the single z-vector from the image. The decoder classifies each coordinate and creates a representation of the 3D structure. This creates a voxel representation of the 3D object. Then, a <a href="https://en.wikipedia.org/wiki/Marching_cubes">marching cube</a> algorithm is used to create a mesh representation.</p><h3>Example of car category reconstruction</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!29HK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!29HK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 424w, https://substackcdn.com/image/fetch/$s_!29HK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 848w, https://substackcdn.com/image/fetch/$s_!29HK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!29HK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!29HK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Implicit decoder 3D reconstruction example&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Implicit decoder 3D reconstruction example" title="Implicit decoder 3D reconstruction example" srcset="https://substackcdn.com/image/fetch/$s_!29HK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 424w, https://substackcdn.com/image/fetch/$s_!29HK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 848w, https://substackcdn.com/image/fetch/$s_!29HK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!29HK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18024e3c-ae9d-486b-a760-12881301b9fc_497x960.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption 
class="image-caption">Implicit-decoder 3D reconstruction of car image</figcaption></figure></div><p>The first column is the input image, the second column is the AI 3D reconstruction, and the last column is the original 3D object of the car (or, in technical language&#8202;&#8212;&#8202;the ground truth). The neural network in this case was trained on models of cars. In the paper there are results for training on chairs, airplanes and more. Notice that the input and output image and voxel resolutions are specific to this paper, but they can be changed as needed for any required implementation.</p><h4>Wait! How is that last car reconstructed?</h4><p>The software didn&#8217;t even see the front of the car in the image. This is where the power of DL training comes from. Since we train the network on many previous examples of cars, it knows how to extrapolate the shape of a new car it never saw before. The extrapolation is possible because the network is trained on objects from a similar category, so the network effectively reconstructs similar structures it was trained on before which match the structure it sees in the image.</p><h2>Existing software for 3D reconstruction</h2><p>Nowadays there are many tools available that do 3D reconstruction from images. These tools use classic <a href="https://en.wikipedia.org/wiki/Photogrammetry">photogrammetry </a>techniques to reconstruct a 3D model from multiple images of the same object. Two examples:</p><ul><li><p><a href="https://agisoft.com">Agisoft</a></p></li><li><p><a href="https://www.autodesk.com/products/recap/overview">AutoDesk &#8211; Recap</a></p></li></ul><p>This type of software can benefit from the current AI research: reconstruction of simple planes even when they are not completely visible in the image, handling of light reflections or aberrations in the image, better proportion estimation, and more. 
All these can be improved using similar neural network solutions.</p><h2>Similar research in previous post</h2><p><a href="https://2d3d.ai/index.php/2019/10/09/3d-scene-reconstruction-from-single-image/">3D scene reconstruction from single image</a> &#8211; that post covered scene reconstruction; the quality of single-object reconstruction didn&#8217;t look as good, but it was impressive that they achieved it from a natural scene image.</p><h3>ShapeNet</h3><p>Similar to ImageNet for images, ShapeNet is a large dataset of annotated 3D models, along with competitions and groups of people who run ML research around the subject of 3D. Most (if not all) 3D ML research uses this dataset both for training and for benchmarking, including the implicit-decoder research. There are two main ShapeNet datasets; the most current is ShapeNetCore.v2:&nbsp;</p><ul><li><p>55 common object categories&nbsp;</p></li><li><p>About 51,300 unique 3D models</p></li><li><p>Each 3D model&#8217;s category and alignment (position and orientation) are verified.</p></li></ul><h3>References</h3><ul><li><p>Research: [1] Chen, Zhiqin, and Hao Zhang. &#8220;Learning implicit fields for generative shape modeling.&#8221; <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>. 
2019.</p></li><li><p>Shapenet: <a href="https://www.shapenet.org/">https://www.shapenet.org/</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[3D scene reconstruction from single image]]></title><description><![CDATA[This paper by Facebook Research shows how to use neural networks to analyze one image of a scene, segment it into the 3D models seen within it, and automatically create meshes\voxels from that single image.]]></description><link>https://www.peternaf.com/p/3d-scene-reconstruction-from-single-image</link><guid isPermaLink="false">https://www.peternaf.com/p/3d-scene-reconstruction-from-single-image</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Wed, 09 Oct 2019 11:37:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1f5cd6e8-5e65-4d77-b2e6-60bd59ae6ebd_499x462.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This paper by Facebook Research shows how to use neural networks to analyze one image of a scene, segment it into the 3D models seen within it, and automatically create meshes\voxels from that single image.<br>Link to paper: <a href="https://arxiv.org/abs/1906.02739">https://arxiv.org/abs/1906.02739</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3AeG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3AeG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 424w, 
https://substackcdn.com/image/fetch/$s_!3AeG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 848w, https://substackcdn.com/image/fetch/$s_!3AeG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 1272w, https://substackcdn.com/image/fetch/$s_!3AeG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3AeG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png" width="499" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:499,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Example of 3D scene reconstruction&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example of 3D scene reconstruction" title="Example of 3D scene reconstruction" srcset="https://substackcdn.com/image/fetch/$s_!3AeG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 424w, 
https://substackcdn.com/image/fetch/$s_!3AeG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 848w, https://substackcdn.com/image/fetch/$s_!3AeG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 1272w, https://substackcdn.com/image/fetch/$s_!3AeG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3640e079-f7b3-4549-9032-bf281fb306dd_499x462.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Example of scene 3D reconstruction</figcaption></figure></div><h2>Why single image?</h2><p>Using multiple images will bring better results and reconstruction accuracy, so why use only single images?</p><h4><em>It&#8217;s easier</em></h4><p>Training datasets are more readily available for single images. The architecture of the neural network is easier to model and explain when the input is a single image, and it requires fewer computational resources to train.</p><h4><em>It&#8217;s more interesting</em></h4><p>Once good reconstruction accuracy is reached with a single image, we know that the structure of the neural network is good. It is then possible to change this structure to add more images as input, be it by changing the neural network itself, changing the input vector it receives, averaging over the outputs of the network, or other combination methods. So, actually, multi-image reconstruction is an extension of single-image reconstruction.</p><h2>AI and humans</h2><p>Sometimes people are afraid that AI will replace us all. Well, if it is ever able to reach singularity (you can read more here: <a href="https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html">https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html</a>), then yes, it could happen. 
But neural networks that transform 2D images into 3D models are not what&#8217;s going to bring this change.<br>Current technological developments minimize repetitive human tasks and actually free up more time, money and energy for creative, valuable work.</p><p>This might lead to a change in workforce structure in the future, creating new jobs and making older jobs obsolete, but so did the invention of the car (which almost eliminated the need for coachmen but allowed for more accessible transportation and created jobs for taxi\bus\truck drivers), the invention of the telegraph, and many more examples.</p><p>Imagine giving a 2D\3D artist a tool in which they can draw whatever shape they like in 2D and software creates a corresponding 3D representation. This might open new possibilities for modelling, art, VR\AR, printing, and other industries that might pop up in the future. Or, just in the short term, it could make 3D scanning and modelling much cheaper and faster for makers.</p><h2><strong>Ideas for future research</strong></h2><ul><li><p>AI which gets as input a point cloud (instead of an image) and reconstructs an accurate 3D mesh</p></li><li><p>Camera and lighting pose and parameter estimation</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Open Source Research &#8211; Code reuse]]></title><description><![CDATA[I&#8217;m back!]]></description><link>https://www.peternaf.com/p/open-source-research-code-reuse</link><guid isPermaLink="false">https://www.peternaf.com/p/open-source-research-code-reuse</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Mon, 05 Feb 2018 07:00:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7cffd55b-8fd6-494c-8517-ec55a12f90dc_1202x1186.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m back!</p><p>So, what has been happening the past year? 
Well, for one thing, we have made progress in our open source research, created the data set, and published a paper; I have finished my thesis, and more is to come. The following blog posts will be focused on our research.</p><p>Back in January 2014 I joined Prof. Neil Gandal, Head of the Economics department, and Dr. Uriel Stettner from the Business department at Tel Aviv University to work together on research involving social science within the open source code development sphere.</p><p>The goals of our research are to better understand the open source community: what makes open source projects succeed, what types of commercial companies use open source, where open source developers choose to work and why, what the interactions between open source developers and projects are, and more.</p><p>My main focus in the research was to develop and investigate a data set of measures of &#8220;information spillovers&#8221; between projects. In other words, I constructed a network of the flow of code between projects, searching for similar code files and code reuse between projects.</p><p>The work comprised downloading all files from SourceForge from 1996 to 2015, then parsing the text of all these files into a uniform text format and creating a similarity measure between the different code files. I will explain the technical details of how it was all done in a different post.</p><p>Why SourceForge, and not, say, GitHub? SourceForge has existed since the 90s; it was the most popular platform long before GitHub. This allows us to run thorough social science research over time, checking for temporal and trending changes.</p><p>Two main topics when looking for code flows:</p><ol><li><p>Finding similar code files:<br>We separated code files by programming language.&nbsp; Within each language, we measured the similarity between two files by examining the text of function names, variable names, code fragments and comments within the code. 
The similarity between documents is based on their joint score in a vector space representation. Every word in each document is assigned two scores: (1) Term Frequency (TF), which measures how often the word appears within the document relative to all other words in that document, and (2) Document Frequency (DF), which counts the number of documents in the entire text universe (i.e., all files) in which the particular word appears; the Inverse Document Frequency (IDF) is the reciprocal of this count, typically log-scaled. Thus, the importance of a word in a document is proportional to its TF score and inversely proportional to its document frequency. For example, in the context of our research, the word &#8220;source&#8221; is important because it appears many times in this document and is unlikely to appear in many other economics papers. On the other hand, the word &#8220;the&#8221; is less important, because it is common in the English language and appears in many other documents. Using a &#8220;standard&#8221; combined TF-IDF score of each word within a document (file), we constructed a representation vector of size K, where K is the number of distinct words; each of the K entries is the TF-IDF score of the corresponding word. We then calculated the cosine similarity between the vectors of all pairs of files across projects to determine the similarity between the documents. We chose a minimum cutoff score for similarity, above which every pair is considered genuine reuse. We validated that cutoff manually on 30 pairs of similar files to make sure that pairs scoring above it are indeed similar.</p></li><li><p>Creating a chronological order of the code files:<br>We looked at the addition date of a file. 
File X was considered a reuse of file Y if X was similar to Y (similarity score above the cutoff) and X was added to its repository after Y.&nbsp; We then constructed a &#8220;reuse&#8221; connection network between the projects, where project A has a directed connection to project B if there is at least one pair of similar files belonging to these projects such that the original file belongs to A and the destination file belongs to B. Note that if Project B copied from Project A, and Project C copied the same file from Project B, Project A gets credit as the source in both cases.&nbsp; In this case, Project B is just a facilitator and does not get credit as the source.</p></li></ol><p>At the first stage we focused on code files in Java; later we also constructed the code flow network for C\C++. These are the dominant languages in SourceForge, comprising 9M and 11M files, respectively. The next most common language is C#, with less than 1M files.</p><p>Following is an example of how the Java code flow network looked in 2005. 
The blue dots are projects and there is an arrow between two dots if code was transferred from one project to another.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MKAD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MKAD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 424w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 848w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MKAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png" width="1202" height="1186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:1202,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;project code flow 2015.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="project code flow 2015.png" title="project code flow 2015.png" srcset="https://substackcdn.com/image/fetch/$s_!MKAD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 424w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 848w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!MKAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feefe676d-8845-49ac-8cfa-bd5721de34f0_1202x1186.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Information Flows Thesis Research]]></title><description><![CDATA[Following is the description of my thesis research, the main topic that I wanted to examine was to see how one&#8217;s social network centrality affects his credibility and his ability to spread information across the network.]]></description><link>https://www.peternaf.com/p/information-flows-thesis-research</link><guid isPermaLink="false">https://www.peternaf.com/p/information-flows-thesis-research</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Sun, 04 Feb 2018 10:58:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Following is the 
description of my thesis research. The main topic I wanted to examine was how one&#8217;s social network centrality affects their credibility and their ability to spread information across the network. This question has particular meaning within professional social networks, in which actors are professionals who know how to assess the quality of the information they are exposed to from their peers.</p><p>In this research, I conducted the first in-depth study of the behavior of information flows in the open source software (OSS) world. I carried out the first file-level, global-scale analysis, taking advantage of the mass of OSS information available online from SourceForge between the years 2005-2013. I created an information flows network based on code copying and reuse across all these years. I combined this information flows network with the social networks studied in previous research (Fershtman and Gandal 2011; Gandal and Stettner 2014), linking social network structure and interactions with exact measurements of information flow to gain novel understanding of the way developers operate in OSS.</p><p>Several main questions were asked (and answered) in my research:</p><ol><li><p>Is the centrality of the developer or the project associated with information propagation?</p></li><li><p>How does the spread of open source code look over time, and what is its pace?</p></li><li><p>Is the Two Step Flow of Communication (Katz 1957) characteristic of open source code, whereby central actors bring information into the network from external sources?</p></li><li><p>What is the reach of different code files over the network, and how is it affected by the originator of these files?</p></li><li><p>Do central actors in the open source social network use their connections to gather code from peers that is relevant to their own projects?</p></li></ol><p>The Dataset:</p><p>Using the dataset of code similarities I created an information flow network 
for each year, based on the quartet File-Project-Developer-Year.<br>Each node is a given code file, with outbound connections to similar code files created up to the given year and an inbound connection if the code file was similar to a previously created file. A connection can exist between two nodes if and only if they belong to different projects, were created by different developers, and both code files were created up to and including the given year.</p><p>I combined the information flow network with two other networks:<br>(i) Developer network: Two developers are connected in the network if both were members of the same project in the same year.<br>(ii) Project network: Two projects are connected in the network if both projects had a mutual developer in the same year.</p><p>Spread of information over time:</p><p>Checking the reach of code files, developers and projects, I found power-law effects at work: a few developers, projects and code files account for a very large share of the copying and reuse. I then checked the speed of information spread via code reuse. I found evidence consistent with theoretical and empirical work on social network information flows: the pace of cumulative information spread in OSS over time follows an S-shaped curve, or a bell curve if we look at the temporal distribution. Our data set showed an interesting phenomenon in which the first years of a code file&#8217;s existence are the most important for its spread, while later years exhibit less code reuse, suggesting technological aging.</p><p>Central actors behavior:</p><p>My main results showed ambiguity with regard to how much centrality in the social network is associated with being a source and originator of information. 
The results suggest that what matters more is the activity of the project itself, together with validation that the code is indeed valuable, reflected in lower modification counts and more years of existence. I saw that the license agreement of a project has a direct effect on the amount its code is reused: more permissive license types were associated with greater code spread.<br>I found that central developers in SourceForge act more as mavens than connectors, bringing new information into the network rather than connecting developers and projects.</p><p>Two Step Communication Flows:</p><p>The online environment is ideal for two step communication flows (Katz 1957), because it is a setting in which actors can bring information from other online sources into their own social network; but using our dataset I was able to refute its existence in SourceForge.</p><p>Concluding thoughts:</p><p>I have started down the path of understanding and measuring social interactions and information spread in OSS on a large scale. 
But there is still work to be done with regard to homophily, aggregating the social characteristics of all peers who reuse a piece of code, understanding the importance of license agreements for code reuse, and more.</p>]]></content:encoded></item><item><title><![CDATA[Open Source Research – Following the Code]]></title><description><![CDATA[Following is the description of the joint research Prof.]]></description><link>https://www.peternaf.com/p/open-source-research-following-the-code</link><guid isPermaLink="false">https://www.peternaf.com/p/open-source-research-following-the-code</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Sat, 03 Feb 2018 07:08:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Following is the description of the joint research that Prof. Neil Gandal, Dr. Uriel Stettner and I were working on. Our main goal was to research how knowledge spillovers between different OSS projects affect projects&#8217; success and progress.</p><p>In the case of OSS development, knowledge spillovers (if they exist) likely occur via two channels:</p><ol><li><p>Spillovers from software reuse: Programmers take software code from one project and employ it in another project.</p></li><li><p>Spillovers from common programmers: Programmers take knowledge, know-how, and experience from one or more OSS projects they work on and employ that knowledge on another OSS project they work on.</p></li></ol><p>The first channel includes (i) reuse from one project that a programmer is working on to another project he or she is working on, as well as (ii) reuse from a project that has no common programmers with the relevant project. The second channel includes knowledge, know-how, and experience other than software reuse. 
A key question is whether these spillovers exist in a large OSS network, and if they do, whether knowledge transfer enhances the performance of the projects involved. In previous work we examined how connections among software projects via common programmers affected the success of OSS projects (Fershtman and Gandal 2011; Gandal and Stettner 2016). We found evidence of positive spillovers, but since we could not measure reuse on a large scale, these spillovers include knowledge, know-how, experience, and reuse from other projects the programmer is working on. By directly measuring software reuse as well as network connections, we can separately measure the importance of the two channels.</p><p>Direct knowledge spillovers occur when two projects have a common programmer who transfers knowledge, know-how and experience embedded in the code from one project to another. In contrast, indirect project spillovers occur when knowledge is transferred from one project to another when the two projects are not directly linked through a common programmer. For example, suppose that programmer &#8220;A&#8221; works on projects I and II, while programmer &#8220;B&#8221; works on projects II and III. Programmer A could take knowledge from project I and use it in project II. Programmer B might find that knowledge useful and take it from project II to project III. In such a case, knowledge is transferred from one project to another by programmers who work on more than one project. 
There is a direct spillover from project I to project II, and an indirect spillover from project I to project III, since projects I and III are not directly connected.</p><p>We calculate reuse measures for all projects in our data set, and examine whether reuse of software is associated with project success (controlling for other factors).</p><p>We used the same base network and datasets as described in: <a href="https://progressingineering.wordpress.com/2018/02/04/information-flows-thesis-research/">Information Flows Thesis Research</a>. We constructed a &#8220;reuse&#8221; connection network between the projects, where project A has a directed connection to project B if there is at least one pair of similar files belonging to these projects such that the original file belongs to A and the destination file belongs to B. Note that if Project B copied from Project A, and Project C copied the same file from Project B, Project A gets credit as the source in both cases. In this case, Project B is just a facilitator and does not get credit as the source. Finally, we added up all of the connections and defined the variables reuse_in and reuse_out for each project. &#8220;Reuse_in&#8221; is the number of other projects from which that project reused at least one software file. &#8220;Reuse_out&#8221; is the number of projects to which the project &#8220;contributed&#8221; at least one software file. We also account for investment and effort in the project. Hence, we compute the number of modifications and additions made to the code of each project over the period between 2005 and 2008. A modification is defined as a change made by a programmer to existing code within a distinct file, while an addition occurs when a programmer adds a new file that contains a block of code that was not previously part of a focal OSS project. 
Thus, a modification captures an activity that affects a particular set of code with the aim of, for example, making the code more efficient or stable. Accordingly, modifications are a good proxy for incremental innovation that, for example, improves how the software product works via the refinement, reutilization, and elaboration of established ideas and technologies. Additions are a proxy for new knowledge that may provide additional functionality (Lewin, Long, &amp; Carroll, 1999).</p><p>Our key findings are:</p><ol><li><p>Controlling for other factors that explain success, projects that reuse code from a greater number of projects have more success.</p></li><li><p>Even after controlling for software reuse effects, we find knowledge spillovers via common programmers among projects: projects that have more connections are more successful. This suggests that projects receive additional (i.e., non-code) knowledge spillovers from connected projects.</p></li></ol><p>We see that knowledge spillovers take place via both channels discussed above and that both channels (reuse of code and other knowledge spillovers from connected projects) yield spillover benefits.<br>We then delineate reuse into two categories:</p><ol><li><p>Software reuse from connected projects, i.e., reuse from a project with a contributor in common with the relevant project.</p></li><li><p>Software reuse from unconnected projects, i.e., reuse from projects without a contributor in common with the relevant project.</p></li></ol><p>We find that reuse from connected projects is not statistically significant in explaining success, while reuse from unconnected projects is statistically significant. Overall, our results suggest that knowledge spillovers from neighboring projects are primarily due to knowledge other than copied code, while &#8220;reuse&#8221; spillovers come from the general community of open source software projects. 
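</p><p>As a rough illustration of the &#8220;reuse&#8221; network and the reuse_in/reuse_out counts defined above, here is a minimal sketch in Python with NetworkX; the project names and edges are invented for illustration, and in the actual study the network was of course built from the file-level similarity data:</p>

```python
import networkx as nx

# Directed reuse network: an edge (A, B) means at least one file that
# originated in project A was reused in project B (names invented).
G = nx.DiGraph()
G.add_edges_from([
    ("proj_a", "proj_b"),
    ("proj_a", "proj_c"),
    ("proj_b", "proj_c"),
])

# reuse_in: number of projects a project reused code from (in-degree).
# reuse_out: number of projects that reused the project's code (out-degree).
reuse_in = {p: G.in_degree(p) for p in G.nodes}
reuse_out = {p: G.out_degree(p) for p in G.nodes}

print(reuse_in["proj_c"], reuse_out["proj_a"])  # 2 2
```

<p>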
These results provide the first empirical support for knowledge spillovers via reused code in large open source software networks.</p>]]></content:encoded></item><item><title><![CDATA[Open Source Research – Technical Work]]></title><description><![CDATA[Crawling]]></description><link>https://www.peternaf.com/p/open-source-research-technical-work</link><guid isPermaLink="false">https://www.peternaf.com/p/open-source-research-technical-work</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Thu, 01 Feb 2018 17:20:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rIWZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcc7748-5616-4c6c-9177-20671172e987_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Crawling</em></p><p>We crawled all projects in SourceForge and saved their entire data from all years starting in 1998. Two main source code repository types were used: SVN and CVS. CVS is an older code management system that is less used today.</p><p>Using the management APIs for SVN and CVS, we first tried creating a program that queries every project for its content for every year of its existence, taking only the files that changed within the year. This proved to be time consuming. We then changed tactics and simply downloaded the entire repositories, virtually creating a copy of all the projects in SourceForge in our database. This accumulated to about 8 TB of data.</p><p><em>Parsing</em></p><p>Each project comprises many different file types which hold relevant information: text code files, textual format documents, Word documents, PDFs, compressed files which can contain more documents, configuration files and more. 
In order to use all this data, we needed to parse all the documents within a project and save them in a uniform format.</p><p>We created a program which was able to traverse the directory tree of every open source project, for every year of its existence, and extract the important information from each file, with the help of the open source project <a href="https://sourceforge.net/projects/svn-search">svn-search</a>. For each file we extracted the text of the file, the author of the file, the last action performed on the file&nbsp;within the corresponding year (edit, delete, add), the time of the last action, the author of the last action, the comments of the last author of the file, the size of the file, the location of the file in the project and more. We needed to adjust our program to handle both SVN and CVS repository types, to parse file types unique to what we saw in SourceForge, and to save everything in our own XML format. This created a base of 1 TB of standardized data which we could then analyze.</p><p><em>Indexing</em></p><p>We used a Lucene-based textual index to index code files for every programming language, focusing on indexes for the programming languages Java and C\C++. Our XML-formatted files were designed to work seamlessly with the Lucene index. The Java source code index comprises 9 million files across 130,000 projects. We divided it into 12 sub-indexes (shards) in order to improve search performance, over a cluster of 2 physical servers, each with 16 CPU cores, 128GB RAM and 500GB SSD hard disks. We added Java keywords to the list of stop words in Solr, so that these words are not counted when searching for similarity.</p><p><em>Clearing Automatically created documents</em></p><p>Many code files are created by automated tools. This creates a bias in the dataset in which we think there was a transfer of knowledge between two projects when, in fact, they just used the same automatic tool for code creation. 
To account for these code files and remove them from our sample, we used Solr again. We looked at samples of extremely similar code files and picked those that we saw were created by automated tools. These code files had textual signatures within them that pointed to their automatic creation. We searched for these textual signatures within our code corpus using Solr and removed the files which matched the textual search query.</p><p><em>Data analysis</em></p><p>After we had the textual index, we created a MySQL metadata index of these files. We created a Java program which ran similarity searches for all files in Solr and indexed the results in the database.&nbsp; We could then run fast SQL queries over the data and create Python network objects from it.</p><p><em>Social Network Analysis</em></p><p>We had past data regarding the connections of all code developers in SourceForge saved in our MySQL DB. We combined this data with the newly constructed code flow dataset in order to extract meaningful social network insights.</p><p>We used the NetworkX package for Python to compute network characteristics. 
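</p><p>For example, the kind of network characteristics used in the analysis can be computed like this (a toy graph with invented developer names stands in for the real developer network):</p>

```python
import networkx as nx

# Toy undirected developer network: an edge means the two developers
# were members of the same project in the same year (names invented).
G = nx.Graph()
G.add_edges_from([
    ("dev_a", "dev_b"),
    ("dev_b", "dev_c"),
    ("dev_b", "dev_d"),
])

degree = dict(G.degree())                       # direct co-developer counts
centrality = nx.degree_centrality(G)            # degree normalized by (n - 1)
components = nx.number_connected_components(G)  # connected developer clusters

print(degree["dev_b"], components)  # dev_b has 3 co-developers; 1 component
```

<p>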
To get statistical insights we used Pandas a data analysis Python package, R programming language and Stata.</p>]]></content:encoded></item><item><title><![CDATA[DataHack – FlyCatcher]]></title><description><![CDATA[A couple of weeks ago some friends and I have participated in a three day programming competition (Hackathon) centered around data mining and machine learning.]]></description><link>https://www.peternaf.com/p/datahave-flycatcher</link><guid isPermaLink="false">https://www.peternaf.com/p/datahave-flycatcher</guid><dc:creator><![CDATA[Peter Naftaliev]]></dc:creator><pubDate>Fri, 18 Dec 2015 09:16:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a464f4fa-f24c-465a-8cc4-a877ab0cad84_2048x1365.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A couple of weeks ago some friends and I have participated in a three day programming competition (Hackathon) centered around data mining and machine learning. We won third place and self gratification.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X69d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X69d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X69d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!X69d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X69d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X69d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg" width="2048" height="1365" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1365,&quot;width&quot;:2048,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;12307562_10153683176449299_3174116549342585189_o&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="12307562_10153683176449299_3174116549342585189_o" title="12307562_10153683176449299_3174116549342585189_o" srcset="https://substackcdn.com/image/fetch/$s_!X69d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X69d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!X69d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X69d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2b19f-6af0-4ba5-9152-a7a8b222f8ff_2048x1365.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Presenting FlyCatcher</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!AH20!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AH20!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AH20!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AH20!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AH20!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AH20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg" width="1296" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;IMG-20151127-WA0006&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="IMG-20151127-WA0006" title="IMG-20151127-WA0006" srcset="https://substackcdn.com/image/fetch/$s_!AH20!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AH20!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AH20!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AH20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4fffa8-334a-4fc5-b78c-4817108f6496_1296x972.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><p>FlyCatcher Team</p><p>Below is a summary of what we did &#8211; caution, technical stuff ahead.</p><p>Using historical flight data, weather data and machine learning, we are able to predict whether a flight will be delayed a day before its scheduled departure.</p><p>We found novel insights relating historical flight delays to future flight delays.</p><p>In particular, the previous day&#8217;s delays at the airport and within the airline carry high predictive power for next-day flight delays, yielding 77.6% prediction accuracy.</p><p>Tracking the flights of individual airplanes raises predictive accuracy to 90%.</p><p>We looked at 10 years of flight and weather records.</p><p>We downloaded flight data from the US Bureau of Transportation Statistics; it consisted of about 500,000 civilian inter-American flight records per month. 
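The previous-day delay counts mentioned above (the per-airline and per-airport features we describe later) can be sketched in plain Python. The record layout and values here are toy data, invented for illustration:

```python
from collections import defaultdict
from datetime import date, timedelta

# Toy flight records: (date, carrier, origin airport, delayed?) -- hypothetical schema
flights = [
    (date(2014, 3, 1), "UA", "ATL", True),
    (date(2014, 3, 1), "UA", "ATL", False),
    (date(2014, 3, 1), "HA", "HNL", False),
    (date(2014, 3, 2), "UA", "ATL", True),
]

# Count delays per (day, carrier) and per (day, origin airport)
carrier_delays = defaultdict(int)
origin_delays = defaultdict(int)
for d, carrier, origin, delayed in flights:
    if delayed:
        carrier_delays[(d, carrier)] += 1
        origin_delays[(d, origin)] += 1

def cld(flight_date, carrier):
    """Carrier FLight Delays: the airline's delay count on the previous day."""
    return carrier_delays[(flight_date - timedelta(days=1), carrier)]

def old(flight_date, origin):
    """Origin FLight Delays: the origin airport's delay count on the previous day."""
    return origin_delays[(flight_date - timedelta(days=1), origin)]

# A March 2 UA flight out of ATL sees one carrier delay and one origin delay from March 1
print(cld(date(2014, 3, 2), "UA"), old(date(2014, 3, 2), "ATL"))  # 1 1
```

In the real pipeline these counts were computed over the full 2014 record set, not per-row toy tuples.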
For weather, we obtained measurements from the National Oceanic and Atmospheric Administration covering all meteorological stations in the US, including stations located at or near airports. Each station recorded a measurement every 2 minutes.</p><p>After examining the data, we decided to focus on all records from 2014.</p><p>We started off by mapping busy flight routes across America. In the map below, darker lines carry more flights.</p><p>Next, we looked at flight delays as a percentage of total flights on each route. In the map below, green lines indicate routes with relatively few delays (the stronger the green, the smaller the percentage of delayed flights), while red lines indicate routes with relatively many delays (the stronger the red, the larger the percentage).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ccf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ccf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 424w, https://substackcdn.com/image/fetch/$s_!3ccf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 848w, 
https://substackcdn.com/image/fetch/$s_!3ccf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 1272w, https://substackcdn.com/image/fetch/$s_!3ccf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ccf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png" width="1258" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1258,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;red_green&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="red_green" title="red_green" srcset="https://substackcdn.com/image/fetch/$s_!3ccf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 424w, https://substackcdn.com/image/fetch/$s_!3ccf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 848w, 
https://substackcdn.com/image/fetch/$s_!3ccf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 1272w, https://substackcdn.com/image/fetch/$s_!3ccf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78f31367-bb93-4bfc-aa1f-0d4e35cbbc7b_1258x765.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Examining delays across airlines and airports we saw that the busier the airport or the larger the airline, the higher the percentage 
of their flights that were delayed.</p><p>The graphs below show, for the 14 major American airlines (left) and hundreds of airports (right), the percentage of delayed flights out of each one&#8217;s total 2014 flight count.</p><p>We can see, for example, that United Airlines (UA) &#8211; a very big company &#8211; had a much higher percentage of delayed flights than Hawaiian Airlines (HA) &#8211; a much smaller one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8MOY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8MOY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 424w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 848w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 1272w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8MOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png" width="800" height="550" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;perf_200&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="perf_200" title="perf_200" srcset="https://substackcdn.com/image/fetch/$s_!8MOY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 424w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 848w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 1272w, https://substackcdn.com/image/fetch/$s_!8MOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ad0f3a-8943-478e-99b2-c21260a9c6db_800x550.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Next, we combined weather and flight data to see whether there is any connection between the two. In the graph below, green vertical lines mark days of storms (rain, snow, strong winds) at Atlanta airport across 2014. The blue line is the daily percentage of delayed flights over the same period. 
We see that storm days do not produce corresponding peaks in delays.</p><p>Moving toward the machine learning phase, after sorting, filtering and running some basic prediction models, we extracted the following feature vector for each flight in 2014:</p><p>&#8220;FlightNum&#8221;,&#8221;CRSDepTime&#8221;,&#8221;DepDelay&#8221;,&#8221;CRSArrTime&#8221;,&#8221;ArrDelay&#8221;,</p><p>&#8220;AirTime&#8221;,&#8221;Distance&#8221;,&#8221;date&#8221;,&#8221;crsdeptime&#8221;,&#8221;crsarrtime&#8221;,&#8221;CLD&#8221;,&#8221;OLD&#8221;,</p><p>&#8220;DLD&#8221;,&#8221;wind_speed&#8221;,&#8221;clouds_height&#8221;,&#8221;temperature&#8221;,&#8221;AU_Intens&#8221;,</p><p>&#8220;AU_Precip&#8221;,&#8221;wind_direction&#8221;,&#8221;visibility&#8221;,&#8221;pressure&#8221;,&#8221;AW_Atmos_Cond&#8221;,</p><p>&#8220;AU_Obscur&#8221;,&#8221;AU_Desc&#8221;,&#8221;Delay&#8221;</p><p>Some of these are categorical parameters; for example, AU_Precip takes one of 7 possible values describing the type of precipitation (light or heavy rain, snow, no precipitation, and more). Others are numerical, for example wind_speed, measured in meters per second.</p><p>All temporal predictive parameters, such as weather conditions, were measured the day before each flight. The DepDelay parameter gives the length of the delay in minutes; if it was larger than 15, we considered the flight delayed.</p><p>After filtering the records down to those containing all of these fields, we were left with about 2,250,000 records. 
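The labeling rule above can be sketched in a few lines of Python. The record layout is a hypothetical simplification of the real feature vector:

```python
from datetime import date

# Hypothetical minimal records: (flight date, DepDelay in minutes)
records = [
    (date(2014, 1, 5), 3),
    (date(2014, 6, 12), 42),
    (date(2014, 10, 2), 20),
    (date(2014, 11, 30), 0),
]

# Binary target: a flight counts as delayed when DepDelay exceeds 15 minutes
labeled = [(d, delay, delay > 15) for d, delay in records]

# Chronological split: first 9 months for training, last 3 months for testing
train = [r for r in labeled if r[0].month <= 9]
test = [r for r in labeled if r[0].month >= 10]
print(len(train), len(test))  # 2 2
```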
We took all records of the last 3 months of 2014 as a test set and the first 9 months as training.</p><p>The three most interesting parameters we found in this feature vector were:</p><ul><li><p>CLD &#8211; <strong>C</strong>arrier F<strong>L</strong>ight <strong>D</strong>elays, the number of delays the same airline had the day before.</p></li><li><p>OLD &#8211; <strong>O</strong>rigin F<strong>L</strong>ight <strong>D</strong>elays, the number of delays the origin airport had the day before.</p></li><li><p>DLD &#8211; <strong>D</strong>estination F<strong>L</strong>ight <strong>D</strong>elays, the number of delays the destination airport had the day before.</p></li></ul><p>We also found that weather information did not improve prediction accuracy, and in some instances even reduced it.</p><p>Our model predicted the binary indicator DepDelay&gt;=15. We tried SVM, random forest, and simple logistic regression. Logistic regression was both the best estimator and by far the fastest, yielding the reported 77.6% prediction accuracy.</p><p>Our most interesting finding was that if we can track the physical airplane (identified by its tail number) that will fly a flight, we can achieve 90%+ predictive power. This is because if a plane is late for its first flight, it tends to stay late for the following flights: at each destination airport it has only a short window to refuel and continue to the next flight. The chain only breaks once the plane gets a break of several hours at one of its destination airports. We tried to predict which tail number would fly each flight, but could not complete that task within the hackathon time frame. Planes fly to too many destinations, and predicting their location requires a sophisticated algorithm. 
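The delay chain described above can be illustrated with a toy propagation model; the slack times and delay values are invented for illustration, not taken from the data:

```python
# Toy model of delay propagation along one airplane's (tail number's) daily chain.
# Each stop has some scheduled turnaround slack (minutes) before the next departure;
# a delay carries over to the next leg whenever it exceeds the available slack.

def propagate(initial_delay, slacks):
    """Return the departure delay of each subsequent leg given per-stop slack."""
    delays = []
    delay = initial_delay
    for slack in slacks:
        delay = max(0, delay - slack)  # turnaround slack absorbs part of the delay
        delays.append(delay)
    return delays

# First flight leaves 60 min late; short turnarounds keep the chain delayed
# until a long break finally absorbs the remainder.
print(propagate(60, [20, 15, 180]))  # [40, 25, 0]
```

This is why knowing the tail number is so predictive: a late plane stays late until it hits a long enough break.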
For example, the airplane in the picture below has two main airports it flies to, and from each of those it flies to many other destinations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GY2x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GY2x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 424w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 848w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 1272w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GY2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png" width="2400" height="1137" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1137,&quot;width&quot;:2400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;graph&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="graph" title="graph" srcset="https://substackcdn.com/image/fetch/$s_!GY2x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 424w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 848w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 1272w, https://substackcdn.com/image/fetch/$s_!GY2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d1350-6a22-4330-8f74-6870b11df15a_2400x1137.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item></channel></rss>