A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML

Last update: Jan 1, 2023

Comments

[BUG] BrowserFetcher is still not working on Android

Here is the error I get when using BrowserFetcher. I think the error is because of htmlunit-android.

2022-04-12 21:07:05.566 5395-5451/ir.kazemcodes.infinityreader E/AndroidRuntime: FATAL EXCEPTION: DefaultDispatcher-worker-2
    Process: ir.kazemcodes.infinityreader, PID: 5395
    java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/datatransfer/ClipboardOwner;
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303)
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0)
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812)
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140)
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413)
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19)
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
     Caused by: java.lang.ClassNotFoundException: Didn't find class "java.awt.datatransfer.ClipboardOwner" on path: DexPathList[[dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes4.dex", dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes11.dex", zip file "/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk"],nativeLibraryDirectories=[/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/lib/arm64, /data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk!/lib/arm64-v8a, /system/lib64, /system/system_ext/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:207)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303) 
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289) 
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0) 
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812) 
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140) 
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394) 
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413) 
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19) 
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157) 
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) 
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
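The trace above ends in a ClassNotFoundException for java.awt.datatransfer.ClipboardOwner, which means that class is not on the app's class path at runtime. As a minimal sketch (the helper name isClassAvailable is hypothetical and is not part of skrape.it or HtmlUnit), code could probe for such a class before choosing a fetcher that depends on it:

```kotlin
// Hypothetical helper for illustration: reports whether a class can be
// located by the class loader of the calling code.
fun isClassAvailable(className: String): Boolean =
    try {
        Class.forName(className)
        true
    } catch (e: ClassNotFoundException) {
        // The class is not on the class path at all.
        false
    } catch (e: LinkageError) {
        // The class was found but one of its dependencies could not be resolved.
        false
    }
```

Calling isClassAvailable("java.awt.datatransfer.ClipboardOwner") would presumably return false on the device from this report, since that is the class the trace says could not be found.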

bug

opened by kazemcodes 19

[BUG] Android Studio Project crashes after adding library dependency

Hi! I'm trying to use the library in an Android project. As suggested here https://github.com/skrapeit/skrape.it/issues/89, I added this to my build.gradle to avoid problems with methods that contain spaces in their names, and so on.

repositories {
    maven { url "https://jitpack.io" }
}
dependencies {
      testImplementation("com.github.skrapeit:skrape.it:master-SNAPSHOT")
}

And I was able to run this Unit Test successfully

import it.skrape.core.htmlDocument
import it.skrape.matchers.toBe
import it.skrape.matchers.toContain
import it.skrape.selects.html5.h1
import it.skrape.selects.html5.p
import org.junit.Test

import org.junit.Assert.*

/**
 * Example local unit test, which will execute on the development machine (host).
 *
 * See [testing documentation](http://d.android.com/tools/testing).
 */
class ExampleUnitTest {

    @Test
    internal fun `can read and return html from String`() {
        htmlDocument(
            """
        <html>
            <body>
                <h1>welcome</h1>
                <div>
                    <p>first p-element</p>
                    <p class="foo">some p-element</p>
                    <p class="foo">last p-element</p>
                </div>
            </body>
        </html>"""
        ) {

            h1 {
                findFirst {
                    text toBe "welcome"
                }
            }
            p {
                withClass = "foo"
                findFirst {
                    text toBe "some p-element"
                    className toBe "foo"
                }
            }
            p {
                findAll {
                    [email protected] toContain "p-element"
                }
                findLast {
                    text toBe "last p-element"
                }
            }

        }
    }
}

No problem so far, I guess because the unit tests run inside the JVM.

After that I tried to use the dependency from my Android project by adding this to my build.gradle:

implementation("com.github.skrapeit:skrape.it:master-SNAPSHOT")

And received this error:

More than one file was found with OS independent path 'META-INF/DEPENDENCIES'

After a quick Google search I found this possible solution, so I added this to my build.gradle:

android {
    packagingOptions {
        pickFirst "META-INF/DEPENDENCIES"
    }
}

After adding that I'm getting more errors, so I think this might not be the solution. Here is the stack trace when I try to run the app:

2020-05-19 01:26:30.841 14644-14644/? I/webscrappertes: Late-enabling -Xcheck:jni
2020-05-19 01:26:30.879 14644-14644/? E/webscrappertes: Unknown bits set in runtime_flags: 0x8000
2020-05-19 01:26:31.299 14644-14644/? W/webscrappertes: Bad encoded_array value: Failure to verify dex file '/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk': Bad encoded_value method type size 7
2020-05-19 01:26:31.306 14644-14644/? E/LoadedApk: Unable to instantiate appComponentFactory
    java.lang.ClassNotFoundException: Didn't find class "androidx.core.app.CoreComponentFactory" on path: DexPathList[[zip file "/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk"],nativeLibraryDirectories=[/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/lib/arm64, /system/lib64, /vendor/lib64, /system/product/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:196)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        at android.app.LoadedApk.createAppFactory(LoadedApk.java:256)
        at android.app.LoadedApk.createOrUpdateClassLoaderLocked(LoadedApk.java:855)
        at android.app.LoadedApk.getClassLoader(LoadedApk.java:950)
        at android.app.LoadedApk.getResources(LoadedApk.java:1188)
        at android.app.ContextImpl.createAppContext(ContextImpl.java:2462)
        at android.app.ContextImpl.createAppContext(ContextImpl.java:2454)
        at android.app.ActivityThread.handleBindApplication(ActivityThread.java:6353)
        at android.app.ActivityThread.access$1300(ActivityThread.java:220)
        at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1860)
        at android.os.Handler.dispatchMessage(Handler.java:107)
        at android.os.Looper.loop(Looper.java:214)
        at android.app.ActivityThread.main(ActivityThread.java:7397)
        at java.lang.reflect.Method.invoke(Native Method)
        at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
        at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:935)
    	Suppressed: java.io.IOException: Failed to open dex files from /data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk because: Bad encoded_array value: Failure to verify dex file '/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk': Bad encoded_value method type size 7
        at dalvik.system.DexFile.openDexFileNative(Native Method)
        at dalvik.system.DexFile.openDexFile(DexFile.java:365)
        at dalvik.system.DexFile.<init>(DexFile.java:107)
        at dalvik.system.DexFile.<init>(DexFile.java:80)
        at dalvik.system.DexPathList.loadDexFile(DexPathList.java:444)
        at dalvik.system.DexPathList.makeDexElements(DexPathList.java:403)
        at dalvik.system.DexPathList.<init>(DexPathList.java:164)
        at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:126)
        at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:101)
        at dalvik.system.PathClassLoader.<init>(PathClassLoader.java:74)
        at com.android.internal.os.ClassLoaderFactory.createClassLoader(ClassLoaderFactory.java:87)
        at com.android.internal.os.ClassLoaderFactory.createClassLoader(ClassLoaderFactory.java:116)
        at android.app.ApplicationLoaders.getClassLoader(ApplicationLoaders.java:114)
        at android.app.ApplicationLoaders.getClassLoaderWithSharedLibraries(ApplicationLoaders.java:60)
        at android.app.LoadedApk.createOrUpdateClassLoaderLocked(LoadedApk.java:851)
        		... 13 more
2020-05-19 01:26:31.330 14644-14644/? I/Perf: Connecting to perf service.
2020-05-19 01:26:31.342 14644-14679/? E/Perf: Fail to get file list cu.neosoft.webscrappertest
2020-05-19 01:26:31.342 14644-14679/? E/Perf: getFolderSize() : Exception_1 = java.lang.NullPointerException: Attempt to get length of null array
2020-05-19 01:26:31.342 14644-14679/? E/Perf: Fail to get file list cu.neosoft.webscrappertest
2020-05-19 01:26:31.343 14644-14679/? E/Perf: getFolderSize() : Exception_1 = java.lang.NullPointerException: Attempt to get length of null array
2020-05-19 01:26:31.396 14644-14644/? D/AndroidRuntime: Shutting down VM
    
    
    --------- beginning of crash
2020-05-19 01:26:31.400 14644-14644/? E/AndroidRuntime: FATAL EXCEPTION: main
    Process: cu.neosoft.webscrappertest, PID: 14644
    java.lang.RuntimeException: Unable to instantiate activity ComponentInfo{cu.neosoft.webscrappertest/cu.neosoft.webscrappertest.MainActivity}: java.lang.ClassNotFoundException: Didn't find class "cu.neosoft.webscrappertest.MainActivity" on path: DexPathList[[zip file "/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk"],nativeLibraryDirectories=[/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/lib/arm64, /system/lib64, /vendor/lib64, /system/product/lib64]]
        at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3195)
        at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3410)
        at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:83)
        at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
        at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
        at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2017)
        at android.os.Handler.dispatchMessage(Handler.java:107)
        at android.os.Looper.loop(Looper.java:214)
        at android.app.ActivityThread.main(ActivityThread.java:7397)
        at java.lang.reflect.Method.invoke(Native Method)
        at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
        at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:935)
     Caused by: java.lang.ClassNotFoundException: Didn't find class "cu.neosoft.webscrappertest.MainActivity" on path: DexPathList[[zip file "/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk"],nativeLibraryDirectories=[/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/lib/arm64, /system/lib64, /vendor/lib64, /system/product/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:196)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        at android.app.AppComponentFactory.instantiateActivity(AppComponentFactory.java:95)
        at android.app.Instrumentation.newActivity(Instrumentation.java:1251)
        at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3183)
        at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3410) 
        at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:83) 
        at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135) 
        at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95) 
        at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2017) 
        at android.os.Handler.dispatchMessage(Handler.java:107) 
        at android.os.Looper.loop(Looper.java:214) 
        at android.app.ActivityThread.main(ActivityThread.java:7397) 
        at java.lang.reflect.Method.invoke(Native Method) 
        at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492) 
        at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:935) 
    	Suppressed: java.io.IOException: Failed to open dex files from /data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk because: Bad encoded_array value: Failure to verify dex file '/data/app/cu.neosoft.webscrappertest-hy3VhMUfmFw2eV1pLwg5LQ==/base.apk': Bad encoded_value method type size 7
        at dalvik.system.DexFile.openDexFileNative(Native Method)
        at dalvik.system.DexFile.openDexFile(DexFile.java:365)
        at dalvik.system.DexFile.<init>(DexFile.java:107)
        at dalvik.system.DexFile.<init>(DexFile.java:80)
        at dalvik.system.DexPathList.loadDexFile(DexPathList.java:444)
        at dalvik.system.DexPathList.makeDexElements(DexPathList.java:403)
        at dalvik.system.DexPathList.<init>(DexPathList.java:164)
        at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:126)
        at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:101)
        at dalvik.system.PathClassLoader.<init>(PathClassLoader.java:74)
        at com.android.internal.os.ClassLoaderFactory.createClassLoader(ClassLoaderFactory.java:87)
        at com.android.internal.os.ClassLoaderFactory.createClassLoader(ClassLoaderFactory.java:116)
        at android.app.ApplicationLoaders.getClassLoader(ApplicationLoaders.java:114)
        at android.app.ApplicationLoaders.getClassLoaderWithSharedLibraries(ApplicationLoaders.java:60)
        at android.app.LoadedApk.createOrUpdateClassLoaderLocked(LoadedApk.java:851)
        at android.app.LoadedApk.getClassLoader(LoadedApk.java:950)
        at android.app.LoadedApk.getResources(LoadedApk.java:1188)
        at android.app.ContextImpl.createAppContext(ContextImpl.java:2462)
        at android.app.ContextImpl.createAppContext(ContextImpl.java:2454)
        at android.app.ActivityThread.handleBindApplication(ActivityThread.java:6353)
2020-05-19 01:26:31.400 14644-14644/? E/AndroidRuntime:     at android.app.ActivityThread.access$1300(ActivityThread.java:220)
        at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1860)
        		... 6 more

If there is any need for more info I'll be happy to provide it. Thanks in advance !

bug help wanted

opened by javiereugenio 14

[BUG] BrowserFetcher not working on Android

Hey, I'm new to this. Can you help me get the HTML of a whole page and, if you can, also help me parse it into objects?

I basically want to get all the nutritional information in these tables.

And I also need to make sure 100g is selected

Here is my code, but it's not working. I get the error "No static field INSTANCE...". I'm using your code @here

question

opened by p4ulor 12

[BUG] error while deploying on Android

Hi guys, I'm just trying this lib for scraping a webpage, but I got an error when trying to deploy it on my device:

"Space characters in SimpleName 'to be' are not allowed prior to DEX version 040"

bug

opened by glomowa 11
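The DEX error quoted in the report above complains about a method name that contains a space ('to be'). Kotlin allows such names when they are written in backticks, and the space is carried into the compiled method name. The following is a minimal sketch of the difference; both matcher functions are hypothetical stand-ins written for this illustration, not skrape.it's own implementation:

```kotlin
// Hypothetical matchers, defined only for this illustration.

// A backticked name may contain spaces; the compiled method is literally named "to be".
infix fun String.`to be`(expected: String): Boolean = this == expected

// A conventional identifier compiles to a method name without spaces.
infix fun String.toBe(expected: String): Boolean = this == expected

// Calls both variants; they behave identically on the JVM.
fun demo(): Pair<Boolean, Boolean> {
    val spaced = "welcome" `to be` "welcome"
    val plain = "welcome" toBe "welcome"
    return Pair(spaced, plain)
}
```

Both calls behave the same on a desktop JVM, which fits the other report in this list where the unit test ran fine locally; only the spaced name is the kind of identifier the DEX message objects to.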

Crash on Android api level 30

Describe the bug: Getting a crash on Android API level 30 due to the OkHttp version. Please update.

Code Sample

 suspend fun extract() {
        coroutineScope {
            val extracted = skrape(HttpFetcher) {
                request {
                    url = "SUPER_FANCY_URL"
                }

                extractIt<ScrapSource> {
                    status {
                        it.httpStatusCode = code
                        it.httpStatusMessage = message
                    }
                    htmlDocument {
                        it.allParagraphs = p { findAll { eachText }}
                        it.paragraph = p { findFirst { text }}
                        it.allLinks = a { findAll { eachHref }}
                    }
                }
            }
            _source.postValue(extracted)
        }
    }

Expected behavior: It should work on API level 30.

Additional context: I can create a PR with a similar change, following https://stackoverflow.com/questions/63917431/expected-android-api-level-21-but-was-30

bug

opened by cbedoy 10

[BUG] Unable to crawl the mvnrepository site

Crawling this website with skrape.it returns the wrong HTML, and using jsoup directly returns a 403, but with OkHttp everything works normally. Can this be solved?

My current solution is:

skrape(OkHttpFetcher) {
  request { url = "https://mvnrepository.com/artifact/kotlin" }
  println(scrape().responseBody)
}

object OkHttpFetcher : NonBlockingFetcher<Request> {
  override val requestBuilder: Request get() = Request()

  @Suppress("BlockingMethodInNonBlockingContext")
  override suspend fun fetch(request: Request): Result = OkHttpClient().newCall(
    okhttp3.Request.Builder()
      .url(request.url)
      .build()
  ).execute().let {
    val body = it.body!!
    Result(
      responseBody = body.string(),
      responseStatus = Result.Status(it.code, it.message),
      contentType = body.contentType()?.toString()?.replace(" ", ""),
      headers = it.headers.toMap(),
      cookies = emptyList(),
      baseUri = it.request.url.toString()
    )
  }
}

bug

opened by chachako 9

[FEATURE] Support for native image (Spring Native/GraalVM)

Is your feature request related to a problem? Please describe. I wasn't sure whether this is a bug report or a feature request, but it would be nice to have support for native image building, e.g. Spring Native. Currently, adding skrape.it as a dependency immediately fails the build. It might be connected to the usage of logback.xml here. I made a small reproduction of this problem with logback, and it turned out that having logback.xml on the classpath can fail the build.

Describe the solution you'd like: skrape.it supporting native image building.

Additional context

  - Additional action of task ':generateAot' was implemented by the Java lambda 'org.springframework.aot.gradle.SpringAotGradlePlugin$$Lambda$916/0x00000008012f5230'. Reason: Using Java lambdas is not supported as task inputs. Please refer to https://docs.gradle.org/7.5/userguide/validation_problems.html#implementation_unknown for more details about this problem.
I 11:19:13.722 [ld.ContextBootstrapContributor] Detected application class: pl.something.api.ApiApplication
I 11:19:13.724 [ld.ContextBootstrapContributor] Processing application context

org.springframework.boot.logging.LogbackHints$LogbackXmlException: Embedded logback.xml file is not supported yet with Spring Native, read the support section of the documentation for more details

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':generateAot'.
> Process 'command '/Users/user/.sdkman/candidates/java/22.2.r17-grl/bin/java'' finished with non-zero exit value 1

It might be related to https://github.com/spring-projects-experimental/spring-native/issues/625

feature request

opened by marceligrabowski 8

[QUESTION] Socket timeout on self signed SSL certs

Hello! I'm building a simple Android app that is going to scrape data from a specific website, and I get socket timeouts on request calls for HTTPS sites with self-signed certs. I tried a few different sites that have self-signed SSL certs and the same thing always happens.

I tried using the sslRelaxed option for the request function and playing around with different timeout values, but I can't make it work at all.

Could someone point me in the right direction as to what the problem could be, and/or give me some sample code showing how to do it with self-signed certs?

I haven't included sample code since it is super trivial and similar to the samples in the docs; I just found the skrape.it lib and am trying to evaluate it for an app. Thank you!

question

opened by nikoinist 8

[BUG] element extraction methods like `$`, el, element and elements not found

Describe the bug: The documentation for extracting data from a website is out of date and does not compile.

Code Sample

import it.skrape.extract
import it.skrape.selects.`$` // is not in the selects package and doesn't compile
import it.skrape.selects.el // is not in the selects package and doesn't compile
import it.skrape.skrape

data class MyScrapedData(
    val userName: String,
    val repositoryNames: List<String>
)

fun main() {
    val githubUserData = skrape {
        url = "https://github.com/skrapeit"

        extract {
            MyScrapedData(
                userName = el(".h-card .p-nickname").text(),
                repositoryNames = `$`("span.repo").map { it.text() }
            )
        }
    }
    println("${githubUserData.userName}'s repos are ${githubUserData.repositoryNames}")
}

Expected behavior: I've tried all but the most basic examples to learn the different components of scraping. selects.element and selects.elements are also used in the examples, but they don't appear to be in the code. This could very well be a problem with how I have or haven't configured IntelliJ.

bug

opened by pedramkeyani 7

[IMPROVEMENT] automate release process

Releasing a new version should be fully automated.

It should be triggered by pushing a particular tag to master (since GitHub Actions doesn't support a parametrized manual build trigger).

The following tags are allowed, and each triggers the corresponding release (bump and commit the project version, then publish to Maven Central):

major (will bump the major version - e.g. 2.11.1 --> 3.0.0 || 2.11.1-alpha1 --> 3.0.0)

feature (will bump the minor version - e.g. 2.11.1 --> 2.12.0 || 2.11.1-alpha1 --> 2.12.0)

bug (will bump the patch version - e.g. 2.11.1 --> 2.11.2 || 2.11.1-alpha1 --> 2.11.2)

alpha (will bump the alpha version - e.g. 2.11.1 --> 2.11.1-alpha1 || 2.11.1-alpha1 --> 2.11.1-alpha2)

beta (will bump the beta version - e.g. 2.11.1 --> 2.11.1-beta1 || 2.11.1-beta1 --> 2.11.1-beta2)

rc (will bump the rc version - e.g. 2.11.1 --> 2.11.1-rc1 || 2.11.1-rc1 --> 2.11.1-rc2)
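The bump rules listed above can be sketched as a small pure function. This is only an illustration under stated assumptions: the names Version, parseVersion and bump are hypothetical, and the behavior when switching between pre-release kinds (for example from alpha to beta) is not specified in the list, so this sketch assumes the counter restarts at 1.

```kotlin
// Hypothetical model of a version such as 2.11.1 or 2.11.1-alpha1.
data class Version(
    val major: Int,
    val minor: Int,
    val patch: Int,
    val pre: String? = null, // "alpha", "beta", "rc", or null for a plain release
    val preNum: Int = 0
) {
    override fun toString(): String =
        "$major.$minor.$patch" + (pre?.let { "-$it$preNum" } ?: "")
}

private val versionPattern = Regex("""(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta|rc)(\d+))?""")

fun parseVersion(text: String): Version {
    val match = versionPattern.matchEntire(text)
        ?: throw IllegalArgumentException("Not a valid version: $text")
    // Groups that did not participate in the match are returned as empty strings.
    val (major, minor, patch, pre, preNum) = match.destructured
    return Version(
        major.toInt(), minor.toInt(), patch.toInt(),
        pre.ifEmpty { null }, preNum.toIntOrNull() ?: 0
    )
}

// Computes the next version for one of the release tags named in the list above.
fun bump(current: String, tag: String): String {
    val v = parseVersion(current)
    val next = when (tag) {
        "major" -> Version(v.major + 1, 0, 0)
        "feature" -> Version(v.major, v.minor + 1, 0)
        "bug" -> Version(v.major, v.minor, v.patch + 1)
        "alpha", "beta", "rc" -> {
            // Assumption: the same pre-release kind increments its counter,
            // any other starting point begins at 1.
            val n = if (v.pre == tag) v.preNum + 1 else 1
            Version(v.major, v.minor, v.patch, tag, n)
        }
        else -> throw IllegalArgumentException("Unknown release tag: $tag")
    }
    return next.toString()
}
```

With these definitions, bump("2.11.1-alpha1", "bug") yields "2.11.2" and bump("2.11.1", "rc") yields "2.11.1-rc1", matching the examples given in the list.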

technical-improvement

opened by skrapeit 7

[BUG] Skrape.It causes a stack dump when trying to run it from an Android application

Describe the bug: I get a stack trace when trying to use skrape.it from within an Android app.

Minimal example, gradle.build and a stack trace are in this gist: https://gist.github.com/Git-Jiro/34e7f49d6abddfe825f53cc6df4d4a4d

Expected behavior: The scraper should not cause a stack trace. (My code works fine in a normal Java/Kotlin application.)

bug help wanted

opened by Git-Jiro 6

[QUESTION] Execution error on some android devices

Describe what you want to achieve: skrape.it works fine on almost all Android devices, but a small percentage throw an exception like this, and I don't know how to fix it.

Error report: I attach the error that appears.

E/System: Uncaught exception thrown by finalizer
E/System: java.lang.NullPointerException: Attempt to invoke interface method 'void org.apache.commons.logging.Log.debug(java.lang.Object)' on a null object reference
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.shutdown(PoolingNHttpClientConnectionManager.java:232)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.finalize(PoolingNHttpClientConnectionManager.java:213)
        at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:190)
        at java.lang.Daemons$FinalizerDaemon.run(Daemons.java:173)
        at java.lang.Thread.run(Thread.java:818)
E/AndroidRuntime: FATAL EXCEPTION: DefaultDispatcher-worker-1
    Process: dev.jmarin.bibliotecasugr, PID: 3567
    java.lang.NoSuchFieldError: No static field INSTANCE of type Lorg/apache/http/message/BasicLineFormatter; in class Lorg/apache/http/message/BasicLineFormatter; or its superclasses (declaration of 'org.apache.http.message.BasicLineFormatter' appears in /system/framework/ext.jar)

question

opened by jesusma3009 0

[BUG] No static field INSTANCE of type Lorg/apache/http/message/BasicLineFormatter

skrapeit-1.3.0-alpha.1

java.lang.NoSuchFieldError: No static field INSTANCE of type Lorg/apache/http/message/BasicLineFormatter; in class Lorg/apache/http/message/BasicLineFormatter; or its superclasses (declaration of 'org.apache.http.message.BasicLineFormatter' appears in /system/framework/org.apache.http.legacy.jar)
		at org.apache.http.impl.nio.codecs.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:53)
		at org.apache.http.impl.nio.codecs.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:57)
		at org.apache.http.impl.nio.codecs.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:47)
		at org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionFactory.<init>(ManagedNHttpClientConnectionFactory.java:75)
		at org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionFactory.<init>(ManagedNHttpClientConnectionFactory.java:83)
		at org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionFactory.<clinit>(ManagedNHttpClientConnectionFactory.java:64)
		at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:688)
		at io.ktor.client.engine.apache.ApacheEngine.prepareClient(ApacheEngine.kt:78)
		at io.ktor.client.engine.apache.ApacheEngine.<init>(ApacheEngine.kt:33)
		at io.ktor.client.engine.apache.Apache.create(Apache.kt:19)
		at io.ktor.client.HttpClientKt.HttpClient(HttpClient.kt:41)
		at it.skrape.fetcher.HttpFetcher.configuredClient(HttpFetcher.kt:28)
		at it.skrape.fetcher.HttpFetcher.fetch(HttpFetcher.kt:24)
		at it.skrape.fetcher.HttpFetcher.fetch(HttpFetcher.kt:20)
		at it.skrape.fetcher.FetcherConverter.fetch(Scraper.kt:30)
		at it.skrape.fetcher.Scraper.scrape(Scraper.kt:17)
		at it.skrape.fetcher.ScraperKt.response(Scraper.kt:87)
		at video.downloader.saver.story.helpers.HtmlDynamicLoader$extract$extracted$1.invokeSuspend(HtmlDynamicLoader.kt:19)
		at video.downloader.saver.story.helpers.HtmlDynamicLoader$extract$extracted$1.invoke(Unknown Source:8)
		at video.downloader.saver.story.helpers.HtmlDynamicLoader$extract$extracted$1.invoke(Unknown Source:4)
		at it.skrape.fetcher.ScraperKt$skrape$1.invokeSuspend(Scraper.kt:43)
		at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
		at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
		at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:279)
		at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
		at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
		at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source:1)
		at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
		at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source:1)
		at it.skrape.fetcher.ScraperKt.skrape(Scraper.kt:42)
		at video.downloader.saver.story.helpers.HtmlDynamicLoader.extract(HtmlDynamicLoader.kt:14)
		at video.downloader.saver.story.ui.fragment.browser.BrowserTabFragment$12.doInUIThread(BrowserTabFragment.java:1030)
		at com.arasthel.asyncjob.AsyncJob$1.run(AsyncJob.java:46)
		at android.os.Handler.handleCallback(Handler.java:938)
		at android.os.Handler.dispatchMessage(Handler.java:99)
		at android.os.Looper.loopOnce(Looper.java:226)
		at android.os.Looper.loop(Looper.java:313)
		at android.app.ActivityThread.main(ActivityThread.java:8751)
		at java.lang.reflect.Method.invoke(Native Method)
		at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:571)
		at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1135)

bug

opened by nikitoSha 0

Multiplatform
I've been working over the last few months to create a multiplatform version of skrape.it. It's somewhat in line with #196, but I'm a bit further along in some areas, which is why I wanted to get this out. So far I've converted the build scripts to multiplatform and implemented some modules in JS. This is still very much WIP and I intend to keep working on it. I'll update the pull request as I get further along and improve the code.

What's done so far:

Converted buildscripts to multiplatform

Added and implemented JS-Targets for the following modules:

:dsl

:fetcher:base-fetcher

:html-parser

:test-utils

Converted the multiplatform modules to use the robstoll/atrium test framework

What still needs to be done:

Convert the rest of the modules

Decide what to do with the different fetchers. Are they really necessary?

Fix up the Kover reports (should be pretty much the same as #196)

Clean up the code and document it

Migrate the JS Target to the new IR compiler (waiting on atrium for that)

Other notable changes:

Kotlin version was bumped to 1.7.10 and a few other dependencies were updated

Disabled build caching as well as RepositoriesMode.PREFER_SETTINGS since those can unfortunately mess up the builds in multiplatform
opened by McDjuady 6

[BUG] Crash on Android when using R8

Describe the bug When R8 is enabled, I get an ExceptionInInitializerError. Here is the stack trace:

java.lang.ExceptionInInitializerError
    at v7.u.b(SourceFile:3)
    at x4.f.b(Unknown Source:2)
    at it.skrape.fetcher.ScraperKt.a(SourceFile:5)
    at com.moefactory.bettermiuiexpress.repository.ExpressRepository$queryExpressDetailsFromCaiNiaoActual$2.t(SourceFile:6)
    at com.moefactory.bettermiuiexpress.repository.ExpressRepository$queryExpressDetailsFromCaiNiaoActual$2.m(SourceFile:2)
    at it.skrape.fetcher.ScraperKt$skrape$1.t(SourceFile:4)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.k(SourceFile:3)
    at v7.y.run(SourceFile:18)
    at kotlinx.coroutines.c.A(SourceFile:21)
    at v7.u.Y(SourceFile:14)
    at it.skrape.fetcher.ScraperKt.b(Unknown Source:8)
    at com.moefactory.bettermiuiexpress.repository.ExpressRepository$queryExpressDetailsFromCaiNiao$1.t(SourceFile:5)
    at com.moefactory.bettermiuiexpress.repository.ExpressRepository$queryExpressDetailsFromCaiNiao$1.m(SourceFile:2)
    at androidx.lifecycle.BlockRunner$maybeRun$1.t(SourceFile:9)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.k(SourceFile:3)
    at v7.y.run(SourceFile:18)
    at y7.e.run(SourceFile:2)
    at z7.h.run(SourceFile:1)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$a.run(SourceFile:15)
    Suppressed: kotlinx.coroutines.DiagnosticCoroutineContextException: [w0{Cancelling}@6e8eabc, Dispatchers.IO]
Caused by: org.apache.commons.logging.LogConfigurationException: java.lang.ClassNotFoundException: Didn't find class "org.apache.commons.logging.impl.LogFactoryImpl" on path: DexPathList[[zip file "/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/base.apk"],nativeLibraryDirectories=[/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/lib/arm64, /system/lib64, /system_ext/lib64]] (Caused by java.lang.ClassNotFoundException: Didn't find class "org.apache.commons.logging.impl.LogFactoryImpl" on path: DexPathList[[zip file "/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/base.apk"],nativeLibraryDirectories=[/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/lib/arm64, /system/lib64, /system_ext/lib64]])
    at n9.b.run(SourceFile:48)
    at java.security.AccessController.doPrivileged(AccessController.java:43)
    at n9.d.l(SourceFile:1)
    at n9.d.c(SourceFile:74)
    at n9.d.f(Unknown Source:0)
    at com.gargoylesoftware.htmlunit.WebClient.<clinit>(SourceFile:1)
    ... 19 more
Caused by: java.lang.ClassNotFoundException: Didn't find class "org.apache.commons.logging.impl.LogFactoryImpl" on path: DexPathList[[zip file "/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/base.apk"],nativeLibraryDirectories=[/data/app/~~s6TmJS25sPj8Sk_G1Isbhg==/com.moefactory.bettermiuiexpress-cGy7sMs8jCsggIb5mjNEJA==/lib/arm64, /system/lib64, /system_ext/lib64]]
    at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:218)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
    at n9.b.run(SourceFile:2)
    ... 24 more

It seems that some classes are renamed by R8, which causes the initialization to fail.

Code Sample

skrape(BrowserFetcher) {
    request {
        url {
            protocol = UrlBuilder.Protocol.HTTPS
            host = "a.example.com"
            port = -1
            path = "/path/to/query"
        }
        userAgent = "Mozilla/5.0 (Linux; Android 12; M2102K1C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Mobile Safari/537.36 EdgA/105.0.1343.48"
        sslRelaxed = true
    }

    response {
        val jDoc = Jsoup.parse(responseBody)

        // Parse using Jsoup
    }
}

Expected behavior skrape.it should run normally when R8 is enabled.

Additional context Maybe adding some ProGuard rules would help?
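
A minimal sketch of such rules, assuming the failure comes from R8 removing or renaming the Commons Logging classes that are looked up by name at runtime (as the ClassNotFoundException for org.apache.commons.logging.impl.LogFactoryImpl in the trace above suggests). This is an untested assumption, not an official skrape.it or HtmlUnit configuration, and other reflectively loaded classes may need similar rules:

```
# Assumption: Commons Logging resolves its implementation classes by name via reflection,
# so keep them (and their members) from being removed or renamed by R8.
-keep class org.apache.commons.logging.** { *; }
```

The rule would go into the app module's ProGuard rules file. It is deliberately broad; narrowing it to org.apache.commons.logging.impl.** may be enough if only the implementation classes are affected.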

bug

opened by Robotxm 2

Three vulnerabilities detected

Hello, Gradle informs me of three vulnerabilities coming from jsoup and xalan:

https://devhub.checkmarx.com/cve-details/CVE-2021-37714/
https://devhub.checkmarx.com/cve-details/CVE-2022-36033/
https://devhub.checkmarx.com/cve-details/CVE-2022-34169/

Have these libraries been updated, or will they be?

Thanks
technical-improvement

opened by Nico-GS 0
Initial Kotlin Multiplatform setup
Initial groundwork for Kotlin Multiplatform #192

Depends on #194

I was expecting this to be a lot more difficult! I intended to do just one module, but I found that they were all very easy to migrate. html-parser was the most involved.

That said, I can't run most of the tests (I'm on Windows), so I could have broken some stuff. And the really hard work of actually implementing JS and/or Native code can be done later.

WIP

[x] Migrate test-utils

[x] Update Kover config (or disable Kover if this is too difficult)

[x] ~Configure Maven publishing buildSrc plugin (shouldn't be too much work to do, I can copy & paste some existing config that works)~ I've briefly tested this locally and it seems to work as expected.

[ ] Verify that the new publications are correct and work. This means checking the POMs are correct and expose the right API dependencies.

[x] ~jsExecution feature variant - I can't find an alternative for this with Kotlin Multiplatform. https://youtrack.jetbrains.com/issue/KT-33432.~ I've simply added the 'maven publishing' config to the browser-fetcher project. I think that will achieve the same result.

Notes

Bumped Kotlin to 1.7.10, the language level to 1.7, and - ⚠ breaking change - the API level to 1.5 (from 1.4). Kotlin 1.7 has some nice improvements for Kotlin Multiplatform, and level 1.4 is deprecated, so this seemed like a good time to bump it.

JVM only

All tests are still JUnit

I didn't try migrating any code, just moving things into the correct source sets

The real work was creating expect/actual definitions - so check them out and see if they make sense. The expect definitions are essentially like interfaces that the platform code will implement.

The 'JS browser execution' feature probably won't work - I disabled the Gradle option for it

The HttpFetcher and BrowserFetcher objects are pretty redundant, as they don't significantly extend from the BlockingFetcher interface. I think you can refactor the common code to only rely on the interface.

Publishing and releasing are still TODO
opened by aSemy 3