analyze json data with sed help

You are currently reading a thread in /g/ - Technology

Thread replies: 19
Thread images: 8
File: sed error.png (212 KB, 1004x674)
I need some help with sed. I have a txt file with some sample json data (https://github.com/jmportilla/Reddit-Data-Science-Project/blob/master/my_sample). I want to use sed to find the word "roster" (line 82 in the data set), then locate the author name (in this case "Dmagers") and output it to a file output.text, then jump to the next line and continue looking for more instances of the word "roster" and repeat the above (locate and write out author name) until it reaches the end.

If I run the following in a terminal:

cat /home/fsl/Desktop/data.txt|sed '/body":.*roster[&"]*/!d;s/.*name":"\([^"]*\).*/\1/

I get the error "sed: No Match" despite "roster" being there in line 82 in the data set. Can anyone help me figure out why it isn't working?

Pic related
>>
File: bIPgtan.jpg (80 KB, 720x960)
>>51566580
bump
>>
File: 1392046518533.png (664 KB, 1280x720)
Pls /g/, are you there?

You're my only hope.
>>
File: 1386821571441.png (493 KB, 708x664)
help help
>>
File: 1385007824104.png (27 KB, 638x547)
>>51566750
>>
>>51566580
not sed, but python
this works
with open('C:\\Users\\faggot\\Downloads\\my_sample', 'rb') as infile:
    for line in infile:  # one JSON object per line; never loads the whole file
        if not line.strip():
            continue  # skip blank lines
        j = json.loads(line)
        if 'roster' in j['body']:
            print('AUTHOR:\n%s\nPOST:\n%s\n' % (j['author'], j['body']))
>>
File: xjfdnm.png (10 KB, 790x196)
>>51566848
picture
I forgot, you also need to
import json
>>
>>51566865
>>51566848
Thanks, you are a saint! Trying it now!
>>
File: ootmgc.png (7 KB, 1098x130)
>>51566980
my bad, you said you wanted to output to output.text (save to a file)
this should work, with no changes needed
just type 'python' into your terminal (no quotes), and then copy/paste this
import json

with open('/home/fsl/Desktop/data.txt', 'rb') as infile, open('output.text', 'w') as outfile:
    for line in infile:  # one JSON object per line; never loads the whole file
        if not line.strip():
            continue  # skip blank lines
        j = json.loads(line)
        if 'roster' in j['body']:
            outfile.write('AUTHOR:\n%s\nPOST:\n%s\n' % (j['author'], j['body']))

this should then produce a text file like this
>>
>>51567045
How fast would you say this is compared to this:

sed -n '/roster/{s/.*,"author":"\([^"]*\)".*/\1/;p;}' /home/fsl/Desktop/data.txt

I have ~1.7 billion lines to run through.
>>
>>51567198
I honestly have no idea, but I'd assume that the sed version would be significantly faster
python is extremely slow compared to C (which is what sed is written in)
but either way, sed is the wrong tool for the job
I see this:
>\([^"]*\)
which is a regular expression
don't extract data from json using a regex, use a json parser
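
To make the point concrete, here is a small sketch of how a regex in the style of the sed expression above can miss an author that a parser finds. The record and the name example_user are made up for illustration and are not from the real data set; the pattern is a Python port of the sed one, so treat it as an approximation rather than the exact same engine.

```python
import json
import re

# Made-up record for illustration only (not from the real data set).
record = {"body": 'the "roster" is out', "author": "example_user"}

# The same data serialized three ways; all three are equivalent JSON.
compact = json.dumps(record, separators=(",", ":"))   # no spaces, body first
spaced = json.dumps(record)                           # ", " and ": " separators
author_first = json.dumps(
    {"author": record["author"], "body": record["body"]}, separators=(",", ":")
)

# Python port of the sed pattern: whatever follows ,"author":" up to the next quote.
pattern = re.compile(r'.*,"author":"([^"]*)".*')

def author_by_regex(text):
    m = pattern.match(text)
    return m.group(1) if m else None

def author_by_parser(text):
    return json.loads(text)["author"]

for name, text in [("compact", compact), ("spaced", spaced), ("author_first", author_first)]:
    print(name, author_by_regex(text), author_by_parser(text))
```

The regex only finds the author in the compact form, because it depends on the exact spacing and on "author" not being the first key. The parser gives the same answer for all three.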
>>
>>51567235
>don't extract data from json using a regex, use a json parser

Do you think you could help me get up and running with a json parser? I'm completely green as far as how to do that.
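
A minimal starting point, assuming Python 3 and its built-in json module (nothing to install). The function name write_authors is made up for this sketch; it writes only the author name, one per line, for every comment whose body contains the word, and it reads the input one line at a time so the whole file is never held in memory.

```python
import json

def write_authors(in_path, out_path, word="roster"):
    """Write the author of each comment whose body contains `word`, one per line.

    Returns the number of authors written. Assumes one JSON object per line
    with "author" and "body" keys, as in the sample file from this thread.
    """
    count = 0
    with open(in_path, encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            if word in comment.get("body", ""):
                outfile.write(comment["author"] + "\n")
                count += 1
    return count
```

With the paths from this thread, the call would be write_authors('/home/fsl/Desktop/data.txt', 'output.text').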
>>
regex may be the fastest way though
curl 'http://sprunge.us/HJfO' -o authorname && chmod +x authorname

then
./authorname roster
will output to a file output.txt
>>
>>51567275
>Do you think you could help me get up and running with a json parser? I'm completely green as far as how to do that.
I'm sorry, but I don't think I'll be much help there
python is great for throwing a working prototype together really quickly, not so much for performance
>>
>>51567374
>then
>./authorname roster
>will output to a file output.txt

When I try that I get: Unmatched ".
>>
>>51566580
Out of curiosity, what is this for, OP?
>>
File: scrot.png (11 KB, 682x182)
>>51567658
werks on my machine
>>
>>51566580
Why would you ever use sed to parse json data?

So really, what you have here is multiple JSON objects, one per line. Taken as a whole, the file is not valid JSON, since JSON allows only one top-level value, but each line parses fine on its own. The various python examples should serve well.
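
A quick sketch of that distinction, using two made-up records (user_a and user_b are placeholders, not real data): parsing the whole text in one go fails, while parsing it line by line works.

```python
import json

# Two made-up records, one JSON object per line, like the sample file.
text = (
    '{"author": "user_a", "body": "first"}\n'
    '{"author": "user_b", "body": "second"}\n'
)

def parse_whole(s):
    """Try to parse the whole text as a single JSON document."""
    try:
        return json.loads(s)
    except json.JSONDecodeError as err:
        # A second top-level value after the first one is rejected.
        return "error: " + err.msg

def parse_lines(s):
    """Parse each non-blank line as its own JSON document."""
    return [json.loads(line) for line in s.splitlines() if line.strip()]

print(parse_whole(text))
print([r["author"] for r in parse_lines(text)])
```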

>>51567198
As for performance, you should try it. I think you'll find using python isn't that slow for this task... if it is, maybe you should use jq(1).
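
One way to try it, sketched under a few assumptions: the search word is plain ASCII (so it looks the same in the raw line as in the decoded body), and the helper names authors_mentioning and timed_count are made up here. The idea is a cheap substring check on the raw line first, so that only lines that might match are handed to the JSON parser.

```python
import json
import time

def authors_mentioning(lines, word="roster"):
    """Yield the author of each comment whose body contains `word`."""
    for line in lines:
        if word not in line:
            continue  # cheap check on the raw text; most lines are never parsed
        comment = json.loads(line)
        if word in comment.get("body", ""):  # confirm the hit was in the body
            yield comment["author"]

def timed_count(lines, word="roster"):
    """Return (number of matching authors, elapsed seconds)."""
    start = time.perf_counter()
    n = sum(1 for _ in authors_mentioning(lines, word))
    return n, time.perf_counter() - start
```

Passing an open file works too, for example timed_count(open('/home/fsl/Desktop/data.txt', encoding='utf-8')), so you can time it on a slice of the real data before committing to the full run.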
>>
python -m json.tool shit.json | sed "your sed"
